>> Bryan Parno: Good morning everybody and thank you for joining us. We're very fortunate
to have Professor Rob Johnson visiting us here. I know him from when he was a grad student at
Berkeley. The very first conference I went to I was trying to save money and he had a spare bed
in the room he was renting for the conference, so he looked out for me when I was a very
young grad student. It's great to have him here. I'm familiar with his work in the security world
where he has done some very cool stuff on language-based security looking for vulnerabilities
in the Linux kernel. But lately he's been looking at optimized data structures, typically for
high-performance systems, and in this case file systems. Rob.
>> Rob Johnson: Thanks Bryan. Good morning and thanks for having me here. This is a talk
based on BetrFS, which is a project with a lot of authors from a lot of different institutions. I'm
actually not going to name everybody, but it's a collaboration between Stony Brook, which is
where I'm from now; Tokutek Incorporated, a startup founded by some of my co-authors; Rutgers;
and MIT. This is an expanded version of a talk from FAST, Linux FAST. My goal here is to
communicate information, so feel free to ask me questions and stop me at any time because I
really want to teach you about what optimized data structures are if you don't already know
them, and how they can benefit file system design, but also how they impact file system and
system design. They are kind of a radical departure from the data structures that we are used
to. Let me begin by explaining the motivation. If you take an off-the-shelf system, something
like ext4, which is kind of the default file system in most Linux installations, and you run a
benchmark where you just do a gigabyte of sequential writes, and you run it on a spinning
magnetic disk where the disk is rated for 125 megabytes per second, you'll see that ext4 gets a
throughput of about 104 megabytes per second. For a sequential write load ext4 does great.
It's getting most of the bandwidth out of the disk. On the other hand, if you do a random
write workload, same file system, same disk, but you're doing small random writes over a 1
gigabyte file, the throughput of ext4 is only about 1.5 megabytes per second. It's hardly
getting any of the performance out of the disk. It's really wasting a lot of potential
performance; that little sliver is what it is actually getting. What's going on here? You've
probably guessed that the real problem is that these random writes induce a lot of seeks by the
disk. If we do a back-of-the-envelope calculation, the average seek time on the disk is
around 10 or 11 milliseconds. The OS is writing data at 4 kB granularity, so there is about one
seek for every 4 kB. If you work that out, that would imply about half a megabyte of
bandwidth for random writes like this. We did a little bit better, but we weren't seeking over
the entire disk. We were seeking over a 1 gigabyte file, and presumably Linux did a little bit of
bio scheduling to try to get things in some rough order. It's not surprising that it did a little bit
better. And the disk might've done some more [indiscernible]
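As a rough back-of-the-envelope check, using the round numbers above (about a 10 ms seek and
one 4 kB write per seek):

    \[
    \text{random-write throughput} \approx \frac{4\ \text{kB}}{10\ \text{ms}} = 400\ \text{kB/s} \approx 0.4\ \text{MB/s},
    \]

which is the "about half a megabyte" figure, compared with the disk's 125 MB/s of sequential
bandwidth.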
>>: [indiscernible]
>> Rob Johnson: Okay. The disk definitely did some…
>>: You did [indiscernible] rotation time too, right? [indiscernible]
>> Rob Johnson: Yeah. Still, this is a problem if you are doing random seeks for every I/O.
There are other file systems that try to overcome this. The most well-known class of file
systems is log-structured file systems, and what they do is whenever you write new data
they just append it to a log. Appending to a log doesn't require any seeks, and so you can write
data really, really fast. The problem is that over time, as you write to different locations
of a file, the file gets kind of scattered over the disk and it's not stored in any meaningful
order on disk. Then when you go to read back the file, you have to do all of those seeks
and they can be very slow. Log-structured file systems have a different performance trade-off,
but they still represent a trade-off between these two types of operations, random writes and
sequential reads. What BetrFS does is it uses a new class of data structures called write-optimized
indices, and write-optimized indices can take in data at a very high rate, so they are not really
seek bound. They are more bandwidth bound, but they do maintain logical locality, so things
that are logically consecutive are stored more or less physically consecutively on the disk.
Then when you go to read them back you can read them back very fast. One of the main
contributions of our BetrFS is a schema for mapping file system operations down to operations
that the write optimized index can perform efficiently. We're trying to extract as much of the
performance from the write optimized index as we can and carry it over to the file system.
Another important thing we did was we implemented all of this inside the Linux kernel rather
than using FUSE or some other kind of external interface between the kernel and the file system.
We found there were some opportunities for redesigning the interaction between the file
system and the rest of the kernel to get some additional performance and I'll tell you a little bit
about that. We're not the first to do a file system based on a write optimized index. There've
been several others that have shown that write optimization can be used to speed up some
operations in file systems. Our goal really is to take things as far as we can, to take this write
optimization to its logical conclusion in file systems and to see how much we can get done. Also,
this prior work was all in user space and as I mentioned we want to understand how write
optimization affects the system design and the design of the higher-level file system. We
wanted to do things in the kernel. That's the overview. I want to dive down now into what a
write-optimized data structure is and how you can build one and what its performance
characteristics are. For that I just want to introduce a simple performance model that we'll use
to think about how well different data structures perform. This is called the disk access
machine model. It's also sometimes called the external memory model, but it should hopefully
look pretty intuitive to you. The computer has RAM of size M, but that's not really going to
come up in this talk. Then you have a disk of however big size you need. We don't care about
the size of the disk. In order to operate on any data it has to be brought from the disk into RAM
and data is moved in blocks of size B and then if RAM gets full you have to shuffle something
back out to disk. The data size is going to be N and so when you see Ns later you'll know what
they are referring to, the total number of data items. All we care about in this model is how
many block transfers we perform during some operation, like a lookup or an insert of a new
item. We don't care about computation in memory at all. We're going to completely ignore
that. We're going to just set that aside and maybe we'll come back and think about it later.
>>: Are all the blocks going to be the same size?
>> Rob Johnson: Yes. This model simplifies lots of stuff. All the blocks are the same size.
>>: So metadata and data in particular, so you are just counting seeks, really?
>> Rob Johnson: Yes. We are just counting seeks. We don't care about whether two blocks are
adjacent or not. Every seek is a seek. We just count seeks. There is a lot of stuff that this
model throws away. We're going to add back in bandwidth a little bit later in the talk. It's a
very helpful model for understanding performance of data structures. As a warm-up, let's look
at B trees. Who here knows B trees? Everyone knows B trees, okay. Let's do it just to make
sure that we've got the terminology on the page real quick. In a capital-B B-tree you've got nodes of
size B, so every time you access a block you are just going to read in the whole node, and the
entire space of the node is used to store pointers to the children and the pivot keys that tell you
which child you need to go to next. The fanout is B, or roughly B. The height of the tree is going
to be log base B of N, and what that means is that if you want to do a query you have to do log
base B of N block transfers to get to the leaf, and if you want to do an insert of a new item,
basically, an insert is a query for an item to get to the leaf that it would go into, then put it in
the leaf and then go write that leaf back. So it's more or less just the same cost as a query.
Maybe it's even a B+ tree. I don't know. We're not going to use them so I wasn't too careful
about it. This data structure, by the way, you guys all know this so I am preaching to the choir
maybe. This has been the data structure for databases for like 40 years. There are file systems
based on this, like btrfs in Linux.
>> Like NTFS.
>> Rob Johnson: NTFS is a B tree file system?
>>: The directories are B-trees. It's extents and B-trees. That's been true for at least 20 years.
>> Rob Johnson: So it's a very venerable data structure, but it turns out it's not the best you
can do. There is an optimal trade-off curve between the I/O cost of inserts and queries in an
on-disk data structure. This was proven back in 2003. The curve is parameterized by an epsilon,
and I'm not going to ask you to read or understand this. There will be a quiz on this part at the
end of the talk. But it turns out that B-trees are basically at one end of the curve. If you plug in
epsilon equals one you get log base B of N for both operations. The point that I really care
about is this one, because if you plug in epsilon equals one half then you are still going to get log
base B of N for queries. What you actually get is basically log base square root of B of N, but
log base square root of B is just twice log base B, so this is still order log base B. But the big
doozy is that you get a square root of B in the denominator for the insert cost. And if you think
about what a typical value for B might be, it could be something like 1000. The square root of B
is roughly like 30, and so what this means is that you can build a data structure that can ingest
data an order of magnitude or two orders of magnitude faster than a B-tree, but its query
performance is essentially the same. Yes?
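In symbols, the trade-off curve being described here (a sketch of the bound as it is usually
stated; the exact constants aren't spelled out in the talk) is, for a parameter \(0 < \varepsilon \le 1\):

    \[
    \text{insert: } O\!\left(\frac{\log_B N}{\varepsilon\, B^{1-\varepsilon}}\right) \text{ I/Os}, \qquad
    \text{point query: } O\!\left(\frac{\log_B N}{\varepsilon}\right) \text{ I/Os}.
    \]

Plugging in \(\varepsilon = 1\) gives the B-tree point, \(O(\log_B N)\) for both; plugging in
\(\varepsilon = 1/2\) gives queries in \(O(2\log_B N) = O(\log_B N)\) and inserts in
\(O(\log_B N / \sqrt{B})\).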
>>: What about the constant factor?
>> Rob Johnson: There really aren't any hidden constant factors here except for maybe that
factor of two because the tree will be twice as high. However, I wouldn't get too hung up on
that.
>>: [indiscernible]
>> Rob Johnson: I think there might really be an extra factor of two that the other one is not
paying. What my co-authors have told me they've seen in doing their startup is that when they go
to customer sites, oftentimes it's not really a matter of a trade-off between the queries and the
inserts in maintaining an index. For many people's workloads the data comes in so fast that the
only index that they can maintain is a timestamp index. When it comes time to query they have
no index. This makes it possible to have an index at all. So it's not really a trade-off between
this and this with a factor of two that's missing. It's this and well, I didn't have an index; I had to
scan the data. Here's a cartoony version of this trade-off curve. If inserts being slow are here
and fast over here, and queries being slow versus fast, then a B-tree has fast queries but, as far as
this trade-off curve is concerned, kind of slow inserts. Fortunately, the shape of the
curve means there's a huge opportunity for us to slide our data structure along in this direction
and get dramatic speedups in inserts with only a modest slowdown in queries. That's the
target opportunity for optimized indexes. Questions? Everything so far so good?
>>: On this slow and fast, slow is log slow on the one axis and linear slow on the other one,
right? If you're just logging and you don't have any index then the queries are linear slow.
>> Rob Johnson: Actually, I think logging, technically logging is not on the range of the curve
that I showed you. Let me show you an example of a data structure that moves along this curve
to the point that we were looking at. This data structure is called the B to the epsilon tree,
which is where you can guess we got the name BetrFS. It's very similar to a B-tree, but we're
going to use the space in our nodes of size B differently. We're only going to allocate a small
amount of that space to pivots and pointers to children, so the fanout of the tree is going to be
square root of B instead of B. That means that the height of the tree is going to be log base square
root of B instead of log base B. Then what we're going to do with the rest of the space in the node
is use it as a buffer for newly inserted items. Whenever we insert an item into the
tree, we simply stick it in the buffer of the root of the tree. La dee da, I've been inserting
items. This is really fast. I don't have to do any log-base-B type of work; if I've got the root of the
tree cached in memory this is basically no I/O at all. I'm just adding things to the buffer. Once the
buffer fills up I just flush all of the items down to the buffers of the children, so things only move
one level down at a time.
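To make the buffering concrete, here is a minimal in-memory sketch of a node and the insert/flush
logic just described. This is illustrative only, not the actual BetrFS/TokuDB code; the names and
the buffer capacity are made up.

    #include <map>
    #include <memory>
    #include <string>
    #include <vector>

    // Sketch of a B^epsilon-tree node: a small set of pivots (fanout ~ sqrt(B))
    // plus a buffer that uses the rest of the node's space for pending inserts.
    struct Node {
        bool leaf = false;
        std::vector<std::string> pivots;              // routing keys, about sqrt(B) of them
        std::vector<std::unique_ptr<Node>> children;  // children.size() == pivots.size() + 1
        std::map<std::string, std::string> buffer;    // pending (key, value) items; leaves hold the records here

        static constexpr std::size_t kBufferCapacity = 1024;  // stands in for "the rest of the node"

        void insert(const std::string& key, const std::string& value) {
            buffer[key] = value;                      // blind insert: no root-to-leaf traversal needed
            if (!leaf && buffer.size() >= kBufferCapacity) flush();
        }

        // Flushing empties the buffer by moving every pending item one level down,
        // into the buffer of whichever child it routes to.
        void flush() {
            for (auto& [key, value] : buffer) child_for(key)->insert(key, value);
            buffer.clear();
        }

        Node* child_for(const std::string& key) {
            std::size_t i = 0;
            while (i < pivots.size() && key >= pivots[i]) ++i;
            return children[i].get();
        }
    };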
Let's think about what the cost of different operations is going to be. Does everyone understand
the data structure so far? Items will always live on the path from the root to the leaf that they
belong in. We can find an item by simply doing the normal search algorithm. The only extra cost is
that we have to look inside the buffers to see if maybe the item hasn't gotten all the way to a leaf
yet. We are ignoring computation, so that's basically free. We don't count that. A query still
basically takes log base square root of B of N I/Os: it's just a root-to-leaf path traversal, and the
height is log base square root of B of N.
>>: [indiscernible] almost always hit one of the buffers.
>> Rob Johnson: I'm not sure. It depends on the workload.
>>: [indiscernible] you've got a lot of space in the buffers. This data structure, you are going
to want to put most of the space in the buffers rather than in the leaves. [indiscernible] happen
anyway.
>> Rob Johnson: Let's suppose the square root of B is 30. Then I think you'll have this space
divided by 30 at this level and then that divided by 30 again at the next level. I would estimate
about 3 percent of the space.
>>: You would have the leaf be size B. The row above it, so if the square root of B is 30,
then B is 900, so that means you've got 870 at each of the next-to-leaf nodes and then 900 in the
bottom, so about half of your space is one level up.
>>: There's 30 times as many leaves as there are [indiscernible]
>> Rob Johnson: So I think the total space in the top of the tree, with fanout square root of B,
would be about 3 percent of the total.
>>: Right, so it really doesn't matter.
>> Rob Johnson: Most of your data is going to be in the leaves. Yes?
>>: It seems like it's making access quick to the data near the root, but isn't that data in your
RAM anyway and it doesn't need to be accessed quickly? I don't understand.
>> Rob Johnson: We are avoiding the writes. Queries are the same. Your queries still go all the
way from root to leaf. In fact, as you're going to see we're going to end up going all the way
from root to leaf anyway.
>>: I see.
>>: So on a write, you don't have to write.
>> Rob Johnson: I really appreciate the question.
>>: I appreciate the answer.
>> Rob Johnson: Let's analyze the insert cost. We're going to do an amortized analysis. The
number of I/Os we need to perform a flush is: we need to load square root of B things, so that's
square root of B I/Os. You could count it as square root of B plus 1; just call it square root of B.
And then how many items do we get to move as part of this operation? We move B minus
square root of B. Let's just call that B, because square root of B is nothing compared to B.
The amortized I/O required to move one item down one level is square root of B I/Os divided
by B things that come down. So it's one over square root of B I/Os to move one thing down.
And everything needs to move down that many levels. So the amortized I/O cost to get
everything inserted into the database is this amortized I/O per element to move down one step,
times the number of times an element needs to move down one step, and if you simplify
that you just get log base B of N divided by the square root of B. This data structure
is super duper fast at handling inserts of new data. And lookups are basically the same as in a
B-tree.
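Written out as a formula, that amortized argument is:

    \[
    \underbrace{\frac{\sqrt{B}\ \text{I/Os per flush}}{\approx B\ \text{items moved per flush}}}_{\text{amortized I/O to move one item down one level}}
    \;\times\; \underbrace{\log_{\sqrt{B}} N}_{\text{levels to descend}}
    \;=\; \frac{2\log_B N}{\sqrt{B}}
    \;=\; O\!\left(\frac{\log_B N}{\sqrt{B}}\right)\ \text{I/Os per insert.}
    \]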
I just noticed a question online. This is probably a little bit of an old question. They asked whether
I can precisely define sequential write and random write with an example or a scenario. Sorry, I
didn't see the question sooner. A sequential write would typically be something along the lines of:
you just call write with a 1 gigabyte buffer and say, here, write all this data to disk, at
the application level; or you might do a bunch of small writes, maybe you're writing 64 kB at a
time, but each 64 kB comes right after the last 64 kB in the file. A random write you can think of
as: we run a random number generator that tells us the offset within the file that we want to
write, and then we write maybe a byte of data or eight bytes of data, some small amount of
data, to that offset in the file. Sequential writes occur all the time. You can think about any sort
of multimedia application, streaming data from a video camera or something like that. Random
writes occur often in databases where data is arriving and needs to be added into an index.
>>: This works very well if the workload is uniformly distributed over the [indiscernible]. You
push square root of B down when you need to flush. Do you need to consider uneven
scenarios?
>> Rob Johnson: Actually, you can do even better if you have bias. There are many slight
tweaks on this data structure. I kind of gave you a pedagogically simple one, where when
a node gets full you flush to all of the children. You could be greedy and say, I'm going to figure out
which of my children is going to get the most of the stuff and I'm only going to flush to that
child. You still get the same amortized cost, but what that would mean is if someone was
hammering on a particular region of the database, just inserting a bunch of stuff into one leaf, then
this buffer would be completely full of things that were all going to the same child and so your
amortized cost would be one over B instead of one over square root of B.
>>: So what you're saying is that in your system you would flush in a biased way depending on the
activity?
>>: [indiscernible]
>> Rob Johnson: Yes, you can do biased flushing. In fact, one thing we have been working on, I
will get to your question in a second. As you're going to see, our sequential I/O performance is not
as good as other file systems', but it is within a pretty decent constant factor. We are trying to
speed that up. Sequential I/O you can think of as very biased: a bunch of stuff all
going to the same part of the tree. We care about that case. Your question was…
>>: Tree rebalancing.
>> Rob Johnson: In terms of maintaining the balance of the tree, think of it just as in a capital-B
B-tree. We're going to do splits and joins of nodes, and more or less the algorithm is exactly the
same, except you have to split the buffer when you split a node, but it's obvious how to do that,
and it turns out that just like in a capital-B B-tree, the actual I/O cost, even though splitting and
merging is very complex in terms of the code, in terms of I/O is small. It's really not that
important. Did you have a question?
>>: I notice somehow we've got the square root of B. If you look at an infinite trace, each block is
going to be written to the root, and then it's going to be written again, and eventually it will be
written to the next level, and eventually to all of the leaves; the block will be written on the order
of log over log square root of B times, and so why is the amortized cost not log square [indiscernible]
>> Rob Johnson: Good. You're right. Every item does get written log base square root of B of N times,
but the reason we get to amortize that is because when we move an item from one level down
to the next we move a bunch of items.
>>: I see.
>> Rob Johnson: Yeah.
>>: So you skip this and end up with this? [indiscernible]
>> Rob Johnson: Let's call it our file. [laughter]. One more operation I want to just keep in
your head: range queries. The way you do a range query is pretty much the same as in a
capital-B B-tree. You do a query for the start of the range and then you read the leaves. You do
have to keep reading through the higher levels as well to check their buffers, but the total cost
ends up being log base B of N, which is the I/Os to find the start of the range, and then, if your
range query contains K items, K over B I/Os to collect all the items in your range.
>>: It's just that sometimes it is and sometimes it isn't.
>> Rob Johnson: We are going to fix the parameter at 1/2. I know, I feel a little bit like I'm
cheating.
>>: Honestly, it would probably be better to just skip the order-of notation and carry the constants
through all of this analysis. If this was a practice talk that is what I would tell you. [laughter].
>> Rob Johnson: That's not a bad idea. Okay. This is a B to the epsilon tree. I want to dive a
little bit into range query performance and argue that we can do range queries at nearly disk
bandwidth, and the reason is that we can have very large values of B. What do I mean, that we
can have very large values of B? Doesn't the disk have a block size, and just transfer data in
blocks of 512 or 4096 or whatever? But an application could choose to say, I'm going to always
issue reads of one megabyte, or I am always going to issue reads of 64 kB. If you look at the
asymptotic formulas I showed you a minute ago, it's quite obvious that making B bigger reduces the
number of I/Os I have to perform, and that is not a big surprise; I could just put my whole database
in one big block. But that is ignoring the other cost of doing a read, which is that there is a seek
time and a setup time and then there is a transfer time. Obviously the transfer time is going to grow
with this. So there is kind of a sweet spot that you have to choose in terms of how big B can be,
and it's a little bit complicated by the fact that if you work out the bandwidth costs, for an insert
it's square root of B times log base B, but for a query it's B log base B, and for range queries it's
also B log base B. But what we can do is we can tweak the data structure a little bit so that the
bandwidth costs on all of these scale with the square root of B. Once we've done that, that
means that the bandwidth costs grow very slowly with B and so we can make B really big.
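Spelling out the bandwidth bookkeeping behind that claim (a sketch; the talk keeps this at the
level of asymptotics): counting bytes moved rather than block transfers, with node size B,

    \[
    \text{insert (amortized): } O\!\left(\sqrt{B}\,\log_B N\right) \text{ bytes}, \qquad
    \text{point query: } O\!\left(B\,\log_B N\right) \text{ bytes},
    \]

and after the pivots-up-front tweak described next, a query reads only the pivots plus one buffer
per node, so its bandwidth also drops to \(O(\sqrt{B}\,\log_B N)\) bytes. Every term then grows
only like \(\sqrt{B}\), which is what lets B be pushed up to megabytes.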
>>: What's really big to you?
>> Rob Johnson: 4 megabytes, whereas the typical node size for a B-tree is like 4 kB.
>>: SQL Server is eight and it's ridiculously small.
>> Rob Johnson: Some of them use like 64 kB nodes, but they don't get much bigger than that.
Most B-trees, that's called a big leaf B-tree.
>>: Even at 4 megabytes [indiscernible] time seek. Ten, 15 milliseconds if you add them.
>> Rob Johnson: No. I think if you're using 4 megabytes and your disk is 100 megabytes per
second you're going to do about 25 seeks per second. 25 seeks is about, actually you're right,
about a quarter of a second.
>>: I've worked this out.
>> Rob Johnson: You're good, you're good. So how do we get these bandwidth costs down?
Essentially we are going to organize the internals of a node a little bit differently. We're just
going to put all of the pivots up front. That's the pointers to the children and the keys to tell
you which child it goes to. And then we'll just arrange the buffers for each child afterwards, and
each buffer will be of size square root of B. This isn't going to change the asymptotics of the
insert costs at all. When a flush occurs it can do a single seek, read this whole thing, flush
things down and then write it back to disk. But what it means is that on a query all we need to do
is read the pivots and read the one buffer holding the stuff that is going to the child we descend
into. This is still essentially the same cost in terms of I/Os, and now the bandwidth is reduced to
square root of B to read that node rather than B, but it doesn't affect the insert costs at all.
>>: Something's funny just to have that there. I didn't understand that but something very
strange just happened. When you are doing a point query I understand you read into this.
That takes you to the I/O. How do you know which buffers to read and why is there only one of
them?
>> Rob Johnson: Good, okay. There are two ways you can approach this. One would be every
buffer has a fixed size, so you just know.
>>: Then you lose the unbalanced advantage that you just told us about.
>> Rob Johnson: That's right. That baby is going out the window. I think we can get it all, but I'm
not going to describe the way that gets it all in this talk. What you do is you would read a
pivot, so you read the collection of pivots and that would tell you which child your query is
going to be routed to. It will also tell you which of these buffers you need to look in to see if
the item you are looking for is actually in the buffer of this node.
>>: It will do that, but the time to read the buffer at that point is not small.
>> Rob Johnson: That's right, so it's going to be two seeks instead of one.
>>: Two seeks and I guess the things are big enough now that they cost like two seeks to read
the whole thing. So you save not very much. It looks great because you're reading much less
data, but you're actually saving not very much time.
>> Rob Johnson: Right. I'm not going to go into the implementation that we use, but the
implementation effectively does this trick at the leaves only; for the internal nodes it doesn't
really bother. That's because most of the time the internal nodes are going to be cached. You
kind of imagine the top of the tree as cached anyway.
>>: The top of the tree [indiscernible] trees are always cached.
>> Rob Johnson: So it's not really worth it.
>>: Next to bottom.
>> Rob Johnson: I want to give you the intuition that, in terms of the asymptotics, we can get
things into the right shape to enable us to operate efficiently with very large node sizes. Now
those Bs turn into square roots of B, and if you compare that to a B-tree, in a B-tree the
bandwidth costs are B. That's why B-trees versus B epsilon trees have this different sweet spot
in terms of node size. So typical B-tree node sizes are 4 to 64 kilobytes. I can't say that there is
a typical B epsilon tree node size because there's only one B epsilon tree implementation I
know of in the world, but a good one is somewhere in the range of 2 to 4 megabytes.
>>: It's funny because claiming that 64 kB is actually a good B-tree node size only happens
when you count the wrong thing. What you actually count is time and the [indiscernible] bytes
transferred is, no one is constrained by their I/O bus bandwidth. They are constrained by the
[indiscernible] on the disk. 64 kilobytes is a nutty size for a B-tree.
>> Rob Johnson: Yeah. I'm just making a statement about what it is.
>>: I believe you, but there are two things to tease out of here. The B epsilon tree actually
does better, and also the things that they are competing with are tuned to computers of the
1970s. That's really true and so you get some credit for covering this and your competition
loses some credit for lack of [indiscernible].
>> Rob Johnson: I'll take it. But what this means is that since the nodes are really big, when
we're doing a range query, essentially, we are reading a big swath of data, and then we do a
seek and read another 4 megabytes, and another 4 megabytes, and we can do those range
queries at disk bandwidth or nearly disk bandwidth.
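As a rough illustration using the talk's earlier round numbers (a 10 ms seek and about 125 MB/s of
raw bandwidth), a range scan that seeks once per 4-megabyte node gets

    \[
    \frac{4\ \text{MB}}{10\ \text{ms} + 4\ \text{MB}/125\ \text{MB/s}} \approx \frac{4\ \text{MB}}{42\ \text{ms}} \approx 95\ \text{MB/s},
    \]

or roughly three-quarters of the disk's raw bandwidth, versus well under 1 MB/s for 4 kB reads
under the same assumptions.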
Also, there's another question that popped up: can you please briefly compare this tree to
log-structured merge trees? Who here knows log-structured merge trees? Great. A log-structured
merge tree is a very popular write-optimized data structure that's used in Cassandra and HBase and
a bunch of other modern open-source databases. It is write optimized. There are some versions of a
log-structured merge tree that have the same asymptotics as a B to the epsilon tree. Most of them
actually have worse asymptotics. The query complexity in a naïve LSM tree would be log squared,
like log base B of N times log of N. You can improve that for point queries by doing a
Bloom-filtery thing that gets point queries down to log base B of N, but it doesn't help with range
queries. You still have a log squared because you have to do a search within each of the levels
of the LSM tree. I'm sorry if nobody else knows what LSM trees are and can't follow that, but
that's sort of the short version.
that's sort of the short version. There's another feature of a B to the epsilon tree and this also
interacts poorly with LSM trees, but it works nicely in B to the epsilon trees, which is called the
upsert. A lot of times in a database you got a record of the database. You want to update it
and the normal way you would do that is read it, modify it, write it. If you think about it what
we have just seen in a B epsilon tree is that the I/O complexity of a query is like log B, but I/O
complexity of inserting is log B divided by the square root of B. So if every insert was tied to a
query we would be basically running at the speed of the query and we wouldn't be getting
these performance gains. We want to avoid doing queries whenever possible. And so and
upsert enables us to transform a read, modify write operation into a blind just insert something
into the tree. Yes, great. I like the look of that. I'm skeptical about this. Suppose I've got
maybe a banking database and I got five dollars and it's keyed by my ID. I deposit $10, so an
upsert is essentially a message that gets inserted into the tree just like any other item might be
inserted into the tree, so it's just like any other insert algorithm or operation. But an upsert
message has a destination key that this is going to apply to, and operation to be performed on
that key and value once you have found it and then parameters to the operation. This just gets
serialized and placed into the tree. You can think of it as like a continuation or a save function
with some state. This message will get flushed down the tree over time just like any other
piece of data that's been put into the tree until it finally reaches the leaf that holds the key that
it's destined for. At that point the database system will apply this operation with this
parameter to the old value, compute the new value and then replace the old value with the
new value and then this message can be discarded.
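Here is a minimal sketch of what an upsert message might look like and how it is applied. The names
and types are made up for illustration; the real system serializes an operation plus its parameters,
as described above.

    #include <cstdint>
    #include <string>

    // An upsert message: a destination key, an operation, and the operation's parameter.
    // It is inserted blindly at the root like any other item and flushed down over time.
    struct UpsertMsg {
        std::string key;                 // e.g. the account ID
        enum class Op { Add } op;        // e.g. "add this amount to the stored balance"
        std::int64_t param;              // e.g. the $10 deposit
    };

    // Applied only when the message finally reaches the leaf that holds `key`
    // (or applied on the fly while answering a query for that key).
    std::int64_t apply(const UpsertMsg& m, std::int64_t old_value) {
        switch (m.op) {
            case UpsertMsg::Op::Add:
                return old_value + m.param;   // a $5 balance plus a $10 deposit becomes $15
        }
        return old_value;
    }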
>>: And what is the order?
>> Rob Johnson: Temporal ordering, so it will maintain temporal order within this buffer, but
across buffers they will also be temporally ordered. The older upsert messages will be farther down
the tree. Things never jump over each other in this flushing process. Here's one of the bigger
places where we ignore the cost of computation, which is: what if this upsert message is still
sitting in the buffer and I do a query on my balance? What we do is we just apply the upsert
message on the fly, so the query for my balance will descend the tree, get my old balance, and then
it will walk back up the tree looking for upsert messages that apply to this key, apply them, and
then return the current value. There's a question of, once I've done this, should I update the leaf
or something like that, and there are actually some heuristics that you can use to decide whether
it's worth the additional I/O of going ahead and flushing this back to the disk. Or maybe it's
cheaper to just throw it away and the tree remains unmodified. If you're querying that thing a lot,
maybe it's worth going ahead and updating the value. If the queries are rare it's not worth it.
In this model we are just ignoring computation. We are going to ignore the cost of that. But
this lets us do read-modify-write type operations as fast as an insert. To summarize, here's the
B to the epsilon tree's performance. Inserts and upserts both run in log base B of N over square
root of B time, which is super duper fast. Point queries are log base B of N, which is the same as a
B-tree, and range queries can run in time log base B of N plus K over B, but B is really big, so this
is essentially disk bandwidth. These are the asymptotics. What does it mean in practice? If you
assume that the top of the tree is cached, then most queries are about one or maybe two seeks, and
if you are doing a range query you just read at disk bandwidth. That means if you're running on a
spinning disk, you can do hundreds of random queries per second, and then inserts and upserts
you can do tens of thousands, sometimes, if your computer is good enough, hundreds of
thousands per second.
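In symbols, the summary is:

    \[
    \text{insert/upsert: } O\!\left(\frac{\log_B N}{\sqrt{B}}\right), \qquad
    \text{point query: } O\!\left(\log_B N\right), \qquad
    \text{range query: } O\!\left(\log_B N + \frac{K}{B}\right) \text{ I/Os},
    \]

where K is the number of items returned; since B is megabytes, the K/B term amounts to reading at
close to disk bandwidth.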
>>: I just worked out the numbers.
>> Rob Johnson: Okay.
>>: And the numbers say, if B is 4 megabytes and it takes you 16 bytes for a key and an offset,
8 by 2, actually you have eight bytes of file address name, or eight bytes of disk address name,
that's its logical file [indiscernible], that means that root B is 100, a branching factor of 125,
times 4 megabytes [indiscernible], which means that a tree with just a root is
half a gigabyte. A tree with a root and one level under it is 62 gigabytes, and a three-level tree is
about 8 petabytes, which means that you rarely have a three-level tree. I was trying to
understand what this is, but the real answer is these trees are one or two levels deep. There
are asymptotics, but the asymptotics go out the window; the size grows so quickly that once
you get to about a four- or five-level tree it's bigger than the total amount of disk space produced
ever in the history of the world. Really, powers of 125 times 4 megabytes come quickly.
>> Rob Johnson: It's actually about twice that. The implementation we use actually has a fan
out average of about 10, not 125, so the tree is going to be about twice the size.
>>: What are you doing with all of that? There is a lot of data in this. What did you put in
there?
>> Rob Johnson: This was engineered for variable size keys, so you don't really know how big
the keys are. I had to go talk to the people who built it.
>>: [indiscernible]
>> Rob Johnson: Keys are actually, well, wait for it. Let me tell you about our file system. Keys
are not just file blocks.
>>: So you assume that there's synchronous feedback: a query completes and you do a new
query. What if you change the computation model to queries in batches? [indiscernible] is
synchronous [indiscernible]
>> Rob Johnson: I haven't thought about that. I believe there are some lower bounds on batch
query processing in external memory. And they are not very optimistic lower bounds. Once
your data is large enough you can't do batch queries much faster than simply doing each of the
queries one at a time. But I would have to, don't take that as gospel. Let me tell you about the
file system. Hopefully, we'll have enough time to actually get to the file system here. The point
here with these numbers is that if we want a file system that's going to get this
performance, we need to avoid queries and do blind inserts whenever possible. Here is our
schema for implementing a file system. We maintain two B to the epsilon trees, one which we
call the metadata index and one which is the data index, and our keys in the metadata
index are actually full paths. And the reason we use full paths is that it means that files that are
within the same directory will be logically adjacent to each other in the database, and so if you
want to do a recursive directory traversal that can be done at the speed of disk bandwidth. Full
paths map to struct stat information: who owns it, how big it is. The data index maps a full path
and a block offset to the data. We use full paths here again. This file system does not have
any notion of an inode, and that's a radical departure from normal file systems. What we get
for this is we get very fast directory scans, and since these are sorted by the block number, data
blocks will be laid out sequentially on disk, more or less. They are in a node, and every so often
you have to jump to another node. Rename, we are working on. Rename is the downside of this.
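To make the schema concrete, here is a sketch of the two key spaces. The types and field names are
illustrative, not the actual on-disk format.

    #include <cstdint>
    #include <string>
    #include <sys/stat.h>

    // Metadata index: full path -> stat-like record. Because keys are full paths
    // and the index is kept sorted, everything under one directory is contiguous,
    // which is what makes recursive directory scans run near disk bandwidth.
    struct MetaKey   { std::string full_path; };   // e.g. "/home/rob/linux/fs/ext4/inode.c"
    struct MetaValue { struct stat st; };          // owner, size, permissions, ...

    // Data index: (full path, block number) -> one block of file contents.
    // Sorting by (path, block) lays a file's blocks out more or less sequentially.
    struct DataKey   { std::string full_path; std::uint64_t block_no; };
    struct DataValue { char bytes[4096]; };        // the 4 kB block size is illustrative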
>>: That's always the problem.
>> Rob Johnson: Here's a quick roundup of how the operations get mapped from file system
operations to B epsilon tree operations. A read is a range query. A write can become an
upsert. Readdir is a range query. Metadata updates, which we can do very fast, can become upserts.
For example, in Linux there is this thing called atime on a file, and people often turn off updating
it. Basically, every time you read a file it has to update that thing on disk. Such an idiotic
design. I love the rant that you can read on Wikipedia about it: the designers said, let's turn
every read into a write. But hey, now we can actually do it. We can do efficient directory scans.
Only rename, as you have already seen, doesn't map nicely to the operations that I told you about
so far, and so that's a problem right now. In the interest of time I'm going to
skip the details on this. The high-level point is that upserts enable you to write new data to disk
very, very fast. Imagine an application that does a one-byte write into the middle of a page that is
cached, and that page cache is clean. Normally, what the OS would do is it would write that
byte to the in-memory cache, mark it dirty, and then later on write 4 kB of data back to the disk.
What we can do is we just write the byte back to disk. We apply the byte to the in-memory
cache and we say it's still clean. That avoids write amplification, where a small write gets
amplified into a big write. If the page wasn't cached at all we can also avoid having to read the
page in. This is one of the cool aspects of how write optimization can actually change what the
right decision is in the design of your system, whether you should do write-back or write-through
caching.
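Here is a small sketch of that design point, contrasting the two paths. The helper names are
hypothetical stand-ins, not BetrFS functions.

    #include <cstdint>
    #include <string>

    struct Page { char bytes[4096]; };

    Page* page_cache_lookup(const std::string& path, std::uint64_t block);  // hypothetical: nullptr on a miss
    void  index_upsert(const std::string& path, std::uint64_t block,
                       std::uint32_t offset, char byte);                    // hypothetical: blind message into the tree

    // Traditional write-back caching would read the page (if missing), modify it,
    // mark it dirty, and later write the whole 4 kB back. With a write-optimized
    // index we can instead patch the cached copy (if any), leave it marked clean,
    // and send just the one-byte change down as a blind upsert.
    void write_one_byte(const std::string& path, std::uint64_t block, std::uint32_t offset, char byte) {
        if (Page* page = page_cache_lookup(path, block))
            page->bytes[offset] = byte;           // cache stays clean; nothing queued for write-back
        index_upsert(path, block, offset, byte);  // no read-modify-write, no 4 kB write amplification
    }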
Here is the architecture of the system. We basically took the B to the epsilon tree
implementation and imported it as a binary blob into the kernel. This is our VFS API that
translates from file system operations to BetrFS operations. The tree code was user-space code,
so it would expect a file system underneath it, so we wrote a shim layer and then just used ext4
underneath, so ext4 is doing our block management on the disk for us. This is really kind of
amazing. This is C++ code, which in Linux is verboten, and so we just compile it in userland and
then just shove it in there and hope for the best, and it actually works great. Let me tell you
about performance results. Here's where the rubber meets the road. Do we actually get a
speed up for random writes? That was one of the things that we started out with. How do we
do on sequential I/O? We wanted to not sacrifice that. And do we actually see a speed up in
any real world applications? We tested it on a computer. That computer had a disk.
>>: [indiscernible]
>> Rob Johnson: Yeah. We didn't do large experiments.
>>: Obviously not. That's a really tiny disk.
>> Rob Johnson: We're going to compare against several other common file systems. All of the
tests start with a cold cache and all the tests end with sync operations to make sure that
everything is actually on the disk. No cheating there. Here is the time it took us to do 1,000
random 4-byte writes into a 1 gigabyte file on the different file systems. This is BetrFS's time;
lower is better. Log scale. If you work it out and you want to know what the actual
numbers are, it's over 50 times faster. That's a huge improvement in random write
performance by using a write-optimized data structure, which is what you would expect. That's
their bread and butter. This is showing the benefit of being able to do these blind writes.
Small file creation is also basically a small write operation. Creating a new file, you update a
little bit of metadata and then write to a file. The files are balanced across directories so
you don't hit any dumb file-system worst case where it's just using a linked list for its directory
data structure or something like that. What this graph is showing is the instantaneous creation
rate for new files, again on a log scale. This is after it's created yay many files. You can see
BetrFS is creating 10,000 files per second, where other file systems have dropped down to 1,000 or
maybe even something in the range of 100 files per second. Again, this is the kind of
bread-and-butter application for a write-optimized data structure.
>>: Why doesn't zfs do it better?
>> Rob Johnson: ZFS is weird. I don't know. I guess maybe it's warming up its cache. How are
we doing on time? Again, this is a log scale. Sequential I/O, here we don't win. We are reading
a gigabyte file about 40 kB at a time. Our read performance is within the ballpark of a normal
file system, and probably with some optimization we can close that gap. Our sequential
write performance is actually a half or a third of most file systems' performance. One reason
could be that we're writing all the data twice, because our BetrFS implementation does full data
logging, whereas these other file systems are not doing full data logging except for ZFS which I
understand, the students tell me it's doing logging in triplicate. It literally writes all the data
three times. We're actually working on this. We have some tricks where we are going to
preserve the semantics of full data journaling, but we are not going to have the cost. We're
going to write the data one time, but that's the ongoing work. Here is our delete performance
in the version of BetrFS that was described in the FAST paper. As the file gets bigger the time to
delete gets bigger.
>>: [indiscernible]
>> Rob Johnson: So basically have a pointer, like a symbolic link? The degeneracy is when you
have long chains of those links. If you rename A to B and B to C and C to D you don't want to
have those degeneracies.
>>: [indiscernible]
>> Rob Johnson: Yes. But then what happens if I rename a full directory and then rename something
down inside that directory? I've got to think it through, but we have some other plans. We
have fixed the delete performance by implementing, essentially, an upsert that applies to a
range of keys, so it will delete all of those messages. Now, with this, I think we are
actually faster than most of the other file systems with deletes now. The old scaling was
terrible; the new scaling is good. Does it actually benefit any real applications? Did we get the
speed up that we wanted for recursive directory scans? Here we're doing a find which is only
a recursive traversal of metadata. Here we're doing a grep, which is a recursive traversal
of data. Here is time and here is BetrFS on the find and here's BetrFS on the grep. This is a
pretty dramatic speedup for these kinds of benchmarks. You might say find and grep are artificial,
but you might think about things like backup or a virus scan on a computer. These are real
workloads that people care about.
>>: This is a file system with small files?
>> Rob Johnson: This was the Linux source code.
>>: That's a small file, okay?
>> Rob Johnson: Yes. This is because our sorting method, our way of organizing data, puts
directories and their contents close together. Some real benchmarks: here is an IMAP server
and it's doing a bunch of message reads and message markings. I don't actually know the
benchmark in detail. Here is the time to run the benchmark, and as you can see we do quite a
bit better than other file systems. But we are no longer in log-scale territory. The only one that
outperforms us is ZFS, and we have to look into how it does that. If you are rsyncing code
between two directories on the same file system, so this is rsyncing between a BetrFS directory and
a BetrFS directory, this is rsyncing between a btrfs directory and a btrfs directory, then the
megabytes per second that we achieve is quite a bit higher than the others. That's presumably
because we had to do our rsync with the --inplace flag. Otherwise it actually makes a
So we cheated a little bit. But if you do that you can basically issue blind writes to create the
new files and blind writes to write the new data to the file, so it can run really fast. In summary,
did we achieve our goals? Sequential I/O, I'm sorry, random I/O: I would say it's pretty much a
slam dunk. I'm really proud of that performance. Sequential I/O, we've got work to do, but we
have work in the pipeline. Yes, there are real world application benefits in this kind of
performance. Wrapping up, the big picture message for our work is that we believe that it is
possible to have your cake and eat it too. You can have a file system that supports good
random I/O and good sequential reading of that data back. So you don't have to make this
trade-off that you are sort of forced to with something like ext4 which is good sequential reads
but not good on random I/O or something like a log structured file system which is good for
random writes, but not so good at sequential reading. But write optimization so dramatically
changes the performance landscape of the underlying data structures that we need to revisit
design decisions that we have made in the past. We abandoned inodes and organized the data
quite differently. We use write-through caching instead of write-back caching because
writing is so cheap now and we engineered the system to perform blind writes whenever
possible. We think there's a lot of research opportunity here to figure out other ways that
these new data structures can impact systems and the way they can be used to speed things
up. And it's open source. All right. Thank you. [applause].
>>: It seems like a lot of benefits just come from the blind write optimization, which could be
done without the write optimization. You could use blind writes, a block of blind writes on any
file system.
>>: And maybe that would be a good way to get your ideas into mainstream file systems one
step at a time.
>> Rob Johnson: That's actually an interesting point. Your idea would be maybe we would
blindly write stuff into a log or into a buffer in memory?
>>: [indiscernible] keep in the buffer and a memory write.
>> Rob Johnson: One thing I can point to is there is a database that does something like this,
InnoDB, which is the database engine for MySQL, the default one. It has what's called
an insert buffer, and so as data is being inserted into an index it is being stored in memory, and
then when that buffer becomes full it tries to pick some part of the data that is all going to the
same place on disk and just flushes it all the way through, so it just inserts it into a B-tree, a
standard B-tree data structure. And this gets a speedup. InnoDB's are really
good B-trees, and it gets you maybe a three or five times speedup, but it doesn't get you a 30 or
100x speedup in that setting. Maybe we could adapt it or do something a little bit different in
file systems. Yeah.
>>: I think disks are on the way out. [laughter]. There's this new thing called flash that's
coming in, and that changes all of this in the flash world.
>> Rob Johnson: Some of my collaborators have benchmarked this on flash. Martin at Rutgers
is doing that and there are two comments I would say about flash. One, I haven't seen the
benchmark results. One is since this does writes in big chunks like 4 megabytes, it's actually
much kinder to your flash in terms of erase cycles and having to do garbage
collection and all of that work that the flash translation layer has to do. The other thing I would say
is although I don't know the file system benchmarks, I have seen benchmark results of a
database built on the same [indiscernible] backend on flash and it dramatically outperforms a
B-tree on the same flash disk. Even though flash doesn't have the big seek time, there is still set
up costs. So you can get a win out of doing this. The next question usually asked is what about
nonvolatile RAM? Aren't we all just going to have our file systems and everything in RAM? And
there I don't have a good answer for you. That might be a disruptive technology for this.
>>: Yeah, but we stopped believing that.
>> Rob Johnson: Oh, really?
>>: Yeah, we have been waiting for phase change memory now forever, [indiscernible] phase
change memory plus 10 years. So there's the battery-backed set of technologies and a lot of things
that don't seem to work.
>>: Any other questions? Thank Rob. [applause]