>> Bryan Parno: Good morning everybody and thank you for joining us. We're very fortunate
to have Professor Rob Johnson visiting us here. I know him from when he was a grad student at
Berkeley. The very first conference I went to I was trying to save money and he had a spare bed
in the room he was renting for the conference, so he looked out for me when I was a very
young grad student. It's great to have him here. I'm familiar with his work in the security world
where he has done some very cool stuff on language-based security looking for vulnerabilities
in the Linux kernel. But lately he's been looking at optimized data structures, typically for
high-performance systems, and in this case file systems. Rob.
>> Rob Johnson: Thanks Bryan. Good morning and thanks for having me here. This is a talk
based on BetrFS, which is a project with a lot of authors from a lot of different institutions. I'm
actually not going to name everybody, but it's a collaboration between Stony Brook, which is
where I'm from now; Tokutek Incorporated, a startup founded by some of my co-authors; Rutgers;
and MIT. This is an expanded version of a talk from FAST, Linux FAST. My goal here is to
communicate information, so feel free to ask me questions and stop me at any time because I
really want to teach you about what optimized data structures are if you don't already know
them, and how they can benefit file system design, but also how they impact file system and
system design. They are kind of a radical departure from the data structures that we are used
to. Let me begin by explaining the motivation. If you take an off-the-shelf system, something
like ext4, which is kind of the default file system in most Linux installations, and you run a
benchmark where you just do a gigabyte of sequential writes, and you run it on a spinning
magnetic disk where the disk is rated for 125 megabytes per second, you'll see that ext4 gets a
throughput of about 104 megabytes per second. For a sequential write load ext4 does great.
It's getting most of the bandwidth out of the disk. On the other hand, if you do a random
write workload, same file system, same disk, but you're doing small random writes over a 1
gigabyte file, the throughput of ext4 is only about 1.5 megabytes per second. It's hardly
getting any of the performance out of the disk. It's really wasting a lot of potential
performance; that little sliver is what it is actually getting. What's going on here? You've
probably guessed that the real problem is that these random writes induce a lot of seeks by the
disk. If we do a back-of-the-envelope calculation, the average seek time on the disk is
around 10 or 11 milliseconds. The OS is writing data at 4 kB granularity, so there is about one
seek for every 4 kB. If you work that out, that would imply about half a megabyte of
bandwidth for random writes like this. We did a little bit better, but we weren't seeking over
the entire disk. We were seeking over a 1 gigabyte file, and presumably Linux did a little bit of
bio scheduling to try to get things in some rough order. It's not surprising that it did a little bit
better. And the disk might've done some more [indiscernible]
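As a rough back-of-the-envelope check, using the round numbers above (about a 10 ms seek and
one 4 kB write per seek):

    \[
    \text{random-write throughput} \approx \frac{4\ \text{kB}}{10\ \text{ms}} = 400\ \text{kB/s} \approx 0.4\ \text{MB/s},
    \]

which is the "about half a megabyte" figure, compared with the disk's 125 MB/s of sequential
bandwidth.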
>>: [indiscernible]
>> Rob Johnson: Okay. The disk definitely did some…
>>: You did [indiscernible] rotation time too, right? [indiscernible]
>> Rob Johnson: Yeah. Still, this is a problem if you are doing random seeks for every I/O.
There are other file systems that try to overcome this. The most well-known class of file
systems is log-structured file systems, and what they do is whenever you write new data
they just append it to a log. Appending to a log doesn't require any seeks, and so you can write
data really, really fast. The problem is that over time, as you write to different locations
of a file, the file gets kind of scattered over the disk and it's not stored in any meaningful
order on disk. Then when you go to read back the file, you have to do all of those seeks
and they can be very slow. Log-structured file systems have a different performance trade-off,
but they still represent a trade-off between these two types of operations, random writes and
sequential reads. What BetrFS does is it uses a new class of data structures called write-optimized
indices, and write-optimized indices can take in data at a very high rate, so they are not really
seek bound. They are more bandwidth bound, but they do maintain logical locality, so things
that are logically consecutive are stored more or less physically consecutively on the disk.
Then when you go to read them back you can read them back very fast. One of the main
contributions of our BetrFS is a schema for mapping file system operations down to operations
that the write optimized index can perform efficiently. We're trying to extract as much of the
performance from the write optimized index as we can and carry it over to the file system.
Another important thing we did was we implemented all of this inside the Linux kernel rather
than using FUSE or some other kind of external interface between the kernel and the file system.
We found there were some opportunities for redesigning the interaction between the file
system and the rest of the kernel to get some additional performance and I'll tell you a little bit
about that. We're not the first to do a file system based on a write optimized index. There've
been several others that have shown that write optimization can be used to speed up some
operations in file systems. Our goal really is to take things as far as we can, to take this write
optimization to its logical conclusion in file systems and to see how much we can get done. Also,
this prior work was all in user space and as I mentioned we want to understand how write
optimization affects the system design and the design of the higher-level file system. We
wanted to do things in the kernel. That's the overview. I want to dive down now into what a
write-optimized data structure is and how you can build one and what its performance
characteristics are. For that I just want to introduce a simple performance model that we'll use
to think about how well different data structures perform. This is called the disk access
machine model. It's also sometimes called the external memory model, but it should hopefully
look pretty intuitive to you. The computer has RAM of size M, but that's not really going to
come up in this talk. Then you have a disk of however big size you need. We don't care about
the size of the disk. In order to operate on any data it has to be brought from the disk into RAM
and data is moved in blocks of size B and then if RAM gets full you have to shuffle something
back out to disk. The data size is going to be N and so when you see Ns later you'll know what
they are referring to, the total number of data items. All we care about in this model is how
many block transfers we perform during some operation, like a lookup or an insert of a new
item. We don't care about computation in memory at all. We're going to completely ignore
that. We're going to just set that aside and maybe we'll come back and think about it later.
>>: Are all the blocks going to be the same size?
>> Rob Johnson: Yes. This model simplifies lots of stuff. All the blocks are the same size.
>>: So metadata and data in particular, so you are just counting seeks, really?
>> Rob Johnson: Yes. We are just counting seeks. We don't care about whether two blocks are
adjacent or not. Every seek is a seek. We just count seeks. There is a lot of stuff that this
model throws away. We're going to add back in bandwidth a little bit later in the talk. It's a
very helpful model for understanding performance of data structures. As a warm-up, let's look
at B trees. Who here knows B trees? Everyone knows B trees, okay. Let's do it just to make
sure that we've got the terminology on the page real quick. In a capital-B B-tree you've got nodes of
size B, so every time you access a block you are just going to read in the whole node, and the
entire space of the node is used to store pointers to the children and the pivot keys that tell you
which child you need to go to next. The fanout is B, or roughly B. The height of the tree is going
to be log base B of N, and what that means is that if you want to do a query you have to do log
base B of N block transfers to get to the leaf, and if you want to do an insert of a new item,
basically, an insert is a query for an item to get to the leaf that it would go into, then put it in
the leaf and then go write that leaf back. So it's more or less just the same cost as a query.
Maybe it's even a B+ tree. I don't know. We're not going to use them so I wasn't too careful
about it. This data structure, by the way, you guys all know this so I am preaching to the choir
maybe. This has been the data structure for databases for like 40 years. There are file systems
based on this, like btrfs in Linux.
>> Like NTFS.
>> Rob Johnson: NTFS is a B tree file system?
>>: The directories are B-trees. It's extents and B-trees. That's been true for at least 20 years.
>> Rob Johnson: So it's a very venerable data structure, but it turns out it's not the best you
can do. There is an optimal trade-off curve between the I/O cost of inserts and queries in an
on-disk data structure. This was proven back in 2003. The curve is parameterized by an epsilon,
and I'm not going to ask you to read or understand this. There will be a quiz on this part at the
end of the talk. But it turns out that B-trees are basically at one end of the curve. If you plug in
epsilon equals one you get log base B of N for both operations. The point that I really care
about is this one, because if you plug in epsilon equals one half then you are still going to get log
base B of N for queries. What you actually get is basically log base square root of B of N, but
log base square root of B is just twice log base B, so this is still order log base B. But the big
doozy is that you get a square root of B in the denominator for the insert cost. And if you think
about what a typical value for B might be, it could be something like 1000. The square root of B
is roughly like 30, and so what this means is that you can build a data structure that can ingest
data an order of magnitude or two orders of magnitude faster than a B-tree, but its query
performance is essentially the same. Yes?
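In symbols, the trade-off curve being described here (a sketch of the bound as it is usually
stated; the exact constants aren't spelled out in the talk) is, for a parameter \(0 < \varepsilon \le 1\):

    \[
    \text{insert: } O\!\left(\frac{\log_B N}{\varepsilon\, B^{1-\varepsilon}}\right) \text{ I/Os}, \qquad
    \text{point query: } O\!\left(\frac{\log_B N}{\varepsilon}\right) \text{ I/Os}.
    \]

Plugging in \(\varepsilon = 1\) gives the B-tree point, \(O(\log_B N)\) for both; plugging in
\(\varepsilon = 1/2\) gives queries in \(O(2\log_B N) = O(\log_B N)\) and inserts in
\(O(\log_B N / \sqrt{B})\).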
>>: What about the constant factor?
>> Rob Johnson: There really aren't any hidden constant factors here except for maybe that
factor of two because the tree will be twice as high. However, I wouldn't get too hung up on
that.
>>: [indiscernible]
>> Rob Johnson: I think there might really be an extra factor of two that the other one is not
paying. What my co-authors have told me they've seen in doing their startup is that when they go
to customer sites, oftentimes it's not really a matter of a trade-off between the queries and the
inserts in maintaining an index. For many people's workloads the data comes in so fast that the
only index that they can maintain is a timestamp index. When it comes time to query they have
no index. This makes it possible to have an index at all. So it's not really a trade-off between
this and this with a factor of two that's missing. It's this and well, I didn't have an index; I had to
scan the data. Here's a cartoony version of this trade-off curve. If inserts being slow are here
and fast over here, and queries being slow versus fast, then a B-tree has fast queries but, as far as
this trade-off curve is concerned, kind of slow inserts. Fortunately, the shape of the
curve means there's a huge opportunity for us to slide our data structure along in this direction
and get dramatic speedups in inserts with only a modest slowdown in queries. That's the
target opportunity for optimized indexes. Questions? Everything so far so good?
>>: On this slow and fast, slow is log slow on the one axis and linear slow on the other one,
right? If you're just logging and you don't have any index then the queries are linear slow.
>> Rob Johnson: Actually, I think logging, technically logging is not on the range of the curve
that I showed you. Let me show you an example of a data structure that moves along this curve
to the point that we were looking at. This data structure is called the B to the epsilon tree,
which is where you can guess we got the name BetrFS. It's very similar to a B-tree, but we're
going to use the space in our nodes of size B differently. We're only going to allocate a small
amount of that space to pivots and pointers to children, so the fanout of the tree is going to be
square root of B instead of B. That means that the height of the tree is going to be log base square
root of B instead of log base B. Then what we're going to do with the rest of the space in the node
is use it as a buffer for newly inserted items. Whenever we insert an item into the
tree, we simply stick it in the buffer of the root of the tree. La dee da, I've been inserting
items. This is really fast. I don't have to do any log-base-B type of work; if I've got the root of the
tree cached in memory this is basically no I/O at all. I'm just adding things to the buffer. Once the
buffer fills up I just flush all of the items down to the buffers of the children, so things only move
one level down at a time.
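To make the buffering concrete, here is a minimal in-memory sketch of a node and the insert/flush
logic just described. This is illustrative only, not the actual BetrFS/TokuDB code; the names and
the buffer capacity are made up.

    #include <map>
    #include <memory>
    #include <string>
    #include <vector>

    // Sketch of a B^epsilon-tree node: a small set of pivots (fanout ~ sqrt(B))
    // plus a buffer that uses the rest of the node's space for pending inserts.
    struct Node {
        bool leaf = false;
        std::vector<std::string> pivots;              // routing keys, about sqrt(B) of them
        std::vector<std::unique_ptr<Node>> children;  // children.size() == pivots.size() + 1
        std::map<std::string, std::string> buffer;    // pending (key, value) items; leaves hold the records here

        static constexpr std::size_t kBufferCapacity = 1024;  // stands in for "the rest of the node"

        void insert(const std::string& key, const std::string& value) {
            buffer[key] = value;                      // blind insert: no root-to-leaf traversal needed
            if (!leaf && buffer.size() >= kBufferCapacity) flush();
        }

        // Flushing empties the buffer by moving every pending item one level down,
        // into the buffer of whichever child it routes to.
        void flush() {
            for (auto& [key, value] : buffer) child_for(key)->insert(key, value);
            buffer.clear();
        }

        Node* child_for(const std::string& key) {
            std::size_t i = 0;
            while (i < pivots.size() && key >= pivots[i]) ++i;
            return children[i].get();
        }
    };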
Let's think about what the cost of different operations is going to be. Does everyone understand
the data structure so far? Items will always live on the path from the root to the leaf that they
belong in. We can find an item by simply doing the normal search algorithm. The only extra cost is
that we have to look inside the buffers to see if maybe the item hasn't gotten all the way to a leaf
yet. We are ignoring computation, so that's basically free. We don't count that. A query still
basically takes log base square root of B of N I/Os: it's just a root-to-leaf path traversal, and the
height is log base square root of B of N.
>>: [indiscernible] almost always hit one of the buffers.
>> Rob Johnson: I'm not sure. It depends on the workload.
>>: [indiscernible] you've got a lot of space in the buffers. This data structure, you are going
to want to put most of the space in the buffers rather than in the leaves. [indiscernible] happen
anyway.
>> Rob Johnson: Let's suppose the square root of B is 30. Then I think you'll have this space
divided by 30 at this level and then that divided by 30 again at the next level. I would estimate
about 3 percent of the space.
>>: You would have the leaf be size B. The row above it, so if the square root of B is 30,
then B is 900, so that means you've got 870 at each of the next-to-leaf nodes and then 900 in the
bottom, so about half of your space is one level up.
>>: There's 30 times as many leaves as there are [indiscernible]
>> Rob Johnson: So I think the total space in the top of the tree, with fanout square root of B,
would be about 3 percent of the total.
>>: Right, so it really doesn't matter.
>> Rob Johnson: Most of your data is going to be in the leaves. Yes?
>>: It seems like it's making access quick to the data near the root, but isn't that data in your
RAM anyway and it doesn't need to be accessed quickly? I don't understand.
>> Rob Johnson: We are avoiding the writes. Queries are the same. Your queries still go all the
way from root to leaf. In fact, as you're going to see we're going to end up going all the way
from root to leaf anyway.
>>: I see.
>>: So on a write, you don't have to write.
>> Rob Johnson: I really appreciate the question.
>>: I appreciate the answer.
>> Rob Johnson: Let's analyze the insert cost. We're going to do an amortized analysis. The
number of I/Os we need to perform a flush is: we need to load square root of B things, so that's
square root of B I/Os. You could count it as square root of B plus 1; just call it square root of B.
And then how many items do we get to move as part of this operation? We move B minus
square root of B. Let's just call that B, because square root of B is nothing compared to B.
The amortized I/O required to move one item down one level is square root of B I/Os divided
by B things that come down. So it's one over square root of B I/Os to move one thing down.
And everything needs to move down that many levels. So the amortized I/O cost to get
everything inserted into the database is this amortized I/O per element to move down one step,
times the number of times an element needs to move down one step, and if you simplify
that you just get log base B of N divided by the square root of B. This data structure
is super duper fast at handling inserts of new data. And lookups are basically the same as in a
B-tree.
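Written out as a formula, that amortized argument is:

    \[
    \underbrace{\frac{\sqrt{B}\ \text{I/Os per flush}}{\approx B\ \text{items moved per flush}}}_{\text{amortized I/O to move one item down one level}}
    \;\times\; \underbrace{\log_{\sqrt{B}} N}_{\text{levels to descend}}
    \;=\; \frac{2\log_B N}{\sqrt{B}}
    \;=\; O\!\left(\frac{\log_B N}{\sqrt{B}}\right)\ \text{I/Os per insert.}
    \]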
I just noticed a question online. This is probably a little bit of an old question. They asked whether
I can precisely define sequential write and random write with an example or a scenario. Sorry, I
didn't see the question sooner. A sequential write would typically be something along the lines of:
you just call write with a 1 gigabyte buffer and say, here, write all this data to disk, at
the application level; or you might do a bunch of small writes, maybe you're writing 64 kB at a
time, but each 64 kB comes right after the last 64 kB in the file. A random write you can think of
as: we run a random number generator that tells us the offset within the file that we want to
write, and then we write maybe a byte of data or eight bytes of data, some small amount of
data, to that offset in the file. Sequential writes occur all the time. You can think about any sort
of multimedia application, streaming data from a video camera or something like that. Random
writes occur often in databases where data is arriving and needs to be added into an index.
>>: This works very well if the workload is uniformly distributed over the [indiscernible]. You
push square root of B down when you need to flush. Do you need to consider uneven
scenarios?
>> Rob Johnson: Actually, you can do even better if you have bias. There are many slight
tweaks on this data structure. I kind of gave you a pedagogically simple one, where when
a node gets full you flush to all of the children. You could be greedy and say, I'm going to figure out
which of my children is going to get the most of the stuff and I'm only going to flush to that
child. You still get the same amortized cost, but what that would mean is if someone was
hammering on a particular region of the database, just inserting a bunch of stuff into one leaf, then
this buffer would be completely full of things that were all going to the same child and so your
amortized cost would be one over B instead of one over square root of B.
>>: So what you're saying is that in your system you would flush in a biased way depending on the
activity?
>>: [indiscernible]
>> Rob Johnson: Yes, you can do biased flushing. In fact, one thing we have been working on, I
will get to your question in a second. As you're going to see, our sequential I/O performance is not
as good as other file systems', but it is within a pretty decent constant factor. We are trying to
speed that up. Sequential I/O you can think of as very biased: a bunch of stuff all
going to the same part of the tree. We care about that case. Your question was…
>>: Tree rebalancing.
>> Rob Johnson: In terms of maintaining the balance of the tree, think of it just as in a capital-B
B-tree. We're going to do splits and joins of nodes, and more or less the algorithm is exactly the
same, except you have to split the buffer when you split a node, but it's obvious how to do that,
and it turns out that just like in a capital-B B-tree, the actual I/O cost, even though splitting and
merging is very complex in terms of the code, in terms of I/O is small. It's really not that
important. Did you have a question?
>>: I notice somehow we've got the square root of B. If you look at an infinite trace, each block is
going to be written to the root, and then it's going to be written again, and eventually it will be
written to the next level, and eventually to all of the leaves; the block will be written on the order
of log over log square root of B times, and so why is the amortized cost not log square [indiscernible]
>> Rob Johnson: Good. You're right. Every item does get written log base square root of B of N times,
but the reason we get to amortize that is because when we move an item from one level down
to the next we move a bunch of items.
>>: I see.
>> Rob Johnson: Yeah.
>>: So you skip this and end up with this? [indiscernible]
>> Rob Johnson: Let's call it our file. [laughter]. One more operation I want to just keep in
your head: range queries. The way you do a range query is pretty much the same as in a
capital-B B-tree. You do a query for the start of the range and then you read the leaves. You do
have to keep reading through the higher levels as well to check their buffers, but the total cost
ends up being log base B of N, which is the I/Os to find the start of the range, and then, if your
range query contains K items, K over B I/Os to collect all the items in your range.
>>: It's just that sometimes it is and sometimes it isn't.
>> Rob Johnson: We are going to fix the parameter at 1/2. I know, I feel a little bit like I'm
cheating.
>>: Honestly, it would probably be better to just skip the order-of notation and carry the constants
through all of this analysis. If this was a practice talk that is what I would tell you. [laughter].
>> Rob Johnson: That's not a bad idea. Okay. This is a B to the epsilon tree. I want to dive a
little bit into range query performance and argue that we can do range queries at nearly disk
bandwidth, and the reason is that we can have very large values of B. What do I mean, that we
can have very large values of B? Doesn't the disk have a block size, and just transfer data in
blocks of 512 or 4096 or whatever? But an application could choose to say, I'm going to always
issue reads of one megabyte, or I am always going to issue reads of 64 kB. If you look at the
asymptotic formulas I showed you a minute ago, it's quite obvious that making B bigger reduces the
number of I/Os I have to perform, and that is not a big surprise; I could just put my whole database
in one big block. But that is ignoring the other cost of doing a read, which is that there is a seek
time and a setup time and then there is a transfer time. Obviously the transfer time is going to grow
with this. So there is kind of a sweet spot that you have to choose in terms of how big B can be,
and it's a little bit complicated by the fact that if you work out the bandwidth costs, for an insert
it's square root of B times log base B, but for a query it's B log base B, and for range queries it's
also B log base B. But what we can do is we can tweak the data structure a little bit so that the
bandwidth costs on all of these scale with the square root of B. Once we've done that, that
means that the bandwidth costs grow very slowly with B and so we can make B really big.
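Spelling out the bandwidth bookkeeping behind that claim (a sketch; the talk keeps this at the
level of asymptotics): counting bytes moved rather than block transfers, with node size B,

    \[
    \text{insert (amortized): } O\!\left(\sqrt{B}\,\log_B N\right) \text{ bytes}, \qquad
    \text{point query: } O\!\left(B\,\log_B N\right) \text{ bytes},
    \]

and after the pivots-up-front tweak described next, a query reads only the pivots plus one buffer
per node, so its bandwidth also drops to \(O(\sqrt{B}\,\log_B N)\) bytes. Every term then grows
only like \(\sqrt{B}\), which is what lets B be pushed up to megabytes.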
>>: What's really big to you?
>> Rob Johnson: 4 megabytes, whereas the typical node size for a B-tree is like 4 kB.
>>: SQL Server is eight and it's ridiculously small.
>> Rob Johnson: Some of them use like 64 kB nodes, but they don't get much bigger than that.
Most B-trees, that's called a big leaf B-tree.
>>: Even at 4 megabytes [indiscernible] time seek. Ten, 15 milliseconds if you add them.
>> Rob Johnson: No. I think if you're using 4 megabytes and your disk is 100 megabytes per
second you're going to do about 25 seeks per second. 25 seeks is about, actually you're right,
about a quarter of a second.
>>: I've worked this out.
>> Rob Johnson: You're good, you're good. So how do we get these bandwidth costs down?
Essentially we are going to organize the internals of a node a little bit differently. We're just
going to put all of the pivots up front. That's the pointers to the children and the keys to tell
you which child it goes to. And then we'll just arrange the buffers for each child afterwards, and
each buffer will be of size square root of B. This isn't going to change the asymptotics of the
insert costs at all. When a flush occurs it can do a single seek, read this whole thing, flush
things down and then write it back to disk. But what it means is that on a query all we need to do
is read the pivots and read the one buffer holding the stuff that is going to the child we descend
into. This is still essentially the same cost in terms of I/Os, and now the bandwidth is reduced to
square root of B to read that node rather than B, but it doesn't affect the insert costs at all.
>>: Something's funny just to have that there. I didn't understand that but something very
strange just happened. When you are doing a point query I understand you read into this.
That takes you to the I/O. How do you know which buffers to read and why is there only one of
them?
>> Rob Johnson: Good, okay. There are two ways you can approach this. One would be every
buffer has a fixed size, so you just know.
>>: Then you lose the unbalanced advantage that you just told us about.
>> Rob Johnson: That's right. That baby is going out the window. I think we can get it all, but I'm
not going to describe the way that gets it all in this talk. What you do is you would read a
pivot, so you read the collection of pivots and that would tell you which child your query is
going to be routed to. It will also tell you which of these buffers you need to look in to see if
the item you are looking for is actually in the buffer of this node.
>>: It will do that, but the time to read the buffer at that point is not small.
>> Rob Johnson: That's right, so it's going to be two seeks instead of one.
>>: Two seeks and I guess the things are big enough now that they cost like two seeks to read
the whole thing. So you save not very much. It looks great because you're reading much less
data, but you're actually saving not very much time.
>> Rob Johnson: Right. I'm not going to go into the implementation that we use, but the
implementation effectively does this trick at the leaves only; for the internal nodes it doesn't
really bother. That's because most of the time the internal nodes are going to be cached. You
kind of imagine the top of the tree as cached anyway.
>>: The top of the tree [indiscernible] trees are always cached.
>> Rob Johnson: So it's not really worth it.
>>: Next to bottom.
>> Rob Johnson: I want to give you the intuition that, in terms of the asymptotics, we can get
things into the right shape to enable us to operate efficiently with very large node sizes. Now
those Bs turn into square roots of B, and if you compare that to a B-tree, in a B-tree the
bandwidth costs are B. That's why B-trees versus B epsilon trees have this different sweet spot
in terms of node size. So typical B-tree node sizes are 4 to 64 kilobytes. I can't say that there is
a typical B epsilon tree node size because there's only one B epsilon tree implementation I
know of in the world, but a good one is somewhere in the range of 2 to 4 megabytes.
>>: It's funny because claiming that 64 kB is actually a good B-tree node size only happens
when you count the wrong thing. What you actually count is time and the [indiscernible] bytes
transferred is, no one is constrained by their I/O bus bandwidth. They are constrained by the
[indiscernible] on the disk. 64 kilobytes is a nutty size for a B-tree.
>> Rob Johnson: Yeah. I'm just making a statement about what it is.
>>: I believe you, but there are two things to tease out of here. The B epsilon tree actually
does better, and also the things that they are competing with are tuned to computers of the
1970s. That's really true and so you get some credit for covering this and your competition
loses some credit for lack of [indiscernible].
>> Rob Johnson: I'll take it. But what this means is that since the nodes are really big, when
we're doing a range query, essentially, we are reading a big swath of data, and then we do a
seek and read another 4 megabytes, and another 4 megabytes, and we can do those range
queries at disk bandwidth or nearly disk bandwidth.
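As a rough illustration using the talk's earlier round numbers (a 10 ms seek and about 125 MB/s of
raw bandwidth), a range scan that seeks once per 4-megabyte node gets

    \[
    \frac{4\ \text{MB}}{10\ \text{ms} + 4\ \text{MB}/125\ \text{MB/s}} \approx \frac{4\ \text{MB}}{42\ \text{ms}} \approx 95\ \text{MB/s},
    \]

or roughly three-quarters of the disk's raw bandwidth, versus well under 1 MB/s for 4 kB reads
under the same assumptions.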
Also, there's another question that popped up: can you please briefly compare this tree to
log-structured merge trees? Who here knows log-structured merge trees? Great. A log-structured
merge tree is a very popular write-optimized data structure that's used in Cassandra and HBase and
a bunch of other modern open-source databases. It is write optimized. There are some versions of a
log-structured merge tree that have the same asymptotics as a B to the epsilon tree. Most of them
actually have worse asymptotics. The query complexity in a naïve LSM tree would be log squared,
like log base B of N times log of N. You can improve that for point queries by doing a
Bloom-filtery thing that gets point queries down to log base B of N, but it doesn't help with range
queries. You still have a log squared because you have to do a search within each of the levels
of the LSM tree. I'm sorry if nobody else knows what LSM trees are and can't follow that, but
that's sort of the short version.
that's sort of the short version. There's another feature of a B to the epsilon tree and this also
interacts poorly with LSM trees, but it works nicely in B to the epsilon trees, which is called the
upsert. A lot of times in a database you got a record of the database. You want to update it
and the normal way you would do that is read it, modify it, write it. If you think about it what
we have just seen in a B epsilon tree is that the I/O complexity of a query is like log B, but I/O
complexity of inserting is log B divided by the square root of B. So if every insert was tied to a
query we would be basically running at the speed of the query and we wouldn't be getting
these performance gains. We want to avoid doing queries whenever possible. And so and
upsert enables us to transform a read, modify write operation into a blind just insert something
into the tree. Yes, great. I like the look of that. I'm skeptical about this. Suppose I've got
maybe a banking database and I got five dollars and it's keyed by my ID. I deposit $10, so an
upsert is essentially a message that gets inserted into the tree just like any other item might be
inserted into the tree, so it's just like any other insert algorithm or operation. But an upsert
message has a destination key that this is going to apply to, and operation to be performed on
that key and value once you have found it and then parameters to the operation. This just gets
serialized and placed into the tree. You can think of it as like a continuation or a save function
with some state. This message will get flushed down the tree over time just like any other
piece of data that's been put into the tree until it finally reaches the leaf that holds the key that
it's destined for. At that point the database system will apply this operation with this
parameter to the old value, compute the new value and then replace the old value with the
new value and then this message can be discarded.
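Here is a minimal sketch of what an upsert message might look like and how it is applied. The names
and types are made up for illustration; the real system serializes an operation plus its parameters,
as described above.

    #include <cstdint>
    #include <string>

    // An upsert message: a destination key, an operation, and the operation's parameter.
    // It is inserted blindly at the root like any other item and flushed down over time.
    struct UpsertMsg {
        std::string key;                 // e.g. the account ID
        enum class Op { Add } op;        // e.g. "add this amount to the stored balance"
        std::int64_t param;              // e.g. the $10 deposit
    };

    // Applied only when the message finally reaches the leaf that holds `key`
    // (or applied on the fly while answering a query for that key).
    std::int64_t apply(const UpsertMsg& m, std::int64_t old_value) {
        switch (m.op) {
            case UpsertMsg::Op::Add:
                return old_value + m.param;   // a $5 balance plus a $10 deposit becomes $15
        }
        return old_value;
    }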
>>: And what is the order?
>> Rob Johnson: Temporal ordering, so it will maintain temporal order within this buffer, but
across buffers they will also be temporally ordered. The older upsert messages will be farther down
the tree. Things never jump over each other in this flushing process. Here's one of the bigger
places where we ignore the cost of computation, which is: what if this upsert message is still
sitting in the buffer and I do a query on my balance? What we do is we just apply the upsert
message on the fly, so the query for my balance will descend the tree, get my old balance, and then
it will walk back up the tree looking for upsert messages that apply to this key, apply them, and
then return the current value. There's a question of, once I've done this, should I update the leaf
or something like that, and there are actually some heuristics that you can use to decide whether
it's worth the additional I/O of going ahead and flushing this back to the disk. Or maybe it's
cheaper to just throw it away and the tree remains unmodified. If you're querying that thing a lot,
maybe it's worth going ahead and updating the value. If the queries are rare it's not worth it.
In this model we are just ignoring computation. We are going to ignore the cost of that. But
this lets us do read-modify-write type operations as fast as an insert. To summarize, here's the
B to the epsilon tree's performance. Inserts and upserts both run in log base B of N over square
root of B time, which is super duper fast. Point queries are log base B of N, which is the same as a
B-tree, and range queries can run in time log base B of N plus K over B, but B is really big, so this
is essentially disk bandwidth. These are the asymptotics. What does it mean in practice? If you
assume that the top of the tree is cached, then most queries are about one or maybe two seeks, and
if you are doing a range query you just read at disk bandwidth. That means if you're running on a
spinning disk, you can do hundreds of random queries per second, and then inserts and upserts
you can do tens of thousands, sometimes, if your computer is good enough, hundreds of
thousands per second.
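In symbols, the summary is:

    \[
    \text{insert/upsert: } O\!\left(\frac{\log_B N}{\sqrt{B}}\right), \qquad
    \text{point query: } O\!\left(\log_B N\right), \qquad
    \text{range query: } O\!\left(\log_B N + \frac{K}{B}\right) \text{ I/Os},
    \]

where K is the number of items returned; since B is megabytes, the K/B term amounts to reading at
close to disk bandwidth.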
>>: I just worked out the numbers.
>> Rob Johnson: Okay.
>>: And the numbers say, if B is 4 megabytes and it takes you 16 bytes for a key and an offset,
8 by 2, actually you have eight bytes of file address name, or eight bytes of disk address name,
that's its logical file [indiscernible], that means that root B is 100, a branching factor of 125,
times 4 megabytes [indiscernible], which means that a tree with just a root is
half a gigabyte. A tree with a root and one level under it is 62 gigabytes, and a three-level tree is
about 8 petabytes, which means that you rarely have a three-level tree. I was trying to
understand what this is, but the real answer is these trees are one or two levels deep. There
are asymptotics, but the asymptotics go out the window; the size grows so quickly that once
you get to about a four- or five-level tree it's bigger than the total amount of disk space produced
ever in the history of the world. Really, powers of 125 times 4 megabytes come quickly.
>> Rob Johnson: It's actually about twice that. The implementation we use actually has a fan
out average of about 10, not 125, so the tree is going to be about twice the size.
>>: What are you doing with all of that? There is a lot of data in this. What did you put in
there?
>> Rob Johnson: This was engineered for variable size keys, so you don't really know how big
the keys are. I had to go talk to the people who built it.
>>: [indiscernible]
>> Rob Johnson: Keys are actually, well, wait for it. Let me tell you about our file system. Keys
are not just file blocks.
>>: So you assume that there's synchronous feedback: a query completes and you do a new
query. What if you change the computation model to queries in batches? [indiscernible] is
synchronous [indiscernible]
>> Rob Johnson: I haven't thought about that. I believe there are some lower bounds on batch
query processing in external memory. And they are not very optimistic lower bounds. Once
your data is large enough you can't do batch queries much faster than simply doing each of the
queries one at a time. But I would have to, don't take that as gospel. Let me tell you about the
file system. Hopefully, we'll have enough time to actually get to the file system here. The point
here with these numbers is that if we want a file system that's going to get this
performance, we need to avoid queries and do blind inserts whenever possible. Here is our
schema for implementing a file system. We maintain two B to the epsilon trees, one which we
call the metadata index and one which is the data index, and our keys in the metadata
index are actually full paths. And the reason we use full paths is that it means that files that are
within the same directory will be logically adjacent to each other in the database, and so if you
want to do a recursive directory traversal that can be done at the speed of disk bandwidth. Full
paths map to struct stat information: who owns it, how big it is. The data index maps a full path
and a block offset to the data. We use full paths here again. This file system does not have
any notion of an inode, and that's a radical departure from normal file systems. What we get
for this is we get very fast directory scans, and since these are sorted by the block number, data
blocks will be laid out sequentially on disk, more or less. They are in a node, and every so often
you have to jump to another node. Rename, we are working on. Rename is the downside of this.
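To make the schema concrete, here is a sketch of the two key spaces. The types and field names are
illustrative, not the actual on-disk format.

    #include <cstdint>
    #include <string>
    #include <sys/stat.h>

    // Metadata index: full path -> stat-like record. Because keys are full paths
    // and the index is kept sorted, everything under one directory is contiguous,
    // which is what makes recursive directory scans run near disk bandwidth.
    struct MetaKey   { std::string full_path; };   // e.g. "/home/rob/linux/fs/ext4/inode.c"
    struct MetaValue { struct stat st; };          // owner, size, permissions, ...

    // Data index: (full path, block number) -> one block of file contents.
    // Sorting by (path, block) lays a file's blocks out more or less sequentially.
    struct DataKey   { std::string full_path; std::uint64_t block_no; };
    struct DataValue { char bytes[4096]; };        // the 4 kB block size is illustrative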
>>: That's always the problem.
>> Rob Johnson: Here's a quick roundup of how the operations get mapped from file system
operations to B epsilon tree operations. A read is a range query. A write can become an
upsert. Readdir is a range query. Metadata updates, which we can do very fast, can become upserts.
For example, in Linux there is this thing called atime on a file, and people often turn off updating
it. Basically, every time you read a file it has to update that thing on disk. Such an idiotic
design. I love the rant that you can read on Wikipedia about it: the designers said, let's turn
every read into a write. But hey, now we can actually do it. We can do efficient directory scans.
Only rename, as you have already seen, doesn't map nicely to the operations that I told you about
so far, and so that's a problem right now. In the interest of time I'm going to
skip the details on this. The high-level point is that upserts enable you to write new data to disk
very, very fast. Imagine an application that does a one-byte write into the middle of a page that is
cached, and that page cache is clean. Normally, what the OS would do is it would write that
byte to the in-memory cache, mark it dirty, and then later on write 4 kB of data back to the disk.
What we can do is we just write the byte back to disk. We apply the byte to the in-memory
cache and we say it's still clean. That avoids write amplification, where a small write gets
amplified into a big write. If the page wasn't cached at all we can also avoid having to read the
page in. This is one of the cool aspects of how write optimization can actually change what the
right decision is in the design of your system, whether you should do write-back or write-through
caching.
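Here is a small sketch of that design point, contrasting the two paths. The helper names are
hypothetical stand-ins, not BetrFS functions.

    #include <cstdint>
    #include <string>

    struct Page { char bytes[4096]; };

    Page* page_cache_lookup(const std::string& path, std::uint64_t block);  // hypothetical: nullptr on a miss
    void  index_upsert(const std::string& path, std::uint64_t block,
                       std::uint32_t offset, char byte);                    // hypothetical: blind message into the tree

    // Traditional write-back caching would read the page (if missing), modify it,
    // mark it dirty, and later write the whole 4 kB back. With a write-optimized
    // index we can instead patch the cached copy (if any), leave it marked clean,
    // and send just the one-byte change down as a blind upsert.
    void write_one_byte(const std::string& path, std::uint64_t block, std::uint32_t offset, char byte) {
        if (Page* page = page_cache_lookup(path, block))
            page->bytes[offset] = byte;           // cache stays clean; nothing queued for write-back
        index_upsert(path, block, offset, byte);  // no read-modify-write, no 4 kB write amplification
    }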
Here is the architecture of the system. We basically took the B to the epsilon tree
implementation and imported it as a binary blob into the kernel. This is our VFS API that
translates from file system operations to BetrFS operations. The tree code was user-space code,
so it would expect a file system underneath it, so we wrote a shim layer and then just used ext4
underneath, so ext4 is doing our block management on the disk for us. This is really kind of
amazing. This is C++ code, which in Linux is verboten, and so we just compile it in userland and
then just shove it in there and hope for the best, and it actually works great. Let me tell you
about performance results. Here's where the rubber meets the road. Do we actually get a
speed up for random writes? That was one of the things that we started out with. How do we
do on sequential I/O? We wanted to not sacrifice that. And do we actually see a speed up in
any real world applications? We tested it on a computer. That computer had a disk.
>>: [indiscernible]
>> Rob Johnson: Yeah. We didn't do large experiments.
>>: Obviously not. That's a really tiny disk.
>> Rob Johnson: We're going to compare against several other common file systems. All of the
tests start with a cold cache and all the tests end with sync operations to make sure that
everything is actually on the disk. No cheating there. Here is the time it took us to do 1,000
random 4-byte writes into a 1 gigabyte file on the different file systems. This is BetrFS's time;
lower is better. Log scale. If you work it out and you want to know what the actual
numbers are, it's over 50 times faster. That's a huge improvement in random write
performance by using a write-optimized data structure, which is what you would expect. That's
their bread and butter. This is showing the benefit of being able to do these blind writes.
Small file creation is also basically a small write operation. Creating a new file, you update a
little bit of metadata and then write to a file. The files are balanced across directories so
you don't hit any dumb file-system worst case where it's just using a linked list for its directory
data structure or something like that. What this graph is showing is the instantaneous creation
rate for new files, again on a log scale. This is after it's created yay many files. You can see
BetrFS is creating 10,000 files per second, where other file systems have dropped down to 1,000 or
maybe even something in the range of 100 files per second. Again, this is the kind of
bread-and-butter application for a write-optimized data structure.
>>: Why doesn't zfs do it better?
>> Rob Johnson: ZFS is weird. I don't know. I guess maybe it's warming up its cache. How are
we doing on time? Again, this is a log scale. Sequential I/O, here we don't win. We are reading
a gigabyte file about 40 kB at a time. Our read performance is within the ballpark of a normal
file system, and probably with some optimization we can close that gap. Our sequential
write performance is actually a half or a third of most file systems' performance. One reason
could be that we're writing all the data twice, because our BetrFS implementation does full data
logging, whereas these other file systems are not doing full data logging except for ZFS which I
understand, the students tell me it's doing logging in triplicate. It literally writes all the data
three times. We're actually working on this. We have some tricks where we are going to
preserve the semantics of full data journaling, but we are not going to have the cost. We're
going to write the data one time, but that's the ongoing work. Here is our delete performance
in the version of BetrFS that was described in the FAST paper. As the file gets bigger the time to
delete gets bigger.
>>: [indiscernible]
>> Rob Johnson: So basically have a pointer, like a symbolic link? The degeneracy is when you
have long chains of those links. If you rename A to B and B to C and C to D you don't want to
have those degeneracies.
>>: [indiscernible]
>> Rob Johnson: Yes. But then what happens if I rename a full directory and then rename something
down inside that directory? I've got to think it through, but we have some other plans. We
have fixed the delete performance by implementing, essentially, an upsert that applies to a
range of keys, so it will delete all of those messages. Now, with this, I think we are
actually faster than most of the other file systems with deletes now. The old scaling was
terrible; the new scaling is good. Does it actually benefit any real applications? Did we get the
speed up that we wanted for recursive directory scans? Here we're doing a find which is only
a recursive traversal of metadata. Here we're doing a grep, which is a recursive traversal
of data. Here is time and here is BetrFS on the find and here's BetrFS on the grep. This is a
pretty dramatic speedup for these kinds of benchmarks. You might say find and grep are artificial,
but you might think about things like backup or a virus scan on a computer. These are real
workloads that people care about.
>>: This is a file system with small files?
>> Rob Johnson: This was the Linux source code.
>>: That's a small file, okay?
>> Rob Johnson: Yes. This is because our sorting method, our way of organizing data, puts
directories and their contents close together. Some real benchmarks: here is an IMAP server
and it's doing a bunch of message reads and message markings. I don't actually know the
benchmark in detail. Here is the time to run the benchmark, and as you can see we do quite a
bit better than other file systems. But we are no longer in log-scale territory. The only one that
outperforms us is ZFS, and we have to look into how it does that. If you are rsyncing code
between two directories on the same file system, so this is rsyncing between a BetrFS directory and
a BetrFS directory, this is rsyncing between a btrfs directory and a btrfs directory, then the
megabytes per second that we achieve is quite a bit higher than the others. That's presumably
because we had to do our rsync with the --inplace flag. Otherwise it actually makes a
So we cheated a little bit. But if you do that you can basically issue blind writes to create the
new files and blind writes to write the new data to the file, so it can run really fast. In summary,
did we achieve our goals? Sequential I/O, I'm sorry, random I/O: I would say it's pretty much a
slam dunk. I'm really proud of that performance. Sequential I/O, we've got work to do, but we
have work in the pipeline. Yes, there are real world application benefits in this kind of
performance. Wrapping up, the big picture message for our work is that we believe that it is
possible to have your cake and eat it too. You can have a file system that supports good
random I/O and good sequential reading of that data back. So you don't have to make this
trade-off that you are sort of forced to with something like ext4 which is good sequential reads
but not good on random I/O or something like a log structured file system which is good for
random writes, but not so good at sequential reading. But write optimization so dramatically
changes the performance landscape of the underlying data structures that we need to revisit
design decisions that we have made in the past. We abandoned inodes and organized the data
quite differently. We use write-through caching instead of write-back caching because
writing is so cheap now and we engineered the system to perform blind writes whenever
possible. We think there's a lot of research opportunity here to figure out other ways that
these new data structures can impact systems and the way they can be used to speed things
up. And it's open source. All right. Thank you. [applause].
>>: It seems like a lot of benefits just come from the blind write optimization, which could be
done without the write optimization. You could use blind writes, a block of blind writes on any
file system.
>>: And maybe that would be a good way to get your ideas into mainstream file systems one
step at a time.
>> Rob Johnson: That's actually an interesting point. Your idea would be maybe we would
blindly write stuff into a log or into a buffer in memory?
>>: [indiscernible] keep in the buffer and a memory write.
>> Rob Johnson: One thing I can point to is there is a database that does something like this,
InnoDB, which is the database engine for MySQL, the default one. It has what's called
an insert buffer, and so as data is being inserted into an index it is being stored in memory, and
then when that buffer becomes full it tries to pick some part of the data that is all going to the
same place on disk and just flushes it all the way through, so it just inserts it into a B-tree, a
standard B-tree data structure. And this gets a speedup. InnoDB's are really
good B-trees, and it gets you maybe a three or five times speedup, but it doesn't get you a 30 or
100x speedup in that setting. Maybe we could adapt it or do something a little bit different in
file systems. Yeah.
>>: I think disks are on the way out. [laughter]. There's this new thing called flash that's
coming in, and that changes all of this in the flash world.
>> Rob Johnson: Some of my collaborators have benchmarked this on flash. Martin at Rutgers
is doing that and there are two comments I would say about flash. One, I haven't seen the
benchmark results. One is since this does writes in big chunks like 4 megabytes, it's actually
much kinder to your flash in terms of erase cycles and having to do garbage
collection and all of that work that the flash translation layer has to do. The other thing I would say
is although I don't know the file system benchmarks, I have seen benchmark results of a
database built on the same [indiscernible] backend on flash and it dramatically outperforms a
B-tree on the same flash disk. Even though flash doesn't have the big seek time, there is still set
up costs. So you can get a win out of doing this. The next question usually asked is what about
nonvolatile RAM? Aren't we all just going to have our file systems and everything in RAM? And
there I don't have a good answer for you. That might be a disruptive technology for this.
>>: Yeah, but we stopped believing that.
>> Rob Johnson: Oh, really?
>>: Yeah, we have been waiting for phase change memory now forever, [indiscernible] phase
change memory plus 10 years. So there's the battery-backed set of technologies and a lot of things
that don't seem to work.
>>: Any other questions? Thank Rob. [applause]