>> Jin Li: It's my great pleasure to introduce William Josephson. William
did his undergraduate work at Harvard University and is currently a graduate
student at Princeton University working with Professor Kai Li on how to
use flash drives to build storage systems. Before joining Princeton, William
worked with Data Domain, a leader in deduplication storage systems. He has
also interned in a number of labs such as Sun and Bell Labs, and he has
worked with the Institute for Defense Analyses on a number of occasions.
Today he will give us a talk on how to build a file system for the next
generation of flash drives. Without further delay, let's hear what
William has to say.
>> William Josephson: Thank you. I want to preface this talk by saying it
really is a work in progress. Some of the results are -- is this on? I think
so. Some of this is hot off the press as in last week I was still fiddling with
some of this. A lot of the work is also in conjunction with a startup based
in Salt Lake City called Fusion IO. Their product, which they call the I/O drive,
is actually a flash disk. It's a little bit different from some of the existing
flash drives in that it sits on a PCI-E bus rather than behind a serial ATA or
SCSI bus. But the numbers that I'll talk about are for their device.
But before I dive into details, I want to talk just a little bit about flash. I'm
sure most of you are familiar with it, but I just want to make sure we're all on
the same page. About two or three years ago Jim Gray talked about why
flash, and made the observation that tape is dead, disk is more and more
tape-like, flash has the opportunity to replace disks, and that of course
locality in RAM is paramount for performance.
So why flash? Well, it's non-volatile and it has no mechanical components, so
you don't have to worry about mechanical components scaling, and you don't
have to worry about seek times improving much more slowly than Moore's law. It's
relatively inexpensive and it's getting cheaper. It has the potential for
significant power savings, although in practice I've seen some
controversy over existing flash implementations and their power savings.
When Jim Gray made this observation in 2006, he talked about 6,000
I/Os per second for then-current devices; a combination of innovation in the
packaging of multiple flash chips and in firmware and device drivers has
improved that by basically a factor of 10. So the way I like to summarize it is
if you need -- if you're looking to optimize dollars per gigabyte of storage
you look at disk and if you're looking to optimize for number of dollars
spent per I/O per second, flash is probably where you want to start looking.
So the question a lot of people ask is, well, why not just battery-backed DRAM?
Well, actually flash cost is getting to the point where it's cheaper than DRAM
per byte. Both markets are pretty volatile. I was looking at the spot
market rate for both of these as of last week, and the combination of new
popular consumer devices like the iPhone coming out and just the general
economic situation has caused a lot of volatility. But more to the point,
the memory subsystems that support DRAM are actually pretty expensive
if you have to build a memory subsystem to support a large
volume of DRAM.
And so one way to think of it is that flash is just another level in the memory
hierarchy, kind of inserted between disks and main memory. This graph here is
a little bit out of date, but it shows the price of DRAM and the price of
NAND flash. And not only has NAND flash crossed over, but the gap is
growing, and that's kind of the point I want to make here.
So just again a quick review. Flash is non-volatile solid state storage;
individual cells are comparable in size to an individual transistor. It's not
sensitive to mechanical shock, so it's popular for a lot of consumer devices and
historically popular in a lot of military and aerospace applications.
However, rewriting a block of flash requires a prior bulk erase
operation, so basically you can only set bits, you can't
clear them, unless you do a bulk erase on a
large number of bits. And something that's made a lot of in the literature,
of course, is that individual cells have a limited number of erase or write
cycles.
There are two categories of flash: NOR flash and NAND flash. We're
going to talk primarily about NAND flash. NOR flash allows random access
and it's often used for firmware, primarily because you can execute in
place: you don't have to copy it into RAM before you execute from it. But
NAND flash has higher density and is more typical in mass storage
systems.
Another kind of dichotomy that's an important one is the difference
between SLC and MLC flash. SLC flash is typically more robust, has more
write cycles before it wears out, and is typically higher performing, but it is
lower density. With MLC there are multiple voltage levels in an individual cell
that allow you to encode more than one bit per cell.
The devices I'll be talking about today they have versions that are built with
MLC, but the one that I've been using is an SLC device. Okay. A little bit
more about the economics. The fact that individual cells are simple has a
couple of advantages. It improves fabrication yield, and NAND flash is
typically the first thing to use a new process
technology. So when they shrink die sizes, NAND flash is typically the first
thing that is fabricated on the new line.
Moreover, since blocks of flash naturally fail and that's expected, you have
to be able to deal with failures so chips come from the factory with defects
already on them, and they're marked as defects, so that means the yield is
further improved as opposed to a processor where if you have a fabrication
defect in the ALU, well, you just got to toss -- by and large you have to just
toss the whole die.
And of course the high volume for many consumer applications has also
helped to force down the cost. As for the organization of NAND flash, data
is organized in individual pages, and so read and program operations
happen at the level of a page. A page is
typically anywhere from 512 bytes to four kilobytes. Those pages are then
organized into erase blocks, and erase blocks vary widely in size, anywhere
from 16 kilobytes up to a few devices, such as the Fusion IO, that are
actually using 20-megabyte erase blocks. So they will do a bulk erase on a
whole 20-megabyte region, which consists of many, many pages of course.
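To put rough numbers on that, assuming four-kilobyte pages: a 20-megabyte erase
block holds 20 MB / 4 KB = 5,120 pages, all of which get erased together.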
But you can't reprogram used pages in an erase block until the entire erase
block has been erased. Okay. So some of the challenges in using flash
then is that it's block oriented, so it looks a little bit like a disk in a block in
that sense. Reads and particularly writes occur in multiples of the page
size. Typically you can't program just a part of a page. That's more of an
interface issue, but you typically program the entire page and you erase the
entire erase block. Because erase blocks are bulk erased, updates instead
of happening in place are typically done by copying. So that means that
you introduce a level of indirection and if you want to rewrite the logical
block 5, you take up -- you read in logical block 5 wherever its physical
current location is, make the modification and write it to a new physical
location and then update and index that tells you where -- what the physical
block corresponding to logical block 5 is.
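To make that indirection concrete, here is a minimal sketch of such a
copy-on-write rewrite. The helpers (read_physical, program_page,
alloc_free_page) and the in-memory index are hypothetical stand-ins for
whatever the firmware and driver actually do; a real device keeps the index
in a persistent, crash-recoverable form.

    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    /* Hypothetical logical-to-physical index and device operations. */
    extern uint64_t index_map[];             /* logical block -> physical page */
    extern uint64_t alloc_free_page(void);   /* page from an already-erased block */
    extern void read_physical(uint64_t page, void *buf);
    extern void program_page(uint64_t page, const void *buf);

    /* Rewrite part of a logical block by copying: read the current contents,
     * modify them, write to a fresh page, update the index. The old physical
     * page becomes garbage to be reclaimed by a later bulk erase. */
    void rewrite_logical_block(uint64_t lblock, const void *newdata,
                               size_t off, size_t len)
    {
        uint8_t buf[PAGE_SIZE];
        read_physical(index_map[lblock], buf);
        memcpy(buf + off, newdata, len);
        uint64_t p = alloc_free_page();
        program_page(p, buf);
        index_map[lblock] = p;
    }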
There are a limited number of erase cycles and that requires wear-leveling,
because you don't want to keep hammering on the same physical block;
you want to spread the writes across the physical blocks. A lot was made of
that in the early literature, but in practice it's not a big deal,
because since you're doing copying to support updates anyway, it's fairly
natural to extend that to allow you to do wear leveling, and so that's not a
huge additional issue.
The built-in error correction on a lot of these hardware chips is not
sufficient. Unfortunately what little I know about it is all under NDA
with Fusion IO, but they see some really, really bizarre failure modes, and
it really does require additional error correction either in the firmware or in
the device driver. Good performance also requires both hardware
parallelism and software support. In the device that we'll be talking about
today, that software support is split between the firmware on the board and
the device driver that is running in the operating system. Exactly where
that line should be drawn I think is a research question
in the long term. They've chosen one particular place to draw that line.
You could imagine pushing more of it into the firmware or into a processor
sitting on the board, or you could even imagine pulling more of it up out of
the device and making the device dumber.
But exactly where the right spot is, I think, is open to question. Okay. So I
think maybe the best question is why another file system? There are an
awful lot of them out there. There are a lot of them that are designed
specifically for disks: FFS, the various Linux file systems -- Ext2, Ext3,
and now Ext4 -- SGI's XFS, Veritas's file systems, and of course Microsoft's
file systems.
FAT is probably the most common one on flash at the moment, just
because that's what's used in embedded devices, but there's a wide variety to
choose from. Obviously most of these file systems are really
designed for disks and not for flash, so you have a
layout and a block allocator that were designed for disks and not for
flash. And moreover, the firmware on the flash disks is already
implementing a level of indirection to support wear-leveling, copying and
block allocation. So you have two allocators basically: you're running the
file system allocator on top of the flash device's allocator, which doesn't
seem ideal.
There are also a number of file systems designed specifically for flash,
including JFFS and YAFFS. JFFS is designed specifically for NOR flash, the
other for SLC NAND. They are log structured and they implement a
lot of the features that I just talked about, but they're really intended for
embedded applications: they are interested in limiting
memory footprint and dealing with small systems in general. So that
means in practice people are running things like Ext2 or Ext3 in an
enterprise environment, a server environment, and so you end up with two
allocators, and that's an opportunity. Okay.
So the idea behind our file system is, instead of running those two
storage managers, to just let the device do it. It already has a
storage manager, so why not let it do the work? The file system remains
responsible for directory management and access control, but the flash
disk is responsible for figuring out what blocks of flash to
allocate to a particular file and doing all the copying when we want to
rewrite blocks, and so on. I'll talk a little bit toward the end about where we
anticipate going with this next, and the longer term question is what
the storage interface should look like. In this case we've taken advantage of a few
features of an existing flash disk and used that to build a file system.
But there's no reason that we have to think of flash as a disk, accessed
through a traditional block-based disk interface. In fact, the Fusion IO device
sits on the PCI-E bus and uses 16 lanes of PCI-E to talk to the operating
system, and it exports a disk interface. But that disk interface is
actually exported by the device driver running the device; the driver is not
speaking to the device across the PCI-E bus using a block-based disk
interface.
So it may well make sense to expose more of
that interface to individual applications. The reason they haven't done that
yet is that if you're a startup you need to be able to sell devices right
away, and everybody is already set up to use a block interface, and so
that's the natural first interface to use. But that doesn't mean that even
Fusion IO thinks that's the best interface in the long term, especially
for high performance. Okay.
So what, at a bare minimum, does this file system that I have described in
very general terms require? Well, it currently relies on four features of the
flash disk. One, and perhaps the most important, is that it relies on there
being a sparse block or object-based interface; the reason we settled
on the name is that it's a virtual storage file system, so there's a level of
virtualization going on there. You can set up the Fusion IO disk to
use 64-bit block addressing. So we have, let's say, 160 gigabytes of flash, but
we can address that with a full 64 bits of address space, and the device driver
and firmware layer will figure out how to map that sparse 64-bit address
space onto 160 gigabytes of actual physical flash.
So what we can do is use those extra bits of address space,
partition them, and treat them as an object identifier and an offset within an
object. It's kind of a crude approach, but they haven't exported
their object-based interface outside the device driver yet.
Another thing that we depend upon the flash disk for is that block
allocations are crash recoverable. What I mean by that is that
if you pull the plug on the flash device, it will come back up in a
consistent state. It may forget about a write that you have made,
but you don't have to worry about the consistency of
the index mapping the virtual address space to physical blocks on the flash
device. So not only have we delegated the block allocation, we've also
delegated a lot of the functions that need to go into crash recoverability in
a file system.
Not all of them, because there are still the file system metadata and directory
data structures and so on, but for individual block allocations we don't have to
concern ourselves with logging. Now, a third feature, which is one I'm not
going to talk about today because I'm still working with the
engineers at Fusion IO on this part of it, is that they have an atomic multi-block
update, which means that, because they're already doing logging, I
can arrange to have updates to two separate logical blocks
that either both happen or both don't happen. And that's useful for doing
directories.
As I said, this is work in progress, and since that isn't fully implemented,
I'm not going to talk about directories so much today.
>>: [inaudible].
>> William Josephson: That's right. So they can be two logically
discontiguous blocks. Where they end up physically, I don't know, because that whole mapping
is in a black box. I think actually in the longer term the ability to do this
and possibly even to do it in a distributed fashion, they've talked about
having multiple of these devices and offering a distributed primitive for that
is very interesting. Whether the latter is really practical is open to
question. But on a single device they certainly can do it and it could be
very interesting for a lot of applications to have that atomic update.
And then the fourth thing is what they call a trim. This actually does exist
on a lot of flash devices already. Basically you say, I'm not going to use
this logical block or this range of logical blocks anymore. So, for
instance, truncate in the file system is implemented by saying: here is the
range of blocks that represents this file; you can throw them all away, and
I want you to pretend that these are
unallocated. So the garbage collector can come along and reuse the
physical blocks that correspond to that logical block range.
>>: [inaudible] try to summarize [inaudible] and when you talk in
[inaudible] system, so I mean, this give basic interface like a large set of
virtual blocks. Do you understand? I mean, maybe this is let's say to the
power of [inaudible] gigabytes --
>> William Josephson: Right.
>>: And [inaudible] 160 bit address space. But there's only a few -- only a
small number of blocks [inaudible] use.
>> William Josephson: That's right.
>>: And I mean, you can make sure it's crash recoverable, you can do
[inaudible] update and you can say okay use these blocks --
>> William Josephson: That's right.
>>: [inaudible] I mean conceptual space --
>> William Josephson: That's right.
>>: [inaudible].
>> William Josephson: That's right. To put it in very concrete terms --
and I find it often useful to have an implementation in mind
when I think about it; it may not be how it's actually implemented, but just
an implementation to keep in mind -- you can think of it as an index that
maps virtual to physical, implemented maybe as a B-tree with write-ahead
logging. That gives you a lot of these properties and is one way you could
imagine implementing it. Whether that's the right way is
open to question.
Okay. So I think this next bullet I've already talked about in the context of
the first four. Okay. So more concretely, how do we represent files in this
new file system? Well, a file is represented by an object or a sparse block
range. It may be that the object interface is better in the longer term; I'll
explain why I think that might be the case later. But suppose that the device
has a sparse 64-bit block
address space; with 512-byte blocks, that's a 73-bit
byte-level address space. We reserve the most significant 32 bits of the
block address to represent an inode number and the least significant 32 bits
to represent a block offset within the file.
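A minimal sketch of that packing, assuming the 32/32 split just described; the
flash_trim call below is a hypothetical stand-in for the driver's trim
operation, which is how truncate is implemented:

    #include <stdint.h>

    /* Sparse 64-bit block address: high 32 bits are the inode number,
     * low 32 bits are the block offset within the file. */
    static inline uint64_t block_addr(uint32_t inode, uint32_t blkoff)
    {
        return ((uint64_t)inode << 32) | blkoff;
    }

    static inline uint32_t addr_inode(uint64_t addr)  { return (uint32_t)(addr >> 32); }
    static inline uint32_t addr_blkoff(uint64_t addr) { return (uint32_t)addr; }

    /* Hypothetical driver call: tell the device that a logical block range
     * is no longer in use, so its physical pages can be garbage collected. */
    extern void flash_trim(uint64_t first_block, uint64_t last_block);

    /* Truncate-to-zero becomes a trim of the file's entire logical range. */
    void truncate_file(uint32_t inode)
    {
        flash_trim(block_addr(inode, 0), block_addr(inode, UINT32_MAX));
    }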
A very simple, very simple approach. Create and truncate require you to
update directory metadata, and as I said, you use trim to implement
truncate. At the moment, since the atomic multi-block update
isn't available, we do actually have to do a little bit of logging here; that's
not something we anticipate needing in the long term, just until it's
implemented in the device. Writes get crash recovery by delegating to the
device, and the file contents we make, as is often the case with file systems,
the responsibility of the application.
So that means that if you -- the file system guarantees that when you come
up you're not going to get blocks from another file, but it doesn't mean that
you'll necessarily get all the blocks that you thought you wrote to the
device unless you call fsync or make it all consistent. So there's -- the
mapping is guaranteed, but if you have -- if you haven't waited for all the
I/Os to come back, then they may or may not have made it. But that's fairly
typical for file systems. As I said, directories are a work in progress, and
directories aren't implemented the way we'd like them to be yet, pending
some software work at Fusion IO. Our
current thought is to implement them as sparse hash tables rather than to use
the FFS approach of having basically just a list. FFS, as you recall -- or
UFS, as you recall -- just keeps a list of entries that have a file type,
a file name, and an inode number, and a directory
is just a file containing these little entries.
It doesn't scale very well, and a lot of file systems are using B-trees. It
seems that, given the fact that we already have this sparse address space for
files, it might make sense to just hash the file name and then use that hash
as an index into this sparse address space for the file. But I can't say
anything about the performance of that yet. Okay.
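As a rough sketch of that idea, assuming the same 32/32 address packing used
for files and a hypothetical read_block helper; the hash function and entry
layout here are purely illustrative, not what the final design will use:

    #include <stdint.h>
    #include <string.h>

    /* One directory entry per hashed slot; inode 0 means "slot empty".
     * The device only materializes blocks that have actually been written,
     * so the table can be very sparse. */
    struct dirent_slot {
        uint32_t inode;
        uint8_t  file_type;
        char     name[251];
    };

    static inline uint64_t block_addr(uint32_t inode, uint32_t blkoff)
    {
        return ((uint64_t)inode << 32) | blkoff;
    }

    static uint32_t name_hash(const char *s)   /* FNV-1a, purely illustrative */
    {
        uint32_t h = 2166136261u;
        while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
        return h;
    }

    extern int read_block(uint64_t blkaddr, void *buf, size_t len);

    /* Look up `name` in directory `dir_inode`: hash the name to a slot,
     * use the slot number as a block offset in the directory's sparse
     * range, and probe linearly on collision. */
    int dir_lookup(uint32_t dir_inode, const char *name, uint32_t *inode_out)
    {
        struct dirent_slot e;
        uint32_t slot = name_hash(name);
        for (int probes = 0; probes < 64; probes++, slot++) {
            if (read_block(block_addr(dir_inode, slot), &e, sizeof e) != 0 ||
                e.inode == 0)
                return -1;                 /* unpopulated slot: not found */
            if (strcmp(e.name, name) == 0) {
                *inode_out = e.inode;
                return 0;
            }
        }
        return -1;
    }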
Since the basic idea is fairly
straightforward, I'm going to talk at a little greater length about what kind of
performance we've gotten from this and then what that might say about
what a better interface to flash would look like. The evaluation platform is a
fairly recent version of Linux, running on a four-core machine with four
gigabytes of DRAM. The four cores will actually show up in the performance
numbers; there will be some cases where the number four appears. So that's
something to remember.
The Fusion IO device is 160 gigabytes, formatted, of SLC NAND. There's
actually of course more flash than that, but that's the physical space
available for data. It sits on the PCI-E bus, the advertised hardware
operation latency is 50 microseconds, and the theoretical peak throughput is
120,000 I/Os per second. As we'll see, we don't get very close to that,
and there are a number of reasons. One I'm aware of is that there are some
locking issues in the device driver, but it's open to question just how close
we can get.
I mean, even if you just open it as a raw device and do block I/O directly
to the raw device, you're not going to get 120,000 I/Os per second at the
moment.
>>: So you how [inaudible].
>> William Josephson: Well, as we'll see in a few slides, it depends on
whether you're doing reads or writes. There is a big difference there. For
reads something in the low to mid 90,000 I/Os per second is achievable.
Another issue I'll talk about a little bit more is that it also depends on how
much concurrency you have. Single-threaded performance is very different from
multi-threaded performance, because basically there's a pipeline,
and you need to fill the pipeline. So the actual latency,
as opposed to the theoretical latency, is not ideal, but if you fill up the
pipeline you can get a lot of I/Os per second in aggregate.
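A rough way to see that tradeoff: at the advertised 50-microsecond operation
latency, reaching the theoretical 120,000 I/Os per second requires about
120,000 x 0.00005 = 6 operations in flight at all times; at the real, higher
latencies you need correspondingly more outstanding requests, which is why a
single thread issuing one I/O at a time falls so far short.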
>>: [inaudible] utilization?
>> William Josephson: CPU utilization is actually something I do want to
talk about, because that's one of the advantages of the simpler
approach: we actually can get slightly better performance at slightly
lower CPU cost. I think that's what makes it a little bit interesting.
>>: Is that drive there then they're using the PC card slot?
>> William Josephson: It's not a -- you know, it's a half PCI-E form factor.
It's not -- PC card to me means a little thing that fits into --
>>: [inaudible].
>> William Josephson: Yeah, it's an actual slot on the motherboard. They
actually have a couple different form factors.
>>: I've seen an [inaudible].
>> William Josephson: They have -- they also have one that fits into -- with
HP that would deal with HP. They got one that fits in on a slightly different
form factor PCI-E bus on HP motherboards.
>>: You said earlier debating between the process of which [inaudible]
device first, the host? In your slides both files and directories, is that all --
>> William Josephson: That's a great question. I'll try to answer that a little
bit later. Let me start stepping through that. But that's a good question. I
think that there are -- there are kind of three components. There's the
hardware, which the hardware parallelism is one part of the magic sauce
and that obviously was on the PCI-E bus. There's firmware and actually a
power PC on the device. And the firmware you know, is a question of what
belongs in firmware as opposed to the device driver. And then above the
device driver you have the file system. The question is
where to draw the line, and what we've decided is to take a lot of what's in the
file system and push it down into the device driver at the moment. But you
could imagine then taking what's -- much of what's in the device driver and
pushing it even farther down on to the card running on that power PC.
>>: I only ask because it seems like direct [inaudible] user perception --
>> William Josephson: That's right.
>>: Why wouldn't that live on the host?
>> William Josephson: That probably would live on the host. But one
thing you could imagine doing is, once you set up access to a file, deciding then
to do RDMA to the file itself and only doing access control and so on in the
file system. And I think that that's ultimately -- especially when we have
multiple of these cards sitting on a chassis and maybe shared -- that's where
we're going, I think.
>>: [inaudible] anything from the [inaudible] that gives them the data?
>> William Josephson: So not explicitly. I'm not --
>>: [inaudible] parallel [inaudible]?
>> William Josephson: I'm sure it is. I mean, I think -- I think this is a
natural thing to do. I have been talking with some folks at Sandia
[phonetic], and they claim to have tried to do some similar things, not with
Fusion IO's devices and not with flash, and not to have seen a huge speedup by
doing this RDMA trick. But it seems like a natural thing to do. So at least from
a research perspective I think it's worth exploring some more.
Let me step a slide or two, and then we'll have time for more questions.
Okay. So for the preliminary performance evaluation
I use IOzone, which is a fairly well-known UNIX tool that
just does a whole lot of I/Os with a certain distribution, and we'll take a look
at that. One of the things, as I said, is you have this latency versus throughput
issue with the device, so we'll take a look at I/Os per second as a function of
the number of threads. I'll also compare with a couple of existing commonly used
file systems on Linux. Just for those of you who aren't familiar with them: Ext2
is pretty much like the old UNIX file system, and if you crash you can be
screwed and you probably will be. With Ext3 there's journaling, and
Ext3 is more or less the semantics that we're trying to provide.
As I said, directories aren't fully implemented so we don't quite meet that,
but I'm also not going to present numbers on the directory performance, so
I think it is a fair comparison.
And the other issue -- I can't remember who asked this, but you
know, it was the question of what's the CPU overhead. I think there
are two issues to remember here. One is that there's a fair amount going
on in that device driver, and so one reason for
being interested in pushing it down is to offload that from the
host CPU. And then there's also the question of what the
impact of just the file system code is, let alone the device driver.
I'll also talk a little bit about mmap, and in that context I'll also talk
about context switches: we're seeing a difference in the number of
context switches between these different file system implementations. And
then finally I'll talk very briefly about building a 64 gigabyte hash table
on flash and what kind of performance we're seeing with that. So it's kind
of a neat little thing there.
>>: Can you explain, for the Fusion IO device, what is the interface that
the computer [inaudible] talking to that device? [inaudible] hash table or
something?
>> William Josephson: Let me see. I have to think about this a moment,
because a lot of it's under NDA unfortunately. I think the simple answer is
that there's a proprietary interface between the device driver and the
hardware device. A block device interface exported from the device driver
to the kernel and I'm talking with the Fusion IO folks about trying to figure
out if we were to export some of what is currently that proprietary interface
what should it look like as opposed to what it currently looks like. I think
the answer is that like many startups it's maybe not the most polished
interface, it's the interface that works. I'm not sure that is a satisfying
answer, but --
>>: I think the content of the [inaudible] you can I mean just I mean
[inaudible] hash table or something --
>> William Josephson: I think the tree, the model that I gave you of it being
a tree, is a much better model to think about, because some of the locality
you tend to get with the tree is actually an important part of how they get
good write performance. As you observed
when we were talking earlier, flash has relatively poor write performance,
but if you get sequential updates, you get much better performance, even
with flash, and so there's a lot of work going on using a B-tree as the index
to try to get sequential updates. So it's log structured. Does
that answer your question, maybe?
>>: I think I mean still I may be a bit confused about [inaudible] your
contribution or the file system and what's [inaudible]. That's basically.
>> William Josephson: Fair enough. So I think --
>>: [inaudible].
>> William Josephson: So the ultimate goal is to figure out how to use
this device for things like databases and running Oracle, and that's the next
step actually. We have some [inaudible], we have a simulator, we have some
ideas about what that should look like, and I need to finish the
implementation before I can tell you what the answer is.
Right now the observation is, as I'll show you, that given the work that
has to be done to make a flash device go and perform, it's probably
best to get out of its way. And so a very simple file system -- I mean,
this file system is less than 3,000 lines of code, compared to Ext2, which is
18,000 lines of code -- means that one person working alone, who doesn't
really know the Linux kernel that well as opposed to other
kernels, can in a month write a file system that actually will perform better
than existing ones.
>>: [inaudible] Fusion IO [inaudible].
>> William Josephson: A lot. So I mean, that's a good point. But what
I'm saying is I think that it makes sense to separate out the block
allocation component. They are trying to provide a high
performance device, and they have a particular layout on the device -- you
know, they have a particular hardware architecture on their device -- and so
they can tune their device driver specifically to that piece of hardware,
whereas if I'm writing a file system for a commodity operating system, I
want to work with a lot of different devices. So you can innovate in
the device driver and in the hardware and the firmware separately from the
file system, and my argument is that that's the right place to do it,
not in the file system, for this type of device.
>>: [inaudible] dealing with this, throwing away all this [inaudible] in the
operating system that you have. [inaudible] there's two allocators and
you're saying toss one away, and it seems like two options are you toss
away the flash allocator and you say let the operating system try to make
good decisions about allocation because it knows about the [inaudible]
structures or you toss away --
>> William Josephson: Well, I --
>>: [inaudible]. And you --
>> William Josephson: I agree with you in general.
>>: I was just asking.
>> William Josephson: I argue that the file -- that the operating system -- so
for one thing, it is possible to still ask the device driver a lot of the same
questions. I mean, you actually have access to it: if you
wanted to, you could export access to its internal data
structure and look into this tree that's representing a range of blocks. That's
one. Two, how valuable are the block allocator data structures for
introspection in general in a file system? One of the few cases
where I could see that you really do want to look in is to find holes in a
file when you do backup. And it is possible to do that with the device
driver: you can say, enumerate all the actually existing pieces -- here's a
block range, enumerate all the parts that are actually populated. So you still
have that; you can still peer into it in that sense.
And, yes, the operating system could make good decisions but the
operating system in general doesn't know and the company building the
flash device isn't going to tell them how to figure out how to do allocations
in a way that performs well for their particular device. That's part of what
they're selling. In fact, if you look at Fusion IO, they're going to say that
their intellectual property that's valuable is figuring out how to build the
hardware parallelism, one, but perhaps more important how to develop the
firmware and device driver that gets you good performance on that system.
>>: [inaudible] and then you have no reason to differentiate [inaudible].
>> William Josephson: Right.
>>: So [inaudible].
>> William Josephson: And so this allows the user to innovate separately.
You may not buy it, but that's my argument.
>>: One of the other conundrums with pushing functionality in the device,
classical conundrums is as you, you know, push more and more potentially
coding the device, right, by the time they get that done [inaudible] gets to
market maybe a couple generations behind.
>> William Josephson: And that's the --
>>: The CPU that you hang it on.
>> William Josephson: That's a very good point. And so the question is -- and
right now actually there isn't as much pushdown into the device driver as
you might think -- I mean, down into the device; a lot of it's actually
running in the device driver on the commodity CPU. And as I said,
you know, one may argue it may be a useful thing to do. I think that the
counterargument is precisely the one you made: pushing
the whole file system onto this device, which probably is running an
embedded processor several generations behind, may not make sense.
Okay. So this is the pretty picture, maybe, but really what I want you to get
out of it is that in each group we have 1 through 64 threads in powers
of 2. The first three bars, the blue ones, represent write performance
for different kinds of writes, and the red represent read performance for
different kinds of reads. And what you see is that write performance peaks
around 16 threads and read performance flattens out between 32 and 64
threads.
And part of the reason for that is that eventually the garbage
collector hits a wall when you get enough throughput, and that's around 16
threads. With read performance you basically run into limitations of the
latency per operation and the depth of the pipeline. There's
not a lot deep going on here; I just wanted to show you that the
sweet spot for writes is not the same as the sweet spot for reads.
>>: So [inaudible] sequential --
>> William Josephson: So the purple line, the farthest to the left in each group,
is the first write to a file. The file hasn't been populated yet. The next one
is rewriting sequentially. The next one is writing a random -- doing random
writes. And then you have basically the same things for reads in the next
three.
>>: [inaudible].
>> William Josephson: These are all with 4K and you're bypassing the
buffer cache in each case.
>>: [inaudible] is --
>> William Josephson: I'm sorry?
>>: The OS here is relevant.
>> William Josephson: As I said, when -- all this evaluation is Linux 2.6,
recent Linux 2.6. That's what I have a device driver for. And they actually
do have a Windows device driver now, I think, but I haven't used it. Okay.
So again since the -- as I alluded to earlier, a lot of our interest is in things
like database performance where direct I/O dominates, that is I/O that
bypasses the buffer cache dominates. So most of what I'm going to be
talking about will be bypassing the buffer cache. And an important thing
that I want to emphasize here. The way IOzone works is that when we
say we have one thread -- well, let's say we have four threads, that means
there are four processes and there are four files. Each process gets its
own file to do I/O to. And then we'll see an artifact due to that in a little bit.
And again, these are all average numbers with -- I have not -- I haven't
reported the standard deviations here, but they're fairly small.
>>: [inaudible] although a [inaudible] cache it might still advertise large
writes?
>> William Josephson: That's right.
[brief talking over].
>> William Josephson: Well, so there are two issues here. One, if you look
at something like Oracle -- I can't speak to other databases, but for Oracle,
unless it's a blob, they're typically going
to do 4K or 8K writes to the cache; for blobs, that's right, it's a different issue.
The other thing is a number of parameters here, and I decided on a slice.
I've tried to make that as fair as possible but at some point you're just
going to have to believe me that I'm not hiding something.
>>: [inaudible].
>> William Josephson: I have not. I've just used the defaults. And in
fact one reason I'm not reporting Ext4 numbers is that I've seen really bad
Ext4 numbers and I don't understand why yet. And I didn't feel comfortable
making fun of them until I understood whether it was my fault or theirs.
Okay. So this is with one thread, and you'll see that there's small
improvements for write and also small improvements for read. With four
threads, again, small or no performance improvement; actually, on the write
side this is a little bit of an anomaly, and the reason, I believe, is that you
have four threads in four processes, plus there's also a garbage collection
thread that's running in the background, and as far as I can tell the
scheduler is thrashing a bit.
It kind of goes away as you get oversubscribed with the number of
processes versus physical processors. Okay. And here's with 16 threads.
So on the write side you see a modest performance improvement. These
are, remember, presented in thousands of I/Os per second, and the new
file system is the red bar. Ext2, which doesn't provide crash recovery
guarantees, is blue, and the green is Ext3, which is not set up for full
journaling; it's just the default configuration under Red Hat's
setup. And you'll notice with read you basically aren't
getting any performance improvement to
speak of.
>>: [inaudible].
>> William Josephson: How far? More?
>>: More, yes.
>> William Josephson: Yes.
>>: [inaudible] I know this, I mean except in large basic performance, this
EXFS have the slight [inaudible] so there's two parts, right, I mean the
basically the next three is read, read, read, and [inaudible] first read is
write. Now, in the beginning you talk about the basically [inaudible]
essentially implementing this as an [inaudible]. From my point it will be a
large [inaudible] basically [inaudible].
>> William Josephson: That's right.
>>: And with some interface separating [inaudible]. Right? And I mean
the read, the comparison is basically Ext2 or Ext3 implement on the flash.
For the read performance I assume I shouldn't see too much performance
difference because all this system in a sense is you find the flash block to
read and then basically to just [inaudible] from the flash drive. So I expect
the performance more like the third basic [inaudible].
>> William Josephson: That's right.
>>: And read is basic flat. Could you explain why -- I mean, you were even
against something during [inaudible].
>> William Josephson: It's a good question. I'm not sure I have all the
answers that I'd like on that. Part of it is that with the
new file system, if I get a logical read I can compute exactly what flash block
to look for, and with Ext2 and Ext3 I can't do that in general because I have to
go look through a bunch of indirect blocks.
>>: Okay.
>> William Josephson: And look that up.
>>: So maybe [inaudible] Ext3's case, I read a first read I know and then I --
>> William Josephson: That's right.
>>: I could basically --
>> William Josephson: And it may be that even if -- even if those -- if you
think of it as a tree, those internal nodes may actually be [inaudible], I'll find
them and do some locking to access them. I think also that,
because this new file system is so simple, I can actually get away with
simpler locking in the DFS interface in Linux, and I don't know whether that's
an inherent thing or not; I really couldn't tell you, unfortunately. It's
something I need -- I need some more introspection into what's going on in the
kernel. But my suspicion is that it's a combination of locking and not needing
to look at indirect blocks; I can compute exactly what block to request from
flash. The mapping is just, you know, a multiplication and an addition rather
than looking through some blocks.
>>: I think it's the right [inaudible] file system is the [inaudible] versus Ext
[inaudible] with beta structure.
>> William Josephson: I think that the write side is generally more
interesting for two reasons. One, write performance is typically lower on
flash anyway, and that's really what most people are worried about when
they look at flash is they're worried -- more worried about write
performance than they are read performance, particularly random write
performance.
>>: Is there some way that you could --
>> William Josephson: That's something I've been talking to the
developers at Fusion IO about is there a question of what should the
priority of that thread be and also whether or not it would make sense to
start pinning these things so they don't bounce from one processor to the
other. And they report to me that in some of their tests, by pinning it or
manipulating the priority, they do get somewhat better performance,
but I don't have that version of the driver at the moment.
Okay. So this is -- I can't remember who asked this question, but it was
some question about CPU overhead, and I want to give you a sense of it.
Remember there are four processes, so under Linux that means you can have up to
400 percent CPU usage, not 100 percent -- just so we make sure we have our
units right. Across the operations we've
been looking at, typically the CPU utilization is somewhere between one
and a half and three and a half percent of CPU for every thousand I/Os per
second delivered. What I'm looking at when I say that is: in
UNIX you can get resource usage and ask for wall time elapsed, user
time elapsed, and system time, where system time is time spent in the kernel
on behalf of your request. So what I'm looking at here is user plus
system time normalized by wall time, per thousand I/Os per
second. What we see is particularly pronounced for
lower concurrency, but again we're looking at 4K direct I/Os, and in
this particular table we're not looking at any change in the
number of I/Os per second delivered, we're just looking at the
change in CPU utilization when moving from Ext3 to the new file
system, DFS. In addition to reducing the CPU, you're
also in general getting better performance; this table doesn't reflect the fact
that you're getting better performance, just that you're getting that better
performance at a lower cost as well. This is just the change in how much
CPU is used for 1 to 16 threads, for reads and for writes.
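A sketch of how that metric can be computed for one process, assuming you
sample getrusage around the benchmark run (the 400 percent figure above comes
from summing across the four processes):

    #include <sys/resource.h>
    #include <sys/time.h>

    /* Percent CPU (user + system) per 1,000 delivered I/Os per second,
     * measured over one run of `ios` I/Os taking `wall_seconds` of
     * wall-clock time. getrusage() reports cumulative CPU time for the
     * calling process, so call it before and after the run and pass the
     * two samples in. */
    static double tv_seconds(struct timeval tv)
    {
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    double cpu_pct_per_kiops(const struct rusage *before, const struct rusage *after,
                             double wall_seconds, long ios)
    {
        double cpu = (tv_seconds(after->ru_utime) - tv_seconds(before->ru_utime))
                   + (tv_seconds(after->ru_stime) - tv_seconds(before->ru_stime));
        double cpu_pct = 100.0 * cpu / wall_seconds;
        double kiops = (ios / wall_seconds) / 1000.0;
        return cpu_pct / kiops;
    }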
So again there's a little bit of something funny going on with four threads,
and I think that, you know, that's something that perhaps by looking at
priority of the driver thread or pinning threads we might be able to address
that further. But aside from that, there's a fairly clear trend. So it's cheaper
as well.
>>: [inaudible].
>> William Josephson: I think that we'll get some insight into that
when I look at this hash table, at least for the read side. And it looks like a
lot of what's happening is that you go to ask for a lock and you can't get it,
and you get rescheduled.
>>: [inaudible].
>> William Josephson: This is actually a very, very simple approach.
Literally each I/O request goes all the way down to the device and all the
way back up, and however many threads is shown in the left column doing
this. And there's nothing fancy going on at all.
>>: I don't know [inaudible].
>> William Josephson: So one problem that I have seen -- so there are two
approaches that you could imagine using in Linux. One is
just having individual threads issuing read or write system calls, and that's
what I'm showing here. The other option is the POSIX
asynchronous I/O interface. The asynchronous I/O interface actually comes
from Oracle; they are the ones that really pushed for that, especially in
Linux, and that's what they often use. And it turns out that there are some
problems with the AIO implementation, and I don't know whether they're
in the kernel or in the device driver yet. This is a common complaint,
particularly with Fusion IO, and there's some work to be done there from an
implementation standpoint. Because AIO actually delivers less
performance with Fusion IO's devices across all file systems than a
multithreaded approach at the moment. And that does need to be
fixed.
>>: [inaudible]. [laughter].
>> William Josephson: I don't -- I mean, it's one of those things where I
don't know the answer. So --
>>: [inaudible] context switching overhead and --
>> William Josephson: I think that --
>>: You use [inaudible].
>> William Josephson: No, this is all SLC. Fusion IO
does have some prototypes using MLC NAND, but for one thing a lot of
their customers are enterprise customers, and they don't want to see MLC,
because although it's cheaper and higher density, it has higher failure rates
and a lower number of write cycles.
>>: Just [inaudible] one more second.
>> William Josephson: Sure.
>>: So that 1.5 to 3.5 percent CPU is per CPU.
>> William Josephson: No, it's -- so, as I said, this is a little strange. The
way Linux works, when you ask for CPU utilization, if you have four
processes you can have 400 percent CPU utilization. So another way of
thinking about it, this is one and a half to three and a half out of 400.
It's a little bit odd. Perhaps I should have renormalized them,
but I haven't, because that's the way that the operating system reports it.
>>: [inaudible] 400 percent.
>> William Josephson: Well, the reason I didn't
necessarily want to normalize this is also that, of course, it's not like you
suddenly have 400 percent available; the 400 percent is a little misleading,
because to get 400 percent you've got to figure out how to divide your job into
four pieces that are parallelizable. Okay.
So this is from mmap, and actually you see a little bit different story with
mmap. In this graph, what I'm showing is I/Os per second
delivered, CPU per I/O, and wall time. So for I/Os per second we're seeing
an increase: when you move from Ext3 to VSFS, you
see, in the first case, rewrite, 31 percent more I/Os per second
delivered. You'll see a 38 percent reduction -- so the sign is different -- 38
percent reduction in CPU per I/O. And a 24 percent reduction in the wall
time.
And again we're looking at rewrite, random write, reread, and random read for
one and two threads. I have similar numbers for one, two, three, and four
threads. Beyond four threads, Linux fell over, and I haven't
tracked that one down; it just seems to be a bit more than it can
take, some bug in the kernel. So with mmap the performance
difference is actually more significant. I did exclude the first write with
mmap, because usually, at least with a UNIX [inaudible] system, it's a really
bad idea to do a first write with mmap; it tends to really scramble the file
system. You really want to fill the file with zeros and then do the mmap.
So in this case -- that's a great question -- we're looking at a 32 gigabyte
file on a machine with four gigabytes of RAM. That's
something I should have had there. So the file is fairly large
compared to the amount of DRAM. Okay. So the last thing I
want to talk a little bit about: microbenchmarks, at least to my way of thinking,
are mostly to convince yourself that there might be something there, not
to prove that there's something there. I have a few more
realistic microbenchmarks from Sandia [phonetic] that I haven't had a
chance to run yet that are more of an HPC nature. But one of the things
that we've been doing in our work is looking at -- we're interested in very
large data sets in general. Some of those are text data sets and many of
them are not. But a common problem in both cases is how to build an
index for that data set, particularly when the data set may be so large that
the index doesn't fit comfortably in DRAM. So we took a look at just one
particular one. In this case, we used the Google n-gram corpus. These are
n-grams -- words found on the web -- and for each n they actually
have one-, two-, three-, four-, and five-grams, and they have the n-gram and
its number of occurrences on the web. This is used a fair
amount in a variety of machine learning and computational linguistics
problems. And a common problem is that it's just too big to fit in DRAM on
a workstation.
There are 13 and a half million one-grams and 1.1 billion five-grams. And so
what's fairly common to do is to take the actual n-gram and map it to an
identifier; the identifier can fit in 24 bits because there are 13 and a half
million one-grams, and that gives you a 15- or 16-byte identifier for
a five-gram. The memory footprint of the result is pretty close to 26
gigabytes of data. By being clever with your encoding, you might be able
to reduce it further, but it's large. And it's too large for most workstations.
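The arithmetic, roughly: 13.5 million unigrams fit in a 24-bit identifier, so a
five-gram key is 5 x 24 bits = 120 bits, or 15 bytes, padded to 16; with, say,
an 8-byte count, that's about 24 bytes per entry, and 1.1 billion entries times
roughly 24 bytes is on the order of 26 gigabytes.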
So one approach that a lot of people have taken is to use some
approximation method and have approximate queries: they will take a
five-gram and submit it to their index, and they'll get an answer
with some probability of error.
Another approach that a lot of people have taken is just get a bunch of
machines with enough DRAM in aggregate and then broadcast the query or
figure out an assignment of n-grams to machines and send the query to the
right machine.
Another approach of course is it's still small enough that you can easily fit
it in memory if you buy a large enough machine. There are after all SGI
machines with a terabyte of DRAM that you can buy. But our observation
is that in general memory subsystems are expensive. It's not just the cost
of DRAM. When you look at this, you have to consider not only the cost of
DRAM but also the cost of the memory subsystem that goes with that.
And then another approach, and the one I'm going to talk about, of course,
is that you can have a workstation with a moderate amount of DRAM -- two,
four, six gigabytes -- and a flash disk, and put the index on the
flash disk. So that's what we did. The design for this is really fairly
straightforward; I'm sure that -- well, I know that one could do a more
optimized design. You just divide a large hash table into fixed-size buckets of
four kilobytes and sort all the keys in each bucket. I can precompute the
occupancy histogram using a single hash function, so I know how big it
has to be and I can guarantee that there are no overflows, because it's not a
case with dynamic updates: I have the full set of keys in advance.
I can keep a small cache of these blocks and pin them in memory to avoid
copying to the client, so I can just lock a cache block into memory
and hand a pointer to the client. And I use either a clock or, in
this case, random replacement to avoid a single lock on an LRU
chain. Obviously this is a well-studied problem; we know how to
parallelize it better than this, but depending on your query distribution, it
may be that random is just fine. If you have a very low hit ratio anyway,
why go to the trouble of doing something sophisticated? And, two, a lot
of the time these sorts of applications are not written by systems folks.
You have to remember that a lot of these are done by people who want to
solve a machine learning problem or an HPC problem;
the scientists or machine learning people are
not systems people. So I think in the long term the question is whether there
are other primitives that we can provide to them so that they can get
better performance without having to implement something sophisticated.
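Here is a minimal sketch of that bucket layout and the lookup path, assuming
the 16-byte-key, 8-byte-count entries from the n-gram index; read_bucket and
hash_key are hypothetical stand-ins for a 4 KB direct-I/O read (possibly served
from the small pinned block cache) and whatever hash function is used:

    #include <stdint.h>
    #include <string.h>
    #include <stddef.h>

    #define BUCKET_BYTES 4096

    /* 16-byte key (the five 24-bit word IDs, padded) plus an 8-byte count. */
    struct entry {
        uint8_t  key[16];
        uint64_t count;
    };

    /* One bucket occupies a 4 KB block; keys are kept sorted so a lookup is
     * one block read plus a binary search. (sizeof(struct bucket) is a
     * little under 4 KB; the block is padded on flash.) */
    struct bucket {
        uint64_t     nkeys;
        struct entry e[(BUCKET_BYTES - sizeof(uint64_t)) / sizeof(struct entry)];
    };

    extern uint64_t nbuckets;
    extern uint64_t hash_key(const uint8_t key[16]);
    extern int read_bucket(uint64_t bucket_no, struct bucket *b);  /* 4 KB read */

    /* Returns 1 and fills *count if the key is present, 0 otherwise. */
    int ngram_lookup(const uint8_t key[16], uint64_t *count)
    {
        struct bucket b;
        if (read_bucket(hash_key(key) % nbuckets, &b) != 0)
            return 0;
        size_t lo = 0, hi = b.nkeys;
        while (lo < hi) {                       /* binary search, keys sorted */
            size_t mid = lo + (hi - lo) / 2;
            int c = memcmp(key, b.e[mid].key, 16);
            if (c == 0) { *count = b.e[mid].count; return 1; }
            if (c < 0) hi = mid; else lo = mid + 1;
        }
        return 0;
    }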
But given all these simplifying assumptions that I've made, the initial hash
table construction is problematic: you have 1.1 billion inserts to do and
you're getting, let's say, a hundred thousand write I/Os per second. So the
obvious thing to do is: I can generate the file of key-value
pairs and their hashes very easily, I can sort it, and if I insert
in sorted order then I get a great hit rate even in a tiny cache, and I can
actually build the hash table in a time comparable to just copying it, which
is on the order of half an hour. Certainly with optimization you could do
better, but it's one thing you can do.
And then, you know, one of the steps here is you do have to do that
sort. And of course there are a lot of external sort programs. But why not
just see what happens if we mmap the file and call qsort? It's sheer
laziness, but it works, and you only have to do it once.
And it actually presents kind of an entertaining little pathological test case.
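The lazy sort is more or less literally this; the record size is illustrative,
and where the talk's version uses an optimized qsort with the comparison
inlined, this sketch just uses libc's qsort:

    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define REC_SIZE 24   /* fixed-size (hash, key, count) records; illustrative */

    static int cmp_rec(const void *a, const void *b)
    {
        return memcmp(a, b, REC_SIZE);   /* records sort by their leading bytes */
    }

    /* mmap the whole file of fixed-size records and hand it to qsort.
     * Dirty pages are written back through the page cache, which is what
     * makes this such an entertaining stress test for the file system. */
    int sort_records_in_place(const char *path)
    {
        int fd = open(path, O_RDWR);
        if (fd < 0) return -1;
        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return -1; }
        void *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { close(fd); return -1; }
        qsort(p, st.st_size / REC_SIZE, REC_SIZE, cmp_rec);
        if (munmap(p, st.st_size) < 0) { close(fd); return -1; }
        close(fd);
        return 0;
    }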
And here are the results. In this case, so I could run it a bunch of times and
not have to wait a day, we just did the first 65 percent of the data. So
these are times for the first 715 million keys using an
optimized but single-threaded qsort. The difference is that unlike
[inaudible] qsort there's not an indirect function call for the comparison --
that's actually inlined -- but that's the only change.
And so what you have here is the wall time: blue is the new file system,
VsFS, green is Ext2, and red is Ext3. And remember that VsFS is providing
crash recovery guarantees closer to Ext3 than Ext2, so it's not surprising
that it's not necessarily quite as fast as Ext2.
You see a big
difference in wall time, a smaller difference in system time, and virtually
no change in user time. The other thing that this graph doesn't show is
that when running on Ext3, there are about 25 percent more voluntary
context switches, which is not too surprising -- you would guess that from the
difference in wall time. It just means that when you're running on Ext3,
you spend more of your time waiting for the operating system or the device
to do something.
So then the question is, well, what happens if I actually try to run
some queries? I chose two different query distributions,
because obviously the query distribution makes a big difference. The first
one is uniformly distributed, which is probably not that realistic for the
machine learning context. The next one I'll show is Zipf distributed, with
one particular parameter chosen for the Zipf distribution.
What we did is run 200,000 queries, in this case choosing the queries
uniformly. The block cache that I talked about is very small, only 1,024
blocks. And again this is with direct I/O. And so what we have here is wall
time, user plus system time, and then the number of voluntary context switches.
And again, this is a percentage reduction, so when you move from
Ext3 to VsFS, with one thread you see the wall time drop by five
percent, user plus system time drop by 20 percent, and the number of
voluntary context switches basically unchanged. It looks a little different
with 16 concurrent threads.
And again, to try to understand what's going on -- we talked
about this earlier -- there is additional locking going on in Ext3,
clearly, and there is also the rooting around through the indirect blocks. So
that's one issue.
Now, for the Zipf-distributed queries I don't really know what a good choice for
the Zipf parameter should be. I did look at a number of values. The
problem is that if the distribution is skewed too much, you're
actually looking not so much at performance of the file system but at
performance of a rather simplistic and small cache, and for our purposes
I think that's less interesting. So in this case, using the same hash
table, the same cache implementation, the same parameters for the cache, the
only thing that's changed is that the queries are Zipf distributed, and the
particular choice of alpha we made is 1.0001. To remind you what
that means: instead of being uniform, it's a
very skewed distribution. Some keys are going
to appear much more often in the query trace than others.
And the larger alpha is, the more skewed it is. The qualitative improvement is
similar. One thing that probably is worth looking at is whether, because of the way
the query distribution is constructed, there is some false sharing. In a real
scenario, two popular keys wouldn't be particularly likely to end up in the same
hash bucket, whereas in this particular run that is likely to happen. If we just
randomly permuted the keys and kept the distribution the same, we could account for
that; this run doesn't account for it. Okay.
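For reference, one common way to generate such a Zipf-distributed query trace -- a sketch under the assumption of a fixed key universe, not necessarily the generator actually used -- is to precompute the normalized CDF for P(rank k) proportional to 1/k^alpha and invert it with a binary search:

    #include <math.h>
    #include <stdlib.h>

    /* Build the CDF for P(rank k) proportional to 1/k^alpha, k = 1..n. */
    static double *zipf_cdf(size_t n, double alpha)
    {
        double *cdf = malloc(n * sizeof(double));
        double sum = 0.0;
        for (size_t k = 1; k <= n; k++) {
            sum += 1.0 / pow((double)k, alpha);
            cdf[k - 1] = sum;
        }
        for (size_t k = 0; k < n; k++)
            cdf[k] /= sum;              /* normalize to [0, 1] */
        return cdf;
    }

    /* Draw one rank by inverting the CDF with a binary search. */
    static size_t zipf_draw(const double *cdf, size_t n)
    {
        double u = (double)rand() / RAND_MAX;
        size_t lo = 0, hi = n - 1;
        while (lo < hi) {
            size_t mid = (lo + hi) / 2;
            if (cdf[mid] < u) lo = mid + 1; else hi = mid;
        }
        return lo;                      /* 0-based rank; rank 0 is most popular */
    }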
Let's see here. I'm just checking on the time. Okay. So here's the part -- I'm
almost done. This is the next-to-last slide -- really the last slide. I just want
to offer some musings on what the next step is. Clearly the CPU overhead of the
device driver is a significant problem, especially for some workloads, and the
write side in particular suffers from that. As we talked about earlier, there's
some question about where the right line is to draw for storage management on the
flash device. Or does it even make sense to push it into the network with an
RDMA-like interface?
And I ran a few microbenchmarks to see what happens. If you talk to the file system
directly from the kernel and eliminate the context switch, you -- big surprise --
get a performance improvement. But does that mean RDMA makes sense? I don't know.
One of the fellows who just left raised the question of whether the device is going
to be able to keep up, in hardware terms, with commodity hardware, and so you may
get bitten by the fact that commodity hardware just gets faster, faster.
And I think the more important issue is that there isn't really any compelling
reason to interact with flash as an ordinary mass storage device. Does it make
sense for the exported interface to be "store this key-value pair" -- some kind of
hash-like index? Is that the right thing to optimize for, and should you provide it
as a library or even push it into the device driver?
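To illustrate the shape such an exported interface might take -- these names are purely hypothetical, not Fusion IO's or VsFS's actual API -- the contract is essentially a persistent hash index rather than a block device:

    #include <stddef.h>
    #include <stdint.h>

    /* A hypothetical key-value interface exported by a library or the
     * device driver instead of (or alongside) the block interface. */
    struct kv_store;

    struct kv_store *kv_open(const char *device);
    int  kv_put(struct kv_store *s, uint64_t key, const void *val, size_t len);
    int  kv_get(struct kv_store *s, uint64_t key, void *val, size_t *len);
    int  kv_delete(struct kv_store *s, uint64_t key);
    int  kv_sync(struct kv_store *s);   /* durability barrier */
    void kv_close(struct kv_store *s);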
And then instead of this partitioning of a sparse block address space, maybe it
makes sense to actually have a first-class object store, because then you can
attach some additional metadata to each object, so that, for instance, if you're
using it as a cache for a database or for a web proxy or whatever it happens to be,
you can associate some additional metadata directly with the object. You could also
do that through the file system, but maybe it makes sense to make it a first-class
abstraction, through a library or through the system software stack -- something
like the sketch that follows. Okay. So I think we've covered all these points
already. But with a little secret sauce, NAND flash is interesting.
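In the same hypothetical spirit as the key-value sketch above, a first-class object store would let the caller attach a small metadata blob when the object is created, rather than layering that on top of a block or file interface; again, these names are invented for illustration:

    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t obj_id_t;

    /* Hypothetical object-store interface: each object carries a small,
     * caller-defined metadata blob (e.g. a cache tag or database page id). */
    obj_id_t obj_create(const void *meta, size_t meta_len);
    int      obj_write(obj_id_t id, uint64_t off, const void *buf, size_t len);
    int      obj_read(obj_id_t id, uint64_t off, void *buf, size_t len);
    int      obj_get_meta(obj_id_t id, void *meta, size_t *meta_len);
    int      obj_delete(obj_id_t id);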
>>: [inaudible].
>> William Josephson: Sure.
>>: Would you put that over into -- what are the compelling ways to interact with
[inaudible]?
>> William Josephson: Well, that's a good question. I guess to my mind the thing
that's different is that if I am using flash to build a cache for a database, for
instance, the number of I/Os per second means I think you're going to interact with
it differently. Now, it may be that if you had this new storage interface backed
not by a small number of disks but by a large farm of disks with a large cache and
a lot of disk parallelism, then the answer may be that you want this different
interface for that as well. I can imagine pushing this kind of flash into a laptop
and using this new interface there, but on a laptop with a single disk I don't
think these interfaces are as interesting.
>>: I [inaudible] very sure about this when Amazon direct service [inaudible] very
similar API as [inaudible], so I guess the -- do you think this interface is good,
and is it flash specific?
>> William Josephson: I don't think it's flash specific, if that was your question.
But I don't think it necessarily makes sense for individual consumer disks.
>>: It does a very nice [inaudible] interface for storage but [inaudible].
>>: This goes back to the whole [inaudible] question. This is [inaudible] semantic
functionality from storage functionality and direct access to --
>> William Josephson: Still, I'm not claiming this part of it -- this idea -- is
new [inaudible].
>>: But I think it makes sense.
>> William Josephson: Yeah, I think it does make a lot of sense, and I guess what
I'm advocating is that, you know, there's no reason to have flash continue to have
this block-based interface.
>>: [inaudible] you're hiding what media is running behind it.
>> William Josephson: And moreover, in the case of flash, as we saw, for things
like garbage collection and wear-leveling there is a fair amount of mechanism
there. And I think the one thing that may be different is that since somebody
already has to go to the trouble of engineering all of that to get good write
performance, it makes sense to have this interface in that layer, as opposed to an
additional layer on top of it, because stacking the abstractions has a cost. And,
you know, they're not actually adding significantly to the burden of the person who
is already implementing that more primitive abstraction. Okay. Well, I think that
pretty much does it. Further questions?
>>: Why does flash need to have a number of pages in one [inaudible]?
>> William Josephson: So the big difference between NOR flash, which is random
access, and NAND flash is that NAND reduces the number of wires on the die, and I
think the primary reason is to make it denser when they fabricate it. It's a
fabrication density issue as opposed to a fundamental issue. Readout from these
chips is typically serial at the chip level.
>>: So in the first half of the talk you've mentioned that the [inaudible].
>> William Josephson: Yeah. So the conventional wisdom that I often hear about
flash is that when it fails, it's a failure to write or a failure to erase, so your
data is still there and it's not a big deal. Talking with the Fusion IO guys,
that's simply not the case, but the actual failure modes are something that seems
to be, as far as I know, a fairly tightly held secret in the industry, particularly
among people like Samsung who are the ones who fabricate the chips. And I haven't
been able to find a lot of good real-world information about failure modes. There's
some device-level work that's been published in places like IEEE venues, but to
actually see what types of failures are happening in the real world in the
enterprise -- I don't know of a good study of that, and I think it would actually
be very interesting to know. Because they have found they really do need additional
ECC above what's provided by the chips in order to give the reliability that you
would expect. And they've tried to describe some of these failure modes, but I
don't have a paper to refer anyone to on them. That's a significant practical
challenge, apparently, but not something I know much about. If there aren't any
other questions, I think that should do it.
>>: I have one additional question. When you implement flash [inaudible]
right, I mean [inaudible] and file system [inaudible].
>> William Josephson: Meaning --
>>: Mem-mapping basically between [inaudible].
>> William Josephson: The virtual address and the actual physical
[inaudible].
>>: [inaudible] if that's also implemented in a [inaudible] in the flash drive?
>> William Josephson: I'm sorry. Say that again.
>>: So I mean basically.
>> William Josephson: The index has to be locked. Is that --
>>: No. What I mean is -- I know basically for each flash block they usually have
additional metadata on the order of basically [inaudible] which basically
[inaudible] some problems to that [inaudible].
>> William Josephson: And in fact, what --
>>: But if you do that, I mean, basically every time you boot you have to take a
long time to read this.
>> William Josephson: That's right -- and that's a real problem. What most people
seem to do in practice, if they want performance, is they keep it there and they
have an additional chunk of flash. I said it was 160 gigabytes formatted capacity.
It's actually larger -- not much larger, but larger than that. And part of that
stores a log. It is literally a write-ahead log of sorts, the non-volatile portion
of which is stored in additional flash that's not addressable by the user, only by
the device driver.
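My outsider's mental model of that recovery path is roughly the following sketch: the driver-private flash region holds append-only records of remapping updates, and on boot the driver replays them to rebuild the in-memory index. The record layout and names here are invented for illustration, not the actual Fusion IO format.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical log record: "virtual (user-visible) block V now lives at
     * physical flash page P".  A deletion could use P == INVALID_PAGE. */
    #define INVALID_PAGE UINT64_MAX

    struct remap_record {
        uint64_t vblock;
        uint64_t ppage;
        uint32_t seqno;
        uint32_t crc;      /* guards against a torn final record */
    };

    static int crc_ok(const struct remap_record *r)
    {
        /* placeholder: a real implementation verifies r->crc over the record */
        (void)r;
        return 1;
    }

    /* On boot, replay the log from the driver-private flash region to
     * rebuild the in-memory virtual-to-physical index. */
    static void rebuild_index(const struct remap_record *log, size_t nrec,
                              uint64_t *index /* indexed by vblock */)
    {
        for (size_t i = 0; i < nrec; i++) {
            if (!crc_ok(&log[i]))          /* stop at the first corrupt record */
                break;
            index[log[i].vblock] = log[i].ppage;   /* last writer wins */
        }
    }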
>>: Now, the question is -- is that additional storage on the flash memory still
useful? I mean, basically, apparently you want to store the index in a certain
hidden space so that when this is pulled out [inaudible].
>> William Josephson: Right.
>>: From the basic flash -- we view basically the index in the [inaudible]. But if
you don't actually work through the [inaudible] basically flash drive, then leaving
basic information attached to each of the blocks doesn't seem to be necessary.
>> William Josephson: The fact of the matter is I'm not intimately familiar with
that portion of the software stack in this device. I think that's also where some
additional error-correcting information is kept.
>>: Yes. That's fair. Because I know that that space actually [inaudible]
information is together with [inaudible].
>> William Josephson: I think you had another question, or --
>>: So will we have any [inaudible] RAM on the flash?
>> William Josephson: They do.
>>: So where the [inaudible].
>> William Josephson: They're actually currently held in the device driver on the
host CPU. And that's why, you know, the garbage collector is actually running in
this device driver. That's why this device driver is actually an expensive thing --
the --
>>: [inaudible] so if you ever --
>> William Josephson: Well, there's also a processor running on the device, and it
keeps enough information to do the logging as the requests come down to the device.
And so when you crash and come back up, there's enough in this separate area, where
there's a write-ahead log, to reconstruct --
>>: Okay.
>> William Josephson: A consistent snapshot. It may not be the most recent snapshot
of what was held on the host CPU in the driver. So there are -- particularly some
of the high-performance computing folks find it very frustrating that a fair amount
of RAM and CPU horsepower is used by the device driver. But, you know, I think it's
a very good point that in general you're not going to keep up with improvements on
the host CPU in the hardware device itself. A company that size isn't going to be
able to do enough iterations or get enough volume. And the DRAM on the device
itself, although there is some, is not a lot.
>>: So that's the [inaudible].
>> William Josephson: It is.
>>: So if you have a [inaudible].
>> William Josephson: That's right. So there's a chance to make sure that
everything is written appropriately to flash.
>>: [inaudible] actually [inaudible] you can move all the RAM into the system and
probably get the same amount of [inaudible] performance.
>> William Josephson: There are a bunch of design alternatives, and since I wasn't
involved in the design decisions, I don't have a good sense of where the best place
to draw the line is. And of course with a startup, they also have a lot of economic
forces, as opposed to just research forces, deciding where that line is drawn.
>> Jin Li: Any additional questions? Let's thank William for an excellent talk.
>> William Josephson: Thank you.
[applause]