>> Jin Li: It's my great pleasure to introduce William Josephson. William was an undergrad at Harvard University, and he is currently a graduate student at Princeton University working with Professor Kai Li on how to use flash drives to build storage systems. Before joining Princeton, William worked with Data Domain, a leader in deduplication storage systems. He has also interned at a number of labs such as Sun and Bell Labs, and has worked with the Institute for Defense Analyses on a number of occasions. Today he will give us a talk on how to build a file system for the next generation of flash drives. Without further delay, let's hear what William has to say.
>> William Josephson: Thank you. I want to preface this talk by saying it really is a work in progress. Some of the results are -- is this on? I think so. Some of this is hot off the press, as in last week I was still fiddling with some of it. A lot of the work is also in conjunction with a startup based in Salt Lake City called Fusion IO. Their product, which they call the ioDrive, is actually a flash disk -- a little bit different from some of the existing flash drives in that it sits on a PCI-E bus rather than behind a serial ATA or SCSI bus. The numbers that I'll talk about are for their device.
But before I dive into details, I want to talk just a little bit about flash. I'm sure most of you are familiar with it, but I want to make sure we're all on the same page. About two or three years ago Jim Gray talked about why flash and made the observation that tape is dead, disk is more and more tape-like, flash has the opportunity to replace disk, and of course locality in RAM is paramount for performance. So why flash? Well, it's non-volatile and it has no mechanical components, so you don't have to worry about mechanical components scaling -- you don't have to worry about seek times improving at much less than Moore's law. It's relatively inexpensive and it's getting cheaper. It has the potential for significant power savings, although in practice I've seen some controversy over existing flash implementations and their power savings. When Jim Gray made this observation in 2006, he talked about 6,000 I/Os per second for then-current devices; a combination of innovation in the packaging of multiple flash chips and in firmware and device drivers has improved that by basically a factor of 10. So the way I like to summarize it is: if you're looking to optimize dollars per gigabyte of storage, you look at disk, and if you're looking to optimize dollars spent per I/O per second, flash is probably where you want to start looking.
So the question a lot of people ask is, well, why not just battery-backed DRAM? Well, flash cost is actually getting to the point where it's cheaper than DRAM per byte. Both markets are pretty volatile. I was looking at the spot market rate for both of these as of last week, and a combination of popular new consumer devices like the iPhone coming out and just the general economic situation has caused a lot of volatility. But more to the point, the memory subsystems that support DRAM are actually pretty expensive if you have to support a large volume of DRAM. And so one way to think of it is that flash is just another level in the memory hierarchy, inserted between disk and main memory. This graph here is a little bit out of date, but it shows the price of DRAM and the price of NAND flash.
And not only has NAND flash crossed over, but the gap is growing, and that's the point I want to make here.
So just again a quick review. Flash is non-volatile and solid state; individual cells are comparable in size to an individual transistor. It's not sensitive to mechanical shock, so it's popular for a lot of consumer devices and historically popular in a lot of military and aerospace applications. However, rewriting a block of flash requires a prior bulk erase operation. That means you can only program bits in one direction; you can't flip them back without doing a bulk erase on a large number of bits. And something that's made a lot of in the literature, of course, is that individual cells have a limited number of erase or write cycles.
There are two categories of flash: NOR flash and NAND flash. We're going to talk primarily about NAND flash. NOR flash allows random access and is often used for firmware, primarily because you can execute in place -- you don't have to copy it into RAM before you execute from it. But NAND flash has higher density and is more typical in mass storage systems. Another important dichotomy is the difference between SLC and MLC flash. SLC flash is typically more robust, has more write cycles before it wears out, and is typically higher performing, but it is lower density. With MLC there are multiple voltage levels in an individual cell that allow you to encode more than one bit per cell. The device I'll be talking about today has versions built with MLC, but the one that I've been using is an SLC device.
Okay. A little bit more about the economics. The fact that the individual cells are simple has a couple of advantages. It improves fabrication yield, and NAND flash is typically the first thing to use a new process technology: when they shrink die sizes, NAND flash is typically the first thing fabricated on the new line. Moreover, since blocks of flash naturally fail and that's expected, you have to be able to deal with failures anyway, so chips come from the factory with defects already on them, marked as defects. That means the yield is further improved, as opposed to a processor, where if you have a fabrication defect in the ALU, by and large you just have to toss the whole die. And of course the high volume for many consumer applications has also helped to force down the cost.
As for the organization of NAND flash, data is organized in individual pages, and read and program operations happen at the level of a page. A page is typically anywhere from 512 bytes to four kilobytes. Those pages are then organized into erase blocks, and erase blocks vary widely in size, anywhere from 16 kilobytes up to a few devices, such as the Fusion IO, that actually use 20 megabyte erase blocks. So they will do a bulk erase on a whole 20 megabyte region, which consists of many, many pages of course. But you can't reprogram used pages in an erase block until the entire erase block has been erased.
Okay. So some of the challenges in using flash, then: it's block oriented, so it looks a little bit like a disk in that sense. Reads, and particularly writes, occur in multiples of the page size. Typically you can't program just a part of a page -- that's more of an interface issue, but you typically program the entire page, and you erase the entire erase block.
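As an illustration of the page and erase-block behavior just described, here is a minimal sketch in C. The geometry constants are made up for illustration (real parts range from 16 KB up to, in the Fusion IO case, 20 MB erase blocks), and it assumes the usual NAND convention that an erased cell reads as all ones and that programming can only drive bits one way:

    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE        4096                  /* unit of read/program */
    #define PAGES_PER_BLOCK  64                    /* illustrative only    */
    #define BLOCK_SIZE       (PAGE_SIZE * PAGES_PER_BLOCK)  /* unit of erase */

    struct erase_block {
        uint8_t data[BLOCK_SIZE];
    };

    /* Bulk erase: every bit in the whole block goes back to 1. */
    static void flash_erase(struct erase_block *b)
    {
        memset(b->data, 0xff, sizeof b->data);
    }

    /* Program one whole page: bits can only go from 1 to 0.  There is no
     * way to flip a bit back without erasing the entire block first. */
    static void flash_program(struct erase_block *b, int page,
                              const uint8_t buf[PAGE_SIZE])
    {
        uint8_t *p = b->data + (size_t)page * PAGE_SIZE;
        for (int i = 0; i < PAGE_SIZE; i++)
            p[i] &= buf[i];
    }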
Because erase blocks are bulk erased, updates, instead of happening in place, are typically done by copying. That means you introduce a level of indirection: if you want to rewrite logical block 5, you read in logical block 5 from wherever its current physical location is, make the modification, write it to a new physical location, and then update an index that tells you what the physical block corresponding to logical block 5 is. There are a limited number of erase cycles, and that requires wear-leveling, because you don't want to keep hammering on the same physical block; you want to spread the writes across the physical blocks. A lot was made of that in the early literature, but in practice it's not a big deal, because since you're doing copying to support updates anyway, it's fairly natural to extend that to do wear-leveling, so that's not a huge additional issue. The built-in error correction on a lot of these hardware chips is not sufficient. Unfortunately what little I know about it is all under NDA with Fusion IO, but they have some really, really bizarre failure modes, and it really does require additional error correction either in the firmware or in the device driver. And also, good performance requires both hardware parallelism and software support. In the device that we'll be talking about today, that software support is split between the firmware on the board and the device driver running in the operating system. Exactly where that line should be drawn is, I think, a research question in the long term. They've chosen one particular place to draw it. You could imagine pushing more of it into the firmware or into a processor sitting on the board, or you could even imagine pulling more of it up out of the device and making the device dumber. But exactly where the right spot is, I think, is open to question.
Okay. So maybe the best question is: why another file system? There are an awful lot of them out there. There are a lot that are designed specifically for disks: FFS, the various Linux file systems -- Ext2, Ext3, and now Ext4 -- SGI's XFS, Veritas's file systems, and of course Microsoft's file systems. FAT is probably the most common one on flash at the moment, just because that's what's used in embedded devices, but there's a wide variety to choose from. Most of these file systems are really designed for disks and not for flash, so you have a layout and a block allocator that were designed for disks, not flash. Moreover, the firmware on the flash disks is already implementing a level of indirection to support wear-leveling, copying, and block allocation. So you have two allocators, basically: you're running the file system's allocator on top of the flash device's allocator, which doesn't seem ideal. There are also a number of file systems designed specifically for flash, including JFFS and YAFFS. JFFS is designed specifically for NOR flash, the other for SLC NAND. They are log structured and they implement a lot of the features that I just talked about, but they're intended entirely for embedded applications; they are interested in limiting memory footprint and dealing with small systems in general. So in practice people are running things like Ext2 or Ext3 in an enterprise or server environment, and you end up with two allocators, and that's an opportunity. Okay.
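To make the indirection and copying concrete -- the "rewrite logical block 5" example above -- here is a minimal sketch of the kind of remapping the flash layer is already doing. The flat array and the append-only allocator are purely illustrative; a real translation layer would use something more like the logged tree mentioned later, and phys_write and phys_invalidate are hypothetical helpers:

    #include <stdint.h>

    #define INVALID_PBN UINT32_MAX

    struct ftl {
        uint32_t l2p[1u << 20];   /* logical block -> physical block index  */
        uint32_t next_free;       /* naive "always write somewhere new"     */
    };

    void phys_write(uint32_t pbn, const void *data);   /* hypothetical         */
    void phys_invalidate(uint32_t pbn);                /* GC reclaims it later */

    void logical_write(struct ftl *f, uint32_t lbn, const void *data)
    {
        uint32_t old = f->l2p[lbn];
        uint32_t pbn = f->next_free++;    /* wear-leveling falls out of never
                                             rewriting the same place        */
        phys_write(pbn, data);            /* write to a fresh physical block */
        f->l2p[lbn] = pbn;                /* update the index                */
        if (old != INVALID_PBN)
            phys_invalidate(old);         /* old copy is now garbage         */
    }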
So the idea behind our file system is, instead of running those two storage managers, to just let the device do it. It already has a storage manager; why not let it do the work? The file system remains responsible for directory management and access control, but the flash disk is responsible for figuring out which blocks on flash to allocate to a particular file, doing all the copying when we want to rewrite blocks, and so on. I'll talk a little bit toward the end about where we anticipate going with this next, and the longer term question is: what should the storage interface look like? In this case we've taken advantage of a few features of an existing flash disk and used that to build a file system. But there's no reason that we have to think of flash as a disk with a traditional block-based disk interface. In fact, the Fusion IO device sits on the PCI-E bus and uses 16 lanes of PCI-E to talk to the operating system. It exports a disk interface, but that disk interface is actually exported by the device driver running the device; the driver is not speaking to the device across the PCI-E bus using a block-based disk interface. So it may well make sense to expose more of that interface to individual applications. The reason they haven't done that yet is that if you're a startup you need to be able to sell devices right away, and everybody is already set up to use a block interface, so that's the natural first interface to use. But that doesn't mean that even Fusion IO thinks that's the best interface in the long term, especially for high performance.
Okay. So what, at a bare minimum, does this file system that I have described in very general terms require? Well, it currently relies on four features of the flash disk. The first, and perhaps the most important, is a sparse block or object-based interface -- the reason we settled on this design is that it's for virtualized flash storage; there's a level of virtualization going on there. You can set up the Fusion IO disk to use 64-bit block addressing. So we have, let's say, 160 gigabytes of flash, but we can address it with a full 64 bits of address space, and the device driver and firmware layer will figure out how to map that sparse 64-bit address space onto 160 gigabytes of actual physical flash. So what we can do is use those extra bits of address space, partition them, and treat them as an object identifier and an offset within an object. It's kind of a crude approach, but they haven't exported their object-based interface outside the device driver yet. Another thing that we depend upon the flash disk for is that block allocations are crash recoverable. What I mean by that is that if you pull the plug on the flash device, it will come back up in a consistent state. It may forget about a write that you have made, but you don't have to worry about the consistency of the index mapping the virtual address space to physical blocks on the flash device. So not only have we delegated the block allocation, we've also delegated a lot of the functions that go into crash recoverability in a file system -- not all of them, because of the file system metadata, the directory data structures and so on, but for individual block allocations we don't have to concern ourselves with logging.
Now, a third feature, which is one I'm not going to talk about today because I'm still working with the engineers at Fusion IO on this part of it, is an atomic multi-block update. Because they're already doing logging, I can arrange to have updates to two separate logical blocks either both happen or both not happen. That's useful for doing directories. As I said, this is work in progress, and since that isn't fully implemented, I'm not going to talk about directories so much today.
>>: [inaudible].
>> William Josephson: That's right. So they can be two logically discontiguous blocks. Where they end up physically, I don't know, because that whole mapping is in a black box. I think actually in the longer term the ability to do this, and possibly even to do it in a distributed fashion -- they've talked about having multiple of these devices and offering a distributed primitive for that -- is very interesting. Whether the latter is really practical is open to question. But on a single device they certainly can do it, and it could be very interesting for a lot of applications to have that atomic update.
And then the fourth thing is what they call a trim. This actually does exist on a lot of flash devices already. Basically you say: I'm not going to use this logical block, or this range of logical blocks, anymore. So, for instance, truncate in the file system is implemented by saying, here is the range of blocks that represents this file; you can throw them all away, I want you to pretend that these are unallocated. Then the garbage collector can come along and reuse the physical blocks that correspond to that logical block range.
>>: [inaudible] try to summarize [inaudible] and when you talk in [inaudible] system. So this gives a basic interface like a large set of virtual blocks, right? I mean, maybe this is, let's say, two to the power of [inaudible] gigabytes --
>> William Josephson: Right.
>>: And [inaudible] a 64-bit address space. But there's only a small number of blocks [inaudible] use.
>> William Josephson: That's right.
>>: And you can make sure it's crash recoverable, you can do [inaudible] update, and you can say, okay, use these blocks --
>> William Josephson: That's right.
>>: [inaudible] I mean conceptual space --
>> William Josephson: That's right.
>>: [inaudible].
>> William Josephson: That's right. To put it in very concrete terms -- I find it often useful to have an implementation in mind when I think about it; it may not be how it's actually implemented, but just an implementation in mind -- you can think of it as an index that maps virtual to physical, implemented maybe as a B-tree with write-ahead logging. That gives you one way you could imagine implementing this. Whether that's the right way is open to question.
Okay. So I think this next bullet I've already talked about in the context of the first four. Okay. So more concretely, how do we represent files in this new file system? Well, a file is represented by an object or a sparse block range. It may be that the object interface is better in the longer term -- I'll explain why I think that might be the case later -- but suppose that the device has a sparse 64-bit block address space. If you have 512-byte blocks, that's a 73-bit byte-level address space.
Reserve the most significant 32 bits of the block address to represent an inode number and the least significant 32 bits to represent a block offset within the file. A very simple approach. Create and truncate require you to update directory metadata, and, as I said, you use trim to implement truncate. At the moment, since the atomic multi-block update isn't available, we do actually have to do a little bit of logging here; that's not something we anticipate in the long term, that's just how it's implemented on the current device. For writes we get crash recovery by delegating to the device, and the consistency of file contents, as is often the case with file systems, we make the responsibility of the application. So the file system guarantees that when you come up you're not going to get blocks from another file, but it doesn't mean that you'll necessarily get all the blocks that you thought you wrote to the device unless you call fsync to make it all consistent. The mapping is guaranteed, but if you haven't waited for all the I/Os to come back, then they may or may not have made it. That's fairly typical for file systems.
As I said, directories are work in progress, and directories aren't implemented the way we'd like them to be yet, pending some software work at Fusion IO. Our current thought is to implement them as sparse hash tables rather than to use the FFS approach of having just a list, basically. FFS -- or UFS -- as you recall, just keeps a list of entries that have a file type, a file name, and an inode number; a directory is just a file containing these little entries. It doesn't scale very well. A lot of file systems are using B-trees. It seems that, given the fact that we already have this sparse address space for files, it might make sense to just hash the file name and then use that hash as an index into this sparse address space for the file. But I can't say anything about the performance of that yet.
Okay. Since the basic idea is fairly straightforward, I'm going to talk at a little greater length about what kind of performance we've gotten from this and then what that might say about what a better interface to flash would look like. The evaluation platform is a fairly recent version of Linux. It's running on a four-core machine with four gigabytes of DRAM. You'll actually notice the four cores in the performance numbers; there will be some cases where four appears, so that's something to remember. The Fusion IO device is 160 gigabytes, formatted, of SLC NAND. There's actually of course more flash than that, but that's the physical space available for data. It sits on the PCI-E bus, the advertised hardware operation latency is 50 microseconds, and the theoretical peak throughput is 120,000 I/Os per second. As we'll see, we don't get very close to that, and there are a number of reasons. One I'm aware of is that there are some locking issues in the device driver, but it's open to question just how close we can get. Even if you just open it as a raw device and do block I/O directly to the raw device, you're not going to get 120,000 I/Os per second at the moment.
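Going back to the file representation for a moment: the inode/offset split just described really is only a couple of lines. A minimal sketch -- the names are illustrative, not the actual DFS source, and vflash_read stands in for whatever the driver's sparse-block read entry point actually is:

    #include <stdint.h>

    /* Most significant 32 bits: inode number.
     * Least significant 32 bits: block offset within the file. */
    static inline uint64_t file_block_to_vblock(uint32_t ino, uint32_t blkoff)
    {
        return ((uint64_t)ino << 32) | (uint64_t)blkoff;
    }

    /* Reading the 4 KB block at byte offset pos of inode ino is then just
     *
     *     vflash_read(file_block_to_vblock(ino, pos / 4096), buf);
     *
     * There are no indirect blocks to chase: the mapping is a shift and
     * an OR (equivalently, a multiplication and an addition), and a trim
     * over the file's whole block range implements truncate. */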
>>: So how [inaudible]?
>> William Josephson: Well, as we'll see in a few slides, it depends on whether you're doing reads or writes. There is a big difference there. For reads, something in the low to mid 90,000 I/Os per second is achievable. Another issue I'll talk about a little bit more is that it also depends on how much concurrency you have. Single-threaded performance is very different from multi-threaded performance, because there's basically a pipeline, and you need to fill the pipeline. So the actual latency, as opposed to the theoretical latency, is not ideal, but if you fill up the pipeline you can get a lot of I/Os per second in aggregate.
>>: [inaudible] utilization?
>> William Josephson: CPU utilization is actually something I do want to talk about, because one of the advantages of the simpler approach is that we can actually get slightly better performance at slightly lower CPU cost. I think that's what makes it a little bit interesting.
>>: Is that drive there -- are they using the PC card slot?
>> William Josephson: No, it's a half PCI-E form factor card. It's not -- PC card to me means a little thing that fits into --
>>: [inaudible].
>> William Josephson: Yeah, it's an actual slot on the motherboard. They actually have a couple of different form factors.
>>: I've seen an [inaudible].
>> William Josephson: They also have one, through a deal with HP, that fits a slightly different form factor PCI-E slot on HP motherboards.
>>: You said earlier you were debating which processing goes in the [inaudible] device versus the host? In your slides, both files and directories, is that all --
>> William Josephson: That's a great question. I'll try to answer that a little bit later; let me start stepping through it. But that's a good question. I think there are kind of three components. There's the hardware -- the hardware parallelism is one part of the magic sauce, and that obviously sits on the PCI-E bus. There's firmware, and actually a PowerPC on the device, and there's a question of what belongs in firmware as opposed to the device driver. And then above the device driver you have the file system. So on the question of where to draw the line, what we've decided is to take a lot of what's in the file system and push it down into the device driver at the moment. But you could imagine then taking much of what's in the device driver and pushing it even farther down, onto the card, running on that PowerPC.
>>: I only ask because it seems like direct [inaudible] user perception --
>> William Josephson: That's right.
>>: Why wouldn't that live on the host?
>> William Josephson: That probably would live on the host. But one thing you could imagine doing is, once you set up access to a file, deciding then to do RDMA to the file itself and only doing access control and so on in the file system. And I think that's ultimately where we're going, especially when we have multiple of these cards sitting on a chassis that may be shared.
>>: [inaudible] anything from the [inaudible] that gives them the data?
>> William Josephson: So, not explicitly. I'm not --
>>: [inaudible] parallel [inaudible]?
>> William Josephson: I'm sure it is. I mean, I think this is a natural thing to do. I have been talking with some folks at Sandia, and they claim to have tried to do some similar things -- not with Fusion IO's devices and not with flash -- and they are not seeing a huge speedup by doing this RDMA trick. But it seems like a natural thing to do.
So at least from a research perspective I think it's worth exploring some more. Let me step through a slide or two, and then we'll have time for more questions.
Okay. So for the preliminary performance evaluation I used IOzone, which is a fairly well-known UNIX tool that just does a whole lot of I/Os with a certain distribution, and we'll take a look at that. One of the things, as I said, is that you have this latency versus throughput issue with the device, so we'll take a look at I/Os per second as a function of the number of threads. I'll also compare with a couple of existing, commonly used file systems on Linux. Just for those of you who aren't familiar with them: Ext2 is pretty much like the old UNIX file system, and if you crash you can be screwed and you probably will be. With Ext3 there's logging, and Ext3 provides more or less the semantics that we're trying to provide. As I said, directories aren't fully implemented so we don't quite meet that, but I'm also not going to present numbers on directory performance, so I think it is a fair comparison. And the other issue -- I can't remember who asked this -- was the question of the CPU overhead. I think there are two issues to remember here. One is that there's a fair amount going on in that device driver, and so one reason for being interested in pushing it down is to offload that from the host CPU. And then there's also the question of the impact of just the file system code, let alone the device driver. I'll also talk a little bit about mmap, and in that context I'll also talk about context switches -- we're seeing a difference in the number of context switches between these different file system implementations. And then finally I'll talk very briefly about building a 64 gigabyte hash table on flash and what kind of performance we're seeing with that. That's kind of a neat little thing.
>>: Can you explain, with the Fusion IO device, what is the interface that the computer [inaudible] talking to that device? [inaudible] hash table or something?
>> William Josephson: Let me see. I have to think about this for a moment, because a lot of it's under NDA unfortunately. I think the simple answer is that there's a proprietary interface between the device driver and the hardware device, and a block device interface exported from the device driver to the kernel. I'm talking with the Fusion IO folks about trying to figure out, if we were to export some of what is currently that proprietary interface, what it should look like as opposed to what it currently looks like. I think the answer is that, like many startups, it's maybe not the most polished interface; it's the interface that works. I'm not sure that is a satisfying answer, but --
>>: I think the content of the [inaudible] you can, I mean, just [inaudible] hash table or something --
>> William Josephson: I think the tree -- the model that I gave you of it being a tree -- is a much better model to think about, because some of the locality you tend to get with the tree is actually an important part of how they get good write performance. As you observed when we were talking earlier, flash has relatively poor write performance, but if you get sequential updates, you get much better performance, even with flash, and so there's a lot of work going on using a B-tree as the index to try to get sequential updates.
So it's log structured. Does that answer your question, maybe?
>>: I think I may still be a bit confused about [inaudible] your contribution, or the file system, and what's [inaudible]. That's basically it.
>> William Josephson: Fair enough. So I think --
>>: [inaudible].
>> William Josephson: The ultimate goal is to figure out how to use this device for things like databases, running Oracle, and that's actually the next step. We have a simulator, we have some ideas about what that should look like, and I need to finish the implementation before I can tell you what the answer is. Right now the observation, as I'll show you, is that given the work that has to be done to make a flash device go and perform, it's probably best to get out of its way. And so a very simple file system -- I mean, this file system is less than 3,000 lines of code, compared to Ext2, which is 18,000 lines of code. So one person -- who doesn't really know the Linux kernel all that well, as opposed to other kernels -- can in a month write a file system that actually performs better than existing ones.
>>: [inaudible] Fusion IO [inaudible].
>> William Josephson: A lot. And that's a good point. But what I'm saying is that I think it makes sense to separate out the block allocation component. They are trying to provide a high performance device, and they have a particular layout on the device -- a particular hardware architecture -- so they can tune their device driver specifically to that piece of hardware, whereas if I'm writing a file system for a commodity operating system, I want to work with a lot of different devices. So you can innovate in the device driver, in the hardware, and in the firmware separately from the file system. My argument is that that's the right place to do it, not in the file system, for this type of device.
>>: [inaudible] dealing with this, throwing away all this [inaudible] in the operating system that you have. [inaudible] there's two allocators and you're saying toss one away, and it seems like the two options are: you toss away the flash allocator and you let the operating system try to make good decisions about allocation because it knows about the [inaudible] structures, or you toss away --
>> William Josephson: Well, I --
>>: [inaudible]. And you --
>> William Josephson: I agree with you in general.
>>: I was just asking.
>> William Josephson: I argue that -- so for one thing, it is possible to still ask the device driver a lot of the same questions. If you wanted to, you could export access to its internal data structures and look into this tree that's representing a range of blocks. That's one. Two, how valuable are the block allocator data structures for introspection in general, in a file system? One of the few cases where I can see that you really do want to look in is finding holes in a file when you do a backup. And it is possible to do that: with the device driver you can say, here's a block range, enumerate all the parts that are actually populated. So you can still peer into it in that sense.
And, yes, the operating system could make good decisions, but the operating system in general doesn't know, and the company building the flash device isn't going to tell them, how to do allocations in a way that performs well for their particular device. That's part of what they're selling. In fact, if you look at Fusion IO, they're going to say that the intellectual property that's valuable is figuring out how to build the hardware parallelism, one, but perhaps more important, how to develop the firmware and device driver that gets you good performance on that system.
>>: [inaudible] and then you have no reason to differentiate [inaudible].
>> William Josephson: Right.
>>: So [inaudible].
>> William Josephson: And so this allows them to innovate separately. You may not buy it, but that's my argument.
>>: One of the classical conundrums with pushing functionality into the device is that as you push more and more code into the device, by the time they get that done [inaudible] gets to market, it's maybe a couple of generations behind --
>> William Josephson: And that's the --
>>: -- the CPU that you hang it on.
>> William Josephson: That's a very good point. And right now there actually isn't as much pushed down into the device as you might think; a lot of it's actually running in the device driver on the commodity CPU. As I said, pushing more down may arguably be a useful thing to do, but I think the counterargument is precisely the one you made: pushing the whole file system onto this device, which is probably running an embedded processor several generations behind, may not make sense.
Okay. So this is the pretty picture, maybe, but really what I want you to get out of it is this: in each group we have 1 through 64 threads in powers of 2. The first three bars, the blue ones, represent write performance for different kinds of writes, and the red ones represent read performance for different kinds of reads. What you see is that write performance peaks around 16 threads and read performance flattens out between 32 and 64 threads. Part of the reason for that is that the garbage collector eventually hits a wall when you get enough throughput, and that's around 16 threads. But for read performance you basically run into the limitations of the latency per operation and the depth of the pipeline. So there's not a lot deep going on here; I just wanted to show you that the sweet spot for writes is not the same as the sweet spot for reads.
>>: So [inaudible] sequential --
>> William Josephson: So the leftmost bar in each group is the first write to a file -- the file hasn't been populated yet. The next one is rewriting sequentially. The next one is doing random writes. And then you have basically the same things for reads in the next three.
>>: [inaudible].
>> William Josephson: These are all 4K I/Os, and you're bypassing the buffer cache in each case.
>>: [inaudible] is --
>> William Josephson: I'm sorry?
>>: The OS here is relevant?
>> William Josephson: As I said, all this evaluation is on a recent Linux 2.6. That's what I have a device driver for. They actually do have a Windows device driver now, I think, but I haven't used it. Okay.
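For reference, here is roughly what one benchmark worker is doing in these runs: 4 KB I/Os against its own file, bypassing the buffer cache with O_DIRECT. This is not IOzone's code, just the shape of it; error handling is omitted and the path is made up:

    #define _GNU_SOURCE                 /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;
        posix_memalign(&buf, 4096, 4096);   /* O_DIRECT wants aligned buffers */
        int fd = open("/mnt/flash/worker0.dat",
                      O_RDWR | O_CREAT | O_DIRECT, 0644);
        for (long i = 0; i < 1000000; i++) {
            /* pick a random 4 KB block somewhere in a large file */
            off_t off = (off_t)(random() % (1L << 23)) * 4096;
            pread(fd, buf, 4096, off);      /* pwrite() for the write tests */
        }
        close(fd);
        free(buf);
        return 0;
    }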
So again, as I alluded to earlier, a lot of our interest is in things like database performance, where direct I/O -- that is, I/O that bypasses the buffer cache -- dominates. So most of what I'm going to be talking about will be bypassing the buffer cache. An important thing that I want to emphasize here: the way IOzone works is that when we say we have, let's say, four threads, that means there are four processes and there are four files; each process gets its own file to do I/O to. We'll see an artifact due to that in a little bit. And again, these are all average numbers -- I haven't reported the standard deviations here, but they're fairly small.
>>: [inaudible] although a [inaudible] cache, it might still advertise large writes?
>> William Josephson: That's right.
[brief talking over]
>> William Josephson: Well, there are two issues here. One, if you look at something like Oracle -- I can't speak to other databases -- but for Oracle, unless it's a blob, they're typically going to do 4K or 8K writes; for blobs, that's right, it's a different issue. The other thing is that there are a number of parameters here, and I decided on a slice. I've tried to make that as fair as possible, but at some point you're just going to have to believe me that I'm not hiding something.
>>: [inaudible].
>> William Josephson: I have not. I've just run it with its defaults. In fact, one reason I'm not reporting Ext4 numbers is that I've seen really bad Ext4 numbers and I don't understand why yet, and I didn't feel comfortable making fun of them until I understood whether it was my fault or theirs.
Okay. So this is with one thread, and you'll see that there are small improvements for write and also small improvements for read. With four threads, again, small or no performance improvement -- actually, on the write side this is a little bit of an anomaly, and the reason, I believe, is that you have four threads in four processes, plus there's also a garbage collection thread running in the background, and as far as I can tell the scheduler is thrashing a bit. It kind of goes away as you get oversubscribed in the number of processes versus physical processors.
Okay. And here's 16 threads. On the write side you see a modest performance improvement. These are, remember, represented in thousands of I/Os per second, and the new file system is the red bar. Ext2, which doesn't provide crash recovery guarantees, is blue, and the green is Ext3, which is not set up for data journaling; it's just the default configuration under Red Hat's setup. And you'll notice that with read you basically aren't getting any performance improvement to speak of.
>>: [inaudible].
>> William Josephson: How far? More?
>>: More, yes.
>> William Josephson: Yes.
>>: [inaudible] I know this, except in large basic performance, this Ext FS has the slight [inaudible]. So there's two parts, right? Basically the next three are read, read, read, and [inaudible] first three are write. Now, in the beginning you talked about basically [inaudible] essentially implementing this as an [inaudible]. From my point of view it will be a large [inaudible] basically [inaudible].
>> William Josephson: That's right.
>>: And with some interface separating [inaudible], right? And for read, the comparison is basically Ext2 or Ext3 implemented on the flash.
For the read performance, I assume I shouldn't see too much performance difference, because all these systems in a sense just find the flash block to read and then read it from the flash drive. So I expect the performance to be more like the third basic [inaudible].
>> William Josephson: That's right.
>>: And read is basically flat. Could you explain why -- I mean, you were even gaining something during [inaudible].
>> William Josephson: It's a good question. I'm not sure I have all the answers that I'd like on that. Part of it is that with the new file system, if I get a logical read I can compute exactly what flash block to look for, and with Ext2 and Ext3 I can't do that in general, because I have to go look through a bunch of indirect blocks.
>>: Okay.
>> William Josephson: And look that up.
>>: So maybe in Ext3's case, I read a first read, I know, and then I --
>> William Josephson: That's right.
>>: I could basically --
>> William Josephson: And it may be that, even if you think of it as a tree, those internal nodes may actually be [inaudible], but I'll still have to find them and do some locking to access them. Because this new file system is so simple, I can actually get away with simpler locking in the DFS implementation in Linux, and I don't know whether that's an inherent thing or not -- I really couldn't tell you, unfortunately. It's something where I need some more introspection into what's going on in the kernel. But my suspicion is that it's a combination of locking and not needing to look at indirect blocks: I can compute exactly what block to request from flash. The mapping is just a multiplication and an addition rather than looking through some indirect block.
>>: I think it's the write [inaudible] file system is the [inaudible] versus Ext [inaudible] with the data structures.
>> William Josephson: I think that the write side is generally more interesting for two reasons. One, write performance is typically lower on flash anyway, and that's really what most people are worried about when they look at flash -- they're more worried about write performance than read performance, particularly random write performance.
>>: Is there some way that you could --
>> William Josephson: That's something I've been talking to the developers at Fusion IO about: there's a question of what the priority of that thread should be, and also whether or not it would make sense to start pinning these things so they don't bounce from one processor to the other. They report to me that in some of their tests, by pinning it or manipulating the priority, they do get somewhat better performance, but I don't have that version of the driver at the moment.
Okay. So this is -- I can't remember who asked this question, but there was some question about CPU overhead. Remember there are four processes, so under Linux that means you can have up to 400 percent CPU usage, not just 100 percent -- just so we're sure we have our units right. Across the operations we've been looking at, the CPU utilization is typically somewhere between one and a half and three and a half percent of CPU for every thousand I/Os per second delivered.
And so what I'm looking at when I say that is this: in UNIX -- getrusage -- you can ask for user time elapsed and system time, where system time is time spent in the kernel on behalf of your request, and you have the wall time elapsed. So what I'm looking at here is user plus system time, normalized by wall time, per thousand I/Os per second. What we see is particularly pronounced for lower concurrency, but again, looking at 4K direct I/Os -- and in this particular table we're not looking at any change in the number of I/Os per second delivered, we're just looking at percent CPU -- this is the change in CPU utilization when moving from Ext3 to the new file system, DFS. So in addition to reducing the CPU, you're also in general getting better performance; the table doesn't reflect that, but you're getting that better performance at a lower cost as well. This is just the change in how much CPU is used, for one to 16 threads, for reads and for writes. Again there's a little bit of something funny going on with four threads, and I think that's something that, by looking at the priority of the driver thread or pinning threads, we might be able to address further. But aside from that, there's a fairly clear trend. So it's cheaper as well.
>>: [inaudible].
>> William Josephson: I think we'll get some insight into that when I look at this hash table, at least for the read side. It looks like a lot of what's happening is that you have to ask for a lock and you can't get it, and you get rescheduled.
>>: [inaudible].
>> William Josephson: This is actually a very, very simple approach. Literally, each I/O request goes all the way down to the device and all the way back up, with however many threads are shown in the left column doing this. There's nothing fancy going on at all.
>>: I don't know [inaudible].
>> William Josephson: So one problem that I have seen -- there are two approaches that you could imagine using in Linux. One is just having individual threads issuing read or write system calls, and that's what I'm showing here. The other option is the asynchronous I/O interface that POSIX defines. The asynchronous I/O interface actually comes from Oracle; they are the ones that really pushed for it, especially in Linux, and that's what they often use. It turns out that there are some problems with the AIO implementation, and I don't know whether they're in the kernel or in the device driver yet. This is a common complaint, particularly with Fusion IO, and there's some work to be done there from an implementation standpoint, because AIO actually delivers less performance with Fusion IO's devices, across all file systems, than a multithreaded approach does at the moment. And that does need to be fixed.
>>: [inaudible].
[laughter]
>> William Josephson: It's one of those things where I don't know the answer.
>>: [inaudible] context switching overhead and --
>> William Josephson: I think that --
>>: You use [inaudible].
>> William Josephson: No, this is all SLC. Fusion IO does have some prototypes using MLC NAND, but for one thing a lot of their customers are enterprise customers, and they don't want to see MLC; it's cheaper and higher density, but it has higher failure rates and a lower number of write cycles.
>>: Just [inaudible] one more second.
>> William Josephson: Sure.
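An aside to make the CPU metric from a couple of exchanges back concrete: it is just user plus system time over wall time, divided by the delivered thousands of I/Os per second. A minimal sketch, assuming getrusage for the CPU times and a separately measured wall time; the function and its callers are invented for illustration:

    #include <stdio.h>
    #include <sys/resource.h>
    #include <sys/time.h>

    static double tv_secs(struct timeval tv)
    {
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    /* wall: elapsed seconds for the run; nio: I/Os completed in that time */
    static void report_cpu_per_kiops(double wall, long nio)
    {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        double cpu   = tv_secs(ru.ru_utime) + tv_secs(ru.ru_stime);
        double pct   = 100.0 * cpu / wall;   /* out of 400% on a 4-core box */
        double kiops = nio / wall / 1000.0;
        printf("%.2f%% CPU per thousand IOPS\n", pct / kiops);
    }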
>>: So that 1.5 to 3.5 percent CPU is per CPU?
>> William Josephson: No. As I said, this is a little strange. The way Linux works, when you ask for CPU utilization, if you have four processes you can have 400 percent CPU utilization. So another way of thinking about it: this is one and a half to three and a half out of 400. It's a little bit odd. Perhaps I should have renormalized, but I haven't, because that's the way the operating system reports it.
>>: [inaudible] 400 percent.
>> William Josephson: Well, the other reason I didn't necessarily want to normalize this is that, of course, the 400 percent is a little misleading, because to get 400 percent you've got to figure out how to divide your job into four pieces that are parallelizable.
Okay. So this is with mmap, and you actually see a little bit of a different story with mmap. In this graph what I'm showing is I/Os per second delivered, CPU per I/O, and wall time. For I/Os per second we're seeing an increase: when you move from Ext3 to DFS, you see, in the first case, rewrite, 31 percent more I/Os per second delivered. You see a 38 percent reduction -- so the sign is different -- in CPU per I/O, and a 24 percent reduction in wall time. And again we're looking at rewrite, random write, reread, and random read for one and two threads. I have similar numbers for one, two, three, and four threads; beyond four threads, Linux fell over, and I haven't tracked that one down. It just seems to be a bit more than it can take -- some bugs in the kernel. So with mmap the performance difference is actually more significant. I did exclude a first write with mmap, because with UNIX [inaudible] systems it's usually a really bad idea to do a first write through mmap; it tends to really scramble the file system. You really want to fill the file with zeros and then do the mmap. In this case we're looking at -- that's a great question -- a 32 gigabyte file on a machine with four gigabytes of RAM. That's something I should have had on the slide. So the file is fairly large compared to the total amount of DRAM.
Okay. So the last thing I want to talk a little bit about -- microbenchmarks, at least to my way of thinking, are mostly to convince yourself that there might be something there, not to prove that there's something there. I have a few more realistic benchmarks from Sandia that I haven't had a chance to run yet that are more of an HPC nature. But one of the things that we've been doing in our work is looking at very large data sets in general. Some of those are text data sets and many of them are not, but a common problem in both cases is how to build an index for the data set, particularly when the data set may be so large that the index doesn't fit comfortably in DRAM. So we took a look at one particular case. We used the Google n-gram corpus: these are n-grams -- words found on the web -- and they have one-, two-, three-, four-, and five-grams, with each n-gram and its number of occurrences on the web. This is used a fair amount in a variety of machine learning and computational linguistics problems, and a common problem is that it's just too big to fit in DRAM on a workstation.
There are 13 and a half million one-grams and 1.1 billion five-grams. So what's fairly common to do is to take the actual n-gram and map it to an identifier; the identifier can fit in 24 bits, because there are 13 and a half million one-grams, and that gives you a 15 or 16 byte identifier for a five-gram. The memory footprint of the result is pretty close to 26 gigabytes of data. By being clever with your encoding you might be able to reduce it further, but it's large, and it's too large for most workstations. So one approach a lot of people have taken is to use some approximation method and have approximate queries: they will take a five-gram, submit it to their index, and get an answer with some probability of error. Another approach a lot of people have taken is to just get a bunch of machines with enough DRAM in aggregate and then broadcast the query, or figure out an assignment of n-grams to machines and send each query to the right machine. Another approach, of course, is that it's still small enough that you can fit it in memory if you buy a large enough machine -- there are, after all, SGI machines with a terabyte of DRAM that you can buy. But our observation is that in general memory subsystems are expensive: you have to consider not only the cost of DRAM but also the cost of the memory subsystem that goes with it. And then another approach, the one I'm going to talk about, is that you can have a workstation with a moderate amount of DRAM -- two, four, six gigabytes -- and a flash disk, and put the index on the flash disk. So that's what we did.
The design for this is really fairly straightforward. I'm sure -- well, I know -- that one could do a more optimized design. You just divide a large hash table into fixed-size buckets of four kilobytes and sort all the keys in each bucket. I can precompute the occupancy histogram using a single hash function, so I know how big the table has to be and I can guarantee that there are no overflows, because this isn't a case with dynamic updates; I have the full set of keys in advance. I can keep a small cache of these blocks and pin them in memory to avoid copying to the client, so I can just lock a cache block into memory and hand a pointer to the client. And I use either clock or, in this case, random replacement to avoid a single lock on an LRU chain. Obviously this is a well studied problem and we know how to parallelize it better than this, but depending on your query distribution it may be that random is just fine. If you have a very low hit ratio anyway, why go to the trouble of doing something sophisticated, one; and two, a lot of the time these sorts of applications are not written by systems folks. You have to remember that a lot of these are done by people who want to solve a machine learning problem or an HPC problem -- they're scientists or machine learning people, not systems people. So I think in the long term the question is whether there are other primitives we can provide to them so that they can get better performance without having to implement something sophisticated.
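Here is a minimal sketch of the on-flash layout just described, before getting to construction. The structure names and exact packing are illustrative, not the actual implementation: fixed 4 KB buckets sized in advance from the occupancy histogram, keys kept sorted within a bucket, and a small pinned block cache whose slots are evicted at random:

    #include <stdint.h>

    #define BUCKET_SIZE   4096
    #define CACHE_BLOCKS  1024                /* the small pinned block cache */

    /* A five-gram key is five 24-bit word identifiers packed into 15 bytes;
     * the value is the occurrence count. */
    struct ngram_entry {
        uint8_t  key[15];
        uint32_t count;
    } __attribute__((packed));

    /* One 4 KB unit of flash I/O; entries are kept sorted by key so a
     * lookup within a loaded bucket is a binary search. */
    struct bucket {
        uint32_t nentries;
        struct ngram_entry entries[(BUCKET_SIZE - sizeof(uint32_t)) /
                                   sizeof(struct ngram_entry)];
    };

    /* A cache slot records which bucket is resident and points at the
     * pinned 4 KB block.  A lookup hashes the key to a bucket number,
     * loads that bucket with direct I/O if it is absent (evicting a
     * randomly chosen slot), and hands the caller a pointer into the
     * pinned block rather than copying the entry out. */
    struct cache_slot {
        uint64_t       bucket_no;
        struct bucket *data;
    };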
But given all these simplifying assumptions that I've made, the initial hash table construction is problematic. You have 1.1 billion inserts to do and you're getting, let's say, a hundred thousand write I/Os per second. So the obvious thing to do is this: I can generate the file of key-value pairs and their hashes very easily, sort it, and insert in sorted order, and then I get a great hit rate even with a tiny cache, and I can actually build the hash table in a time comparable to just copying it, which is on the order of half an hour. Certainly with optimization you could do better, but it's one thing you can do. One of the steps here is that you do have to do that sort, and of course there are a lot of external sort programs, but why not just see what happens if we mmap the file and call qsort? It's sheer laziness -- it works, and you only have to do it once -- and it actually presents kind of an entertaining little pathological test case.
And here are the results. In this case, so that I could run it a bunch of times and not have to wait a day, we just did the first 65 percent of the data. So these are times for the first 715 million keys using an optimized but single-threaded qsort; the difference is that, unlike the [inaudible] qsort, there's not an indirect function call for the comparison -- that's inlined -- but that's the only change. What you have here is wall time: blue is the new file system, DFS, green is Ext2, and red is Ext3. And remember that DFS is providing crash recovery guarantees closer to Ext3 than to Ext2, so it's not surprising that it's not necessarily quite as fast as Ext2. You see a big difference in wall time, a smaller difference in system time, and virtually no change in user time. The other thing this graph doesn't show is that when running on Ext3 there are about 25 percent more voluntary context switches, which is not too surprising -- you would guess that from the difference in wall time. That just means that when you're running on Ext3, you spend more of your time waiting for the operating system or the device to do something.
So then the question is, what happens if I actually try to run some queries? I chose two different query distributions, because obviously the query distribution makes a big difference. The first one is uniformly distributed, which is probably not that realistic for the machine learning context. The next one I'll show is Zipf distributed, with one particular parameter chosen for the Zipf distribution. What we did is run 200,000 queries, in this case choosing the queries uniformly. The block cache that I talked about is very small -- it's only 1,024 blocks -- and again this is with direct I/O. So what we have here is wall time, user plus system time, and then the number of voluntary context switches. And again, this is a percentage reduction going from Ext3 to DFS: with one thread you see the wall time drop by five percent, user plus system time drop by 20 percent, and the number of voluntary context switches basically unchanged. It looks a little different with 16 concurrent threads. Again, to try to understand what's going on -- we talked about this earlier -- there is clearly additional locking going on in Ext3, and there is also the need to root around through the indirect blocks. So that's one issue. Now, for Zipf distributed queries I don't really know what a good choice for the Zipf parameter should be.
I did look at a number of values. The problem is that if the distribution is skewed too much, you're actually looking not so much at the performance of the file system but at the performance of a rather simplistic and small cache, and for our purposes I think that's less interesting. So in this case, using the same hash table, the same cache implementation, and the same parameters for the cache, the only thing that's changed is that the queries are Zipf distributed, and the particular choice for alpha we made is 1.0001. To remind you what that means: instead of being uniform, it's a skewed distribution -- some keys are going to appear much more often in the query trace than others, and the larger alpha is, the more skewed it is. And the qualitative improvement is similar. One thing that probably is worth looking at is whether, because of the way the query distribution is constructed, there is some false sharing. In a real scenario it wouldn't be the case that two queries that are both popular are very likely to end up in the same hash bucket, whereas in this particular run that is likely to happen; if we just randomly permuted the keys and kept the distribution the same, we could account for that. This doesn't account for that.
Okay. Let's see here -- I'm just checking on the time. Okay. So here's the part -- I'm almost done; this is the next to last slide, really the last slide -- where I just want to offer some musings on what the next step is. Clearly the CPU overhead of the device driver is a significant problem, especially for some workloads, and particularly the write side suffers from that. As we talked about earlier, there's some question of where the right line is to draw for storage management on the flash device, or whether it even makes sense to push it onto the network with an RDMA-like interface. I ran a few microbenchmarks to see what happens if you talk to the file system directly from the kernel and eliminate the context switch, and -- big surprise -- you get a performance improvement. But does that mean RDMA makes sense? I don't know. One of the fellows who just left raised the question of whether the device is going to be able to keep up in terms of hardware with commodity hardware, and so you may get bitten by the fact that the commodity hardware just gets faster, faster.
And I think the more important issue is that there's not really any compelling reason to interact with flash as an ordinary mass storage device. Does it make sense for the exported interface to be this key-value pair or some kind of hash-like index? Is that the right thing to optimize, provided as a library or even pushed into the device driver? And then, instead of this partitioning of a sparse block address space, maybe it makes sense to actually have a first class object store, because then you can attach some additional metadata to each object, so that, for instance, if you're using it as a cache for a database or for a web proxy or whatever it happens to be, you can associate some additional metadata directly with the object. You could also do that through the file system, but maybe it makes sense to make that a first class abstraction, through a library or through the system software stack. Okay. So I think we've covered all these points already.
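To make that object-store musing slightly more concrete, here is one purely hypothetical shape such a first class object interface with per-object metadata could take. Nothing like this is specified by the speaker or by Fusion IO; the names and signatures are invented for illustration:

    #include <stdint.h>
    #include <stddef.h>

    typedef uint64_t obj_id_t;

    /* Whole objects instead of hand-partitioning a sparse block space. */
    int obj_create(obj_id_t *out_id);
    int obj_delete(obj_id_t id);              /* trims everything behind it */

    /* Sparse data within an object. */
    int obj_read (obj_id_t id, uint64_t off, void *buf, size_t len);
    int obj_write(obj_id_t id, uint64_t off, const void *buf, size_t len);

    /* A little metadata attached directly to the object -- for example a
     * tag or expiry when the object is a database or web-proxy cache entry. */
    int obj_set_attr(obj_id_t id, const char *name, const void *val, size_t len);
    int obj_get_attr(obj_id_t id, const char *name, void *val, size_t len);

    /* Atomic update spanning objects, matching the device's existing
     * atomic multi-block write. */
    struct obj_iov { obj_id_t id; uint64_t off; const void *buf; size_t len; };
    int obj_atomic_write(const struct obj_iov *iov, int cnt);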
>> William Josephson: But with a little secret sauce, NAND flash is interesting.
>>: [inaudible].
>> William Josephson: Sure.
>>: Would you put that over into what are the compelling ways to interact with [inaudible]?
>> William Josephson: Well, that's a good question. I guess to my mind the thing that's different is that if I'm using flash to build a cache for a database, for instance, the number of I/Os per second means I think you're going to interact with it differently. Now, if you had this new storage interface backed not by a small number of disks but by a large disk farm with a large cache and a lot of disk parallelism, then the answer may be that you want this different interface for that as well. I can imagine pushing this kind of flash into a laptop and using this new interface there, but for a laptop with a single disk I don't think these interfaces are as interesting.
>>: I [inaudible] very sure about this, when Amazon's direct service [inaudible] very similar API as [inaudible], so I guess -- do you think this interface is good, and is it flash specific?
>> William Josephson: I don't think it's flash specific, if that was your question, but I don't think it necessarily makes sense for individual consumer disks.
>>: It does a very nice [inaudible] interface for storage but [inaudible].
>>: This goes back to the whole [inaudible] question. This is [inaudible] semantic functionality from storage functionality and direct access to --
>> William Josephson: Still, I'm not claiming this part of the idea is new [inaudible].
>>: But I think it makes sense.
>> William Josephson: Yeah, I think it does make a lot of sense. I guess what I'm advocating is that there's no reason for flash to continue to have this block-based interface.
>>: [inaudible] you're hiding what media is running behind it.
>> William Josephson: And moreover, in the case of flash, as we saw, for things like garbage collection and wear-leveling there is a fair amount of mechanism there. The one thing that may be different is that somebody already has to go to the trouble of engineering all of that to get good write performance, so it makes sense to have this interface in that layer, as opposed to as an additional layer on top of it, because stacking abstractions has a cost, and you're not actually adding significantly to the burden of the person who is already implementing that more primitive abstraction. Okay. Well, I think that pretty much does it. Further questions?
>>: Why does flash need to have a number of pages in one [inaudible]?
>> William Josephson: So the big difference between NOR flash, which is random access, and NAND flash is to reduce the number of wires on the die. I think the primary reason is to make it denser when they fabricate it; it's a fabrication density issue as opposed to a fundamental issue. Readout from these chips is actually typically serial at the chip level.
>>: So in the first half of the talk you mentioned that the [inaudible].
>> William Josephson: Yeah.
>> William Josephson: Now, the conventional wisdom I often hear about flash is that when it fails, it's a failure to write or a failure to erase, so your data is still there and it's not a big deal. Talking with the Fusion IO guys, that's simply not the case, but the actual failure modes are something that seems to be, as far as I know, a fairly tightly held secret in the industry, particularly among the people like Samsung who actually fabricate the chips. I haven't been able to find a lot of good real-world information about failure modes. There's some device-level work that's been done in places like IEEE venues, but I don't know of a good study of what types of failures are happening in the real world, in the enterprise, and I think it would actually be very interesting to know. They have found they really do need additional ECC above what's provided by the chips in order to give the reliability you would expect, and they've tried to describe some of these failure modes, but I don't have a paper to refer anyone to. It's apparently a significant practical challenge, and not something I know much about. If there are any other questions, I think that should do it.
>>: I have one additional question. When you implement flash [inaudible], right, I mean [inaudible] and file system [inaudible].
>> William Josephson: Meaning --
>>: The mapping, basically, between [inaudible].
>> William Josephson: The virtual address and the actual physical [inaudible].
>>: [inaudible] if that's also implemented in a [inaudible] in the flash drive?
>> William Josephson: I'm sorry, say that again?
>>: So I mean, basically --
>> William Josephson: The index has to be locked. Is that --
>>: No. What I mean is, I know that for each flash block they usually have additional metadata, on the order of basically [inaudible], which basically [inaudible] some problems to that [inaudible].
>> William Josephson: And in fact, what --
>>: But if you do that, I mean, basically every time you boot you have to take a long time to read this.
>> William Josephson: And that's a real problem. What most people seem to do in practice, if they want performance, is they keep the index there and they have an additional chunk of flash. I said it was 160 gigabytes of formatted capacity; it's actually larger than that, and part of that extra space stores a log. It is literally a write-ahead log of sorts, and the non-volatile portion of it is stored in additional flash that's not addressable by the user, only by the device driver (a rough sketch of replaying such a log appears below).
>>: Now, the question is -- is that additional storage on the flash memory still useful? I mean, apparently you want to store the index in a certain hidden space so that when it's pulled out [inaudible].
>> William Josephson: Right.
>>: From the basic flash -- we view basically the index in the [inaudible]. But if you don't actually work through the [inaudible] basically flash drive, leaving basic information attached to each of the blocks doesn't seem to be necessary.
>> William Josephson: The fact of the matter is I'm not intimately familiar with that portion of the software stack in this device. I think that's also where some additional error-correcting information is kept.
>>: Yes, that's fair. Because I know that that space actually [inaudible] information is together with [inaudible].
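As a rough illustration of the logging scheme just described, here is a minimal sketch, in C, of replaying a write-ahead log of mapping updates to rebuild an in-memory logical-to-physical index after a restart. The record layout, field names, and the flat-array index are assumptions made for illustration; the actual on-device log format is not public.

    /* Illustrative sketch only: replay a log of (logical block -> physical page)
     * updates to rebuild an in-memory index after a restart or crash.
     * The record layout is an assumption, not the device's actual format. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct log_record {
        uint64_t sequence;  /* monotonically increasing log sequence number */
        uint64_t logical;   /* logical block address seen by the host */
        uint64_t physical;  /* physical flash page the data was written to */
    };

    /* Scan the log in order; later records win, yielding a consistent
     * (if possibly slightly stale) snapshot of the mapping. */
    uint64_t *replay_log(FILE *log, uint64_t logical_blocks)
    {
        uint64_t *map = calloc(logical_blocks, sizeof(uint64_t));
        if (map == NULL)
            return NULL;

        struct log_record rec;
        while (fread(&rec, sizeof(rec), 1, log) == 1) {
            if (rec.logical < logical_blocks)
                map[rec.logical] = rec.physical;
        }
        return map;
    }

The cost raised in the question above is visible here: the whole log has to be read before the mapping is available, which is why the time to do this at boot is a real concern.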
>> William Josephson: I think you had another question, or --
>>: So do they have any [inaudible] RAM on the flash?
>> William Josephson: They do.
>>: So where do the [inaudible]?
>> William Josephson: They're actually currently held in the device driver on the host CPU. That's why the garbage collector is actually running in the device driver, and that's why the device driver is such an expensive thing --
>>: [inaudible] so if you ever --
>> William Josephson: Well, there's also a processor running on the device, and it keeps enough information to do the logging as the requests come down to the device. So when you crash and come back up, there's enough in this separate area, where there's a write-ahead log, to reconstruct --
>>: Okay.
>> William Josephson: -- a consistent snapshot. It may not be the most recent snapshot of what was held on the host CPU in the driver. So in particular, some of the high-performance computing folks find it very frustrating that a fair amount of RAM and CPU horsepower is used by the device driver. But I think it's a very good point that, in general, the hardware device itself is not going to keep up with improvements in the host CPU; a company that size isn't going to be able to do enough iterations or get enough volume. And the DRAM on the device itself, although there is some, it's not a lot.
>>: So that's the [inaudible].
>> William Josephson: It is.
>>: So if you have a [inaudible].
>> William Josephson: That's right. So there's a chance to make sure that everything is written appropriately to flash.
>>: [inaudible] actually [inaudible] you could move all the RAM into the system and probably get the same amount of [inaudible] performance.
>> William Josephson: There are a bunch of design alternatives, and since I wasn't involved in the design decisions, I don't have a good sense of where the best place to draw the line is. And of course, with a startup, there are also a lot of economic forces, as opposed to just research considerations, in deciding where that line is drawn.
>> Jin Li: Any additional questions? Let's thank William for an excellent talk.
>> William Josephson: Thank you.
[applause]