>>Galen Hunt: All right. It is my pleasure to introduce Mike Swift. Mike and I have known each other since Mike was a graduate student at UW and I used to go admire his Nooks papers and then go up to him afterward and go, wait a minute, this has got to be a lot harder than this. What are you going to do about [inaudible] and things like that? At which point, Mike would have the discussion of the difference between product code and research code, things like that. And, oh, Edsy [phonetic], you're in good shape! Edsy showing up. And let's see, I guess the other thing. So we actually did try to hire Mike, but he had a life-long goal of being a professor. >>Mike Swift: That's true. My father is a professor and I had to live up to his aspirations. >>Galen Hunt: At some point in the selling conversation Mike said, look, I saw my dad teach when I was twelve and I've had a dream to be a professor ever since then. So he had to go try. >>Mike Swift: You know, at this point I can't remember if that's really true or not. [inaudible] since I told you it was true. Whether it was in a dream or just an assumption -- >>Galen Hunt: It made me feel better in letting him go to Wisconsin. So anyway, we're glad to have him back to tell us about the work he's been doing. >>Mike Swift: Okay, thanks. Anyway, thanks y'all for coming. I'm going to be talking about the work that I've been doing looking at the interface of solid-state storage devices like flash or phase change memory. So to put this work in context, I think I'm probably most known for work on device drivers and I've sort of expanded since then. And my real interest is kind of the bottom half of the operating system, where you're really focusing on interactions with the hardware and what that looks like. So with device drivers I first looked at reliability and then I've been doing work on sort of different models of how you program drivers, where the code runs, and running it in languages other than C. In terms of memory I spent a lot of time looking at transactional memory and how that relates to the operating system. How do you virtualize it, handle page faults and context switching? Recently I have some work on dynamic processors, where the processor can reconfigure itself, and again, what does this mean to the operating system? Most recently I've been looking at solid-state storage. And my interest here really is saying, if we have these new devices, what does it mean to the operating system? Are they really just faster, smaller, lower power disks? Or are they something different that we should treat differently and think about new ways of interacting with them? So if we look at storage trends over the last ten years, I think as all of you know, there's been this huge trend from sort of spinning [inaudible] disks to solid-state disks, and then hopefully we'll get to this. This is the Onyx prototype from UCSD, which is built on phase change memory, which is again an order of magnitude or two faster than flash storage. And so what's interesting is that as you look at these devices, these newer devices are not just smaller, faster disks; they have dramatically different properties and they behave very differently. And I think that's what makes it interesting to look at how we interact with them. So first looking at flash drives, the key property of flash drives, as I look at it, is that [inaudible] the storage medium is something you can't directly overwrite.
You have to erase large blocks of flash, which is pretty slow, on the order of milliseconds, before you can write to it. And what this means is that inside most flash drives there's a piece of software called the flash translation layer that will take incoming addresses and then route them through some kind of mapping table to find where the data actually lives within the device. In the device, data is written usually in some kind of log-structured format where whenever you write new data it writes it to a new location and then you can asynchronously garbage collect the old data. So this is very different than traditional disks where you just overwrite data directly. In addition, a key feature of flash is this limited endurance where you can only overwrite flash a limited number of times, which means that you have to spend a lot of effort in terms of where you do these writes to sort of level your writes out across the whole device. And the net result is you've got this fairly sophisticated piece of software on a processor inside your flash device doing all this interesting work. So the second new piece of storage technology is storage class memory, which I know there's been a lot of work on at Microsoft. And the key features that I'm interested in are first that you can sort of attach it to your processor in a way that makes it act a lot like DRAM. And this means that instead of using a device driver with a block interface, you can potentially just use normal load and store instructions within the processor to access memory. Furthermore, it's very fast, so it's faster than flash, and this means that software overhead matters a lot because all that software overhead really sort of hides the short access time. So the things that this potentially enables are things like having very fine-grain data structures that we make persistent, because we're not limited to just writing out data 4k at a time; we could write out four bytes at a time if we wanted to. And furthermore, it allows perhaps direct access from user-mode programs, because if we can just make it look like memory, programs can potentially just do load-and-store instructions against this storage class memory to access persistent data. So, again, this is very different than disks, where it is very slow and you need a device driver. So if we look within the operating system, though, and we see how does Linux, for example, abstract these kinds of devices? What we see is that really not a lot has changed. If you use flash as a file system, which is what most people do, you still have your standard read and write for reading and writing data and then sync for forcing data out and making it persistent. You can do mmap and msync if you want, but it still is more or less just like a file in a file system. Lower level within the operating system there still is the standard block interface, which has been around for a while, which is basically reading a block with bread. And then you can submit an asynchronous I/O request using submit_bio to read or write a block. But it's still just transferring a block of data in or out at a given address. No acknowledgement whatsoever that these things are more sophisticated. So the reason this is important is that these devices really are fundamentally different and they have really interesting features that using the old interface conceals. So if we look at flash SSDs, they have address translation built in.
And I think every operating system person knows that address translation is a great feature that we've been using for all kinds of nifty things like copy-on-write, and shared memory, and protection and things like that. And this great feature is built into devices you can buy. But we can't get at it to do anything interesting with it. [inaudible] garbage collection, which again is a very powerful general-purpose technique for managing data structures, that is being done inside the device, but it is not accessible outside the device to do anything interesting. And it would be nice if there were a way to make these devices available to applications. So one thing you can look at is anybody who is building high performance applications that run on flash typically tries to do their own garbage collection on top of it, where they'll sort of write things out sequentially, because sequential performance is so much better, and then they'll do their own user-level garbage collection where they compact and copy data. And that's the same thing the device is doing internally, and there's sort of a duplication of effort here. [inaudible] for storage class memory we've got this really low latency access, we've got byte addressability, but if you go through a file system you lose all that. So applications don't get sort of the raw latency that this can offer you. So to that end, what I've been trying to figure out is how can we expose these features to the operating system and applications better by looking at the interface that they present. Instead of saying everything must be a file, is there something else that we should be doing here, because files may not be the end-all be-all of data access. So specifically for any kind of device that has internal management I'd like to expose those internal management features to applications or to system software so they can take advantage of them, or at least coordinate their activity. And for anything that has really sort of a low latency, I would like as much as possible to sort of bypass all the software layers on top of that and really get that low latency right in front of applications so they can use it. So with this in mind, I'd like to talk today about two projects that we're doing. One is a paper that we presented at [inaudible] called FlashTier, and it's one approach at sort of rethinking the interface to solid-state disks by looking at what would we do if we were building a solid-state cache instead? If we knew we were building something just for caching. And then second I want to talk about a paper we had at ASPLOS last year on a system called Mnemosyne, which is how do you abstract storage class memory and make it available to user mode applications? So the reason I'm interested in flash as a cache is it's a really natural use for flash. So first of all, flash is a lot more expensive than disk. So if you've got a lot of data it could be too expensive to move everything into flash. It's maybe 10 times more expensive. But at the same time it's so much faster, maybe 100 times faster or more, that you really want to get the advantage of that performance. And so a natural thing to do is to stage all your hot data into a flash device, and then hopefully most of what you need to access -- I'm jumping ahead of myself -- most of what you need to access is on the flash device and you can get to it very quickly.
So this is such an obviously good idea that pretty much every operating system, except maybe Mac OS, has this built in at this point. Pretty much every storage vendor has a product that does something like this. Fusion IO sells a lot of caches, and I think OCZ has some caches that sort of come with hard drives. Yes? >>: [inaudible] >>Mike Swift: So there is a block-level filter driver that acts as a cache, called bcache, that sits under any file system or database and will then transparently send blocks to either the flash device or the disk. So it's a lot like, I think, Turbo Boost -- is that the Microsoft caching product, or ReadyBoost? I forget which one. >>: [inaudible] >>Mike Swift: So I think it's similar to Turbo Boost, which does transparent caching at the block level in Windows Vista and beyond, I think. So furthermore, big service providers -- Yes? >>: So this doesn't take any management from the [inaudible]? >>Mike Swift: So today most people who are doing this don't take advantage of the non-volatility. They actually, well, in a way they only write out clean data. They'll do write-through caching. And then most of these systems will basically delete the cache when you reboot the system. So they don't actually use the non-volatility. And I'll talk a little bit about how we can exploit that. You could do it, but in a lot of cases people don't trust the flash devices as much as they trust the disk, despite the fact that I know there's data from Intel that shows at least their SSDs are much more reliable than disks. Intel had a study where they replaced all their laptop hard drives with SSDs and the failure rate of laptops went down by a factor of five. So they actually saved a lot of money putting in SSDs. Anyway, so a natural way to do this, as I said, is to do it as a block device, because you can do this basically underneath the file system and you don't need to rewrite all your code to take advantage of this. You don't need to modify NTFS or ext2 to do this. Basically the file system talks to a generic block layer and then in the block layer you put a filter driver, which acts as a cache manager. And the cache manager has metadata about all the blocks that are being cached, and when a request comes in it can sort of decide, is this block located on an SSD or is it located on disk, and where should I read it? So on a read it will consult either one. On a write, if it's write-through, it will write to both, or it can do write-back and sort of write the data to the SSD, in which case the SSD acts like a really large write buffer, absorbing new data as it gets written, and then you can later de-stage it back to disk. So there are a lot of advantages to this: one is that it's transparent to most software in the system because it's at the block layer below everything else, and two, you're using a commodity piece, an SSD. So you don't need to go build a special-purpose device for this. You can buy any SSD and plug it into your system and take advantage of this. So for example, some of the hybrid drives out there just have a stock SSD built into the hybrid drive with some additional technology on top of it to do this caching stuff. So you can also do this sort of inside the device. The downside with doing this, though -- in my opinion, and maybe it's not just my opinion -- is that caching is actually a really different workload than a storage workload where you're actually trying to store data persistently. So one thing that's different is the address space.
When you have a caching device, every block that you're caching has its own address on the disk already, which means that when you write it out to the SSD you need to remember what was the original address that this block was stored at. So that's why the cache manager needs to have this table of every block that it's caching that says, for this disk address, this is the corresponding address on the SSD. Furthermore, if you want to make the cache nonvolatile, you need to write that mapping table out. And that's expensive enough that a lot of people don't bother writing it out. And that's a big reason for sort of recreating the cache after a failure. Second is the issue of free space management. So caches already have to manage free space; they want to use as much capacity as possible and they have to think about what to evict when they get full. Similarly, SSDs are doing free space management because they're absorbing new writes; they have to garbage collect old data, they have to figure out what to garbage collect and how to sort of manage the free space. So again there are things going on at two layers that could be coordinated. For example, it might be beneficial if the SSD, instead of garbage collecting data that was about to be evicted, would just delete the data instead. If you're going to evict it, why pay the cost of moving it around when you're not going to use it again? And then finally there's this problem of cache consistency and durability. We might like the cache to be durable across reboots, which would allow us to use it as a write buffer also and sort of take advantage of non-volatility. But because of the cost and complexity of persisting this mapping, this translation of disk addresses to flash addresses, a lot of flash caching products don't do that. For example, I don't think the Microsoft version or the Solaris versions of caches are persistent across system crashes. So the key point here is that a caching workload looks very different than a storage workload, and as a result we might want something different out of the device. Yes? >>: Are there write workloads in your experience where having the SSD in front of the disk is actually harming performance? Like does it actually get in the way? Or does it strictly do no harm -- >>Mike Swift: We haven't seen that yet, but I will say our experience is limited to a small number of traces from the SNIA website, which may or may not be representative. But I can definitely see if you have something that is just doing a lot of writes and you're basically not reading anything, then writing it out to the SSD really doesn't buy anything, because you're just going to fill up the SSD and then write it out to disk and it's just sort of added expense and latency in the system. So I think that would be the number one case. >>: And latency does go up? Can't you do them in parallel? >>Mike Swift: Well, the overhead goes up because you're spending more software time managing the cache. So you can do the things in parallel. So to address this we built a system called FlashTier, which basically starts with the normal caching architecture. The first thing we do is we take this SSD at the bottom and replace it with what we call a solid-state cache, which again is designed to be a commodity system that anybody can build with a standard interface, but it's built to be a cache.
So the first thing we do is we have a unified address space, which means that you can store data at its disk address directly instead of having to translate disk addresses into flash addresses. Second, it has cache-aware free space management, so it knows about which data might be cold data that you could evict instead of garbage collecting when you want free space. And third it has an interface with commands that are relevant to caching, for example, the ability to mark data as clean or dirty, or to forcibly evict data from the cache for consistency purposes. And finally, and something I don't want to spend a lot of time on, it also has strict consistency guarantees that allow you to reason about which data in the cache is usable following a system crash. Because you can know for sure that, for example, anything that you read from the cache is always the most recent version if you follow the protocol. On top of this we also modify the cache manager to basically use the SSC interface, and we've implemented both a write-back and a write-through policy. Yes? >>: So you're replacing the flash itself here as far as what you get to re-architect when you do this? >>Mike Swift: We are modifying the flash translation layer, effectively, which interprets SATA commands or SCSI commands and then figures out what to do. >>: So only the [inaudible], say -- you don't have within your design the ability to, say, make your blocks a little bigger or anything like that? >>Mike Swift: We have not done that at this point. We're trying to just sort of focus on that translation layer and the commands that come in over the SATA interface. >>: Is that something that can be modified on [inaudible]? >>Mike Swift: It cannot be done. As I'll talk about, there's a prototyping board where you can write your own flash translation layer. And we're in the process of implementing this system on that prototyping board. >>: But for practical use -- >>Mike Swift: For practical use our model is we would like OCZ and Intel to build devices that actually have this interface in them. That would be our productization goal. It wouldn't be something that Microsoft could then do purely in software, unless there were vendors that let you rewrite their firmware, and I think that's unlikely. Yeah. Okay. So I now want to say a little more about the design, and I'll focus on the three biggest issues, which are sort of address mapping and translation, managing free space, and the new interface -- sort of the commands in the interface. So here is [inaudible] the crux of the problem with address spaces: we basically have multiple address spaces going on. We've got disk addresses, so the host operating system has this translation from disk addresses to flash addresses. And then inside the FTL we have another mapping table that translates flash addresses into the physical location. And so we've got double translation going on. It really doesn't buy us anything. Yes? >>: [inaudible] >>Mike Swift: So they have a -- they don't map every single block. They have a small translation table, more like a look-aside list, that says if you're on this list, you get remapped. It's very much fixed-size, with a limited number of things it can remap.
Actually if you look at disk technology, though, there's this technology called shingled writes that allows you to basically double or triple your density by overlapping data, which means that you can't overwrite data anymore, because you write a track this wide but you sort of overlap tracks on top of each other like shingles. And so you can read this narrow strip of it, but you write something three times wider. >>: [inaudible] >>Mike Swift: Right. So you can't do in-place writes, and so disks are likely going to sort of have the same problem of not being able to do random writes anymore without -- What? >>: Do you think shingles are the -- >>Mike Swift: So in talking to people at [inaudible] they think that shingles are going to happen, because the reason to use a disk is lots and lots of storage. If you need performance you'll get an SSD. And so their take is, you know, the disk will basically get really, really big and dense, and then anything you really need random write performance from will use flash. And that sort of further distinguishes the two layers from each other. But I don't know on my own; that's just what I hear. Okay. So the approach that we take is to say, well, let's get rid of the translation table in the cache manager. Let's do the translation directly in the device. So here you can send a disk address right to the device and we can translate it from the disk address right to the physical location in flash. The second thing we do is change the data structure here, instead of kind of a page-table data structure, which is good if you have a really dense address space, because you have a big page that has lots of translations and everything is filled in. In a disk this makes sense because the number of addresses you translate is the same as the number of physical locations you have, whereas we have a system where the number of addresses you translate could be ten or 100 times larger. So we could have like a ten-terabyte disk system that we're then caching with a ten-gigabyte cache. And so we have potentially a thousand times more addresses. So we use a hash map data structure instead, which is optimized much more for a sparse address space. And we have some evidence that caching workloads are likely to be much sparser than a normal storage workload, just because the hot data is going to be sort of more widely distributed. Okay. So the second thing we do is sort of change how we do free space management. So the normal way that free space management happens in an SSD is through garbage collection. So you might have multiple blocks here. And remember the rule is that you can write the sort of small pieces of a block that are called pages, but when you erase, you have to erase a much larger block that's maybe 256 kilobytes or bigger. So when you do garbage collection, typically the process is like in the log-structured file system, where you find multiple blocks that have some valid data. You then sort of coalesce this valid data into another block. And then you can erase a block and you get a free block out of it. And so the problem with this is that we basically have to read a whole block, write a whole block, and we have to erase two blocks because we've erased both of these. And we have to start with one empty block to start this whole thing off. So it's a pretty expensive process. Furthermore, every time you copy data, you're writing to the device, and this sort of uses up its write capacity because it has limited endurance.
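To make the cost of that process concrete, here is a minimal C sketch of the copying garbage collection just described. The types and helpers are hypothetical stand-ins, not FlashTier code or any vendor's FTL; the point is just how many reads, writes, and erases it takes to reclaim one block.

```c
#include <string.h>

#define PAGES_PER_BLOCK 64                 /* e.g. a 256 KB erase block of 4 KB pages */
#define PAGE_SIZE       4096

/* Illustrative erase-block layout; real FTL data structures differ. */
typedef struct {
    unsigned char data[PAGES_PER_BLOCK][PAGE_SIZE];
    int valid[PAGES_PER_BLOCK];            /* which pages still hold live data */
} erase_block_t;

/* Reclaim one erase block: copy its live pages into a spare block, then
 * erase the victim. Every copied page costs an extra flash read and write
 * before the erase can happen, which both slows the device down and
 * consumes its limited write endurance. */
static void gc_reclaim(erase_block_t *victim, erase_block_t *spare)
{
    int dst = 0;
    for (int src = 0; src < PAGES_PER_BLOCK; src++) {
        if (victim->valid[src]) {
            memcpy(spare->data[dst], victim->data[src], PAGE_SIZE); /* read + rewrite */
            spare->valid[dst++] = 1;
            /* a real FTL would also remap the logical address here */
        }
    }
    memset(victim, 0, sizeof(*victim));    /* stands in for the slow block erase */
}
```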
There's a second problem, which is perhaps even worse, which is that the performance of garbage collection goes way down when you have a full device, because the chance of these blocks being sort of filled with stale things is much lower when your device is almost full. So this means that you have to sort of copy a lot more data to make free space when you're almost full. It's sort of like when you defragment your disk and it's almost full: it's a lot harder because you have to move a lot more data around. So there's data from Intel that shows that, at least for their SSDs, if you use them to their full capacity, then write performance goes down by 83%, as compared to if you reserve 30% of the capacity as basically spare. Yes? >>: A little clarification. What are the small boxes and the big boxes? >>Mike Swift: So the small -- sorry about that. So the big boxes are what is called an erase block. And this is the unit of data you can erase. So you can only do erases in a large block size of like 256k. You can do writes at a much smaller granularity, like about 4k or so, sometimes larger. And so the idea is you can write to each of these individually, but you can only erase the whole thing. Sorry about that. It's a good question. Does that make more sense now? Okay. So the other problem is endurance: because we're doing more garbage collection, we're doing a lot more writes to move data around, and our endurance goes down by 80% also. So this means that if you use a normal SSD for caching and you actually use the full capacity, it may not last very long because you're doing all this garbage collection. And finally, because we're doing caches, we actually want to have the device full. We want to get a good hit rate by having as much data as possible within the device. So the approach that we take is to say, you know what? This is just a cache. We don't actually need the data. So it might be beneficial sometimes to actually just delete the data instead of moving it around. This is actually an idea that I have to sort of credit Jason Flinn for. We were talking about flash devices and he said, you know, why bother actually storing data, let's just delete data instead and things would be a lot easier. And that's what we do. So to do this we have to make sure the device knows which data is dirty and which is clean. And once we know which data is dirty and clean, we can find a block here that has only clean data. And when we need some free space we can just delete the data instead of copying it around. And the nice thing is we have now made an erase block without any extra reads or writes. We just have to erase the one block and we delete the data. And if the data is indeed fairly cold, then there is very little cost to doing this. So it can be much more efficient than normal garbage collection. So we've implemented two very, very simplistic policies here that really don't even look at usage patterns at all. But we basically segment the flash drive into two regions. One is called the log region, which is where new writes come in; they're basically just sort of randomly addressed and things are just written in log order. So this also tends to be the hot data that's being written, and here we still do garbage collection, because this data was recently written and, we find, it's likely to be accessed again soon, so deleting that data really hurts performance because we get a lot of read misses.
This other portion is called the data region, and this is colder data that tends to be stored more sequentially. And here we just evict this cold data when we need space. And in this model, when we evict data we just recycle the blocks to become data blocks. So here we have a fixed-size log and a fixed-size data region. And an alternative is a variable-sized log where, if we're having a lot of writes, we can make the log much larger than normal. And this means we have a larger region to sort of coalesce data from when we do garbage collection. And then we still will do eviction from the data region, but we recycle the block to make a new log block. And we can also take a log block where, if you write all the data sequentially, we'll sort of convert it into a data block. The reason this is important is that we can also do the address translation for these regions at different granularities. So this is sort of randomly written new data, and we translate it at 4k. And this is sort of much more sequential data, so we translate it at an erase block granularity of 256k. And so this saves on the memory space needed for translation by partitioning things this way. And it's a fairly common approach, from what I understand, in real solid-state drives too. Yes? >>: Any [inaudible] policy for bringing a block the first time into the cache? Is it like when it is first accessed? >>Mike Swift: So we have two -- I'll get to that in a couple of slides. Yeah, when I talk about the cache manager. The cache manager implements the policy of when to put data into the cache. Okay. So the next thing is the caching interface, which is, what do we need to add to a solid-state drive to make it a better cache? So what we need is first some way to sort of do cache management, to identify which blocks are clean and which are dirty so we know what we can evict. And we also need some way to sort of ensure consistency, so for example, if you write data directly to the disk and bypass the cache, you might want to forcibly evict data from the cache so that you're sure that it never stores a stale copy of data. You also might want to change the state of a block from being dirty to being clean when you do write it back. So to address this we basically take the existing read and write block interfaces and sort of tweak them a little bit. First of all, we have two variations on write where you can write data back and say that it's dirty, meaning that you shouldn't erase it on purpose. Or you can write it back and say that it's clean, which means there is a copy somewhere else in the system, and if you need free space you can erase it. For read, the only thing that we do differently is that we now return an error if you read an address that isn't there. So normally if you have an SSD and you read an address that has never been written to, it will just make up some data and give it back to you. Typically zeros or something like that, but it could return whatever was written last time at that physical location. There's no specification for what happens when you read unwritten data. Here you know for sure when something isn't there. So the nice thing about this is that if you want to access the cache, it's always safe to just read a disk address off the cache. And if it's there you'll get the data. If it's not there you'll get an error. You'll never get invalid data. And then finally, for cache management we have two operations. One is evict, which invalidates a block. And the second is clean, which basically marks a dirty block as clean.
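To make the command set concrete, here is a sketch of how the SSC interface might look to the cache manager in C. The names, signatures, and error codes are illustrative assumptions rather than the actual FlashTier API, but the semantics follow the description above (write-dirty, write-clean, a read that can miss, evict, and clean), plus the exists probe that comes up a bit later for crash recovery.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t lba_t;                  /* disk logical block address; there is no
                                            separate flash address space visible to the host */

enum ssc_status {
    SSC_OK        = 0,
    SSC_NOT_FOUND = 1,   /* read of an address not present in the cache */
    SSC_NO_SPACE  = 2,   /* no free space and nothing clean to evict silently */
};

/* Writes: the host tells the device whether a copy exists elsewhere. */
int ssc_write_dirty(lba_t addr, const void *buf, size_t len); /* must be preserved      */
int ssc_write_clean(lba_t addr, const void *buf, size_t len); /* may be silently evicted */

/* Read returns SSC_NOT_FOUND instead of fabricating data for unwritten
 * addresses, so a miss is always detectable and never yields stale data. */
int ssc_read(lba_t addr, void *buf, size_t len);

/* Cache management. */
int ssc_evict(lba_t addr);               /* invalidate, e.g. after a direct disk write   */
int ssc_clean(lba_t addr);               /* mark dirty data as clean once written back   */

/* Recovery probe (described shortly): enumerate dirty blocks after a
 * crash so the cache manager can rebuild its dirty-block table. */
int ssc_exists(lba_t addr);
```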
>>: So it's a little bit of a trade-off with the read interface, because I'd imagine [inaudible] and if you miss you just [inaudible]. It just means that -- >>Mike Swift: So in this system we assume that this is sort of a stand-alone drive and it cannot go to the disk itself. So it's not a stacked interface. The cache manager will first go to the SSC and if it's a miss it will then go to the drive. So that piece is all in the software. And the reason we want that is because it allows you to sort of plug this into any drive system you want. You can put it in front of a RAID array. You can put it in front of a network file system. It can work with any form of block storage. >>: It seems like you could keep just a small bit of state in memory in [inaudible] to know whether or not the data exists. >>Mike Swift: Yes. So the nice thing is -- that's a great point -- this is always correct. Which means that you can have imprecise information in memory, sort of like a Bloom filter or some other approximate data structure, that says, is it worth checking the cache? And if it is, you can go and you're guaranteed to get the latest data or nothing. And if it says it's not there, you can go right to the disk and avoid the latency of going into the cache. >>: Since it's so cheap to go to the cache, would you want to speculatively go to disk and then cancel the IO? >>Mike Swift: You certainly could. We haven't looked into it. I know from talking to some people at [inaudible] that they really do worry about the cost of cache misses. And so they spend a lot of effort in not adding latency in that case. So that would make sense. Yes? >>: [inaudible] >>Mike Swift: Right. This is all implemented inside the SSC. So these are the commands that are being sent over the SATA interface into the SSC instead of the normal read block/write block. >>: So if you had more memory on the SSD, DRAM-type memory as opposed to flash, would that change the way you store the data structures? Could you do more if you had more computation? >>Mike Swift: We certainly wouldn't go with this mechanism of having the separate data region mapped at a coarser granularity, because there's a lot of overhead in sort of translating between 4k blocks and 256k blocks. But I don't think we'd necessarily change it. As you'll see, we do expand a little bit the amount of memory in the device, because we now have a hash table instead of a page table, which is a little bit less efficient. And we also have to store metadata about the blocks. So we have to store a bit that says, is this block dirty or clean? If we had more memory I think I'd add space for things like last access time also. But already we keep all of the metadata about all the blocks in memory, so that a read miss is really just a memory operation on the SSC and it doesn't actually have to go to flash. So that's why I wouldn't change too much. Yes? >>: [inaudible] >>Mike Swift: So we thought about that. And in talking to various people, and particularly Fusion IO, one of the things they said is they really find it's important to have some management software on the device itself. And the reason is that the flash chips themselves are so crazy and hard to use that things like the frequency with which you re-read a block, or handling wear management, are really important. And they find that by having this on the device they can actually improve their write endurance by a factor of two or three just by doing a better job on the device itself.
And so for that reason we want to have at least some management code on the device. And our other goal was -- you know, I think there's a huge benefit to having a standard interface that you can just plug a device into, instead of having to write a device driver for each operating system. And so we try to come at it with an approach of let's not require a lot of OS code, as much as I love drivers and writing them, because I think that most storage vendors would rather have a device you can just plug in with no required kernel-level software, particularly for Linux where you have to rewrite it every three months for each new version of the kernel interface. So I think that'd be a great design, but we sort of chose a different set of constraints. So finally we have an exists interface. The problem is that if you do crash and come back, you need to know which blocks are dirty so you can clean them again. And so we have a command that allows you to probe for the existence of dirty blocks. So on top of all this we have a cache manager, and we've implemented two caching policies so far. The first is write-through, where the cache manager stores no state whatsoever. Basically every write is written both to the SSC and to the disk, so we cache all writes. And then on reads, every read goes to the SSC, and if we miss we go to the disk and we populate the cache. So we have sort of the simplest possible caching strategy here. Yes? >>: In terms of the dirty blocks, I assumed you [inaudible]? >>Mike Swift: So we actually -- this is the piece that I said I wasn't going to talk about, the consistency. So we actually do internally use sort of atomic write strategies like the atomic writes from Fusion IO, where they use -- >>: [inaudible] >>Mike Swift: We don't use a supercap. We don't assume we have a supercap, but we actually do synchronous logging of metadata updates when it's important. And so that's part of our performance model. And for writes we do assume that the device -- Fusion IO has this idea where you can basically write a bit that says this is -- because you're writing everything to new locations you can actually detect a [inaudible] write, because you can see, have you ever overwritten this or is it cleanly erased? And so we assume we can do atomic block writes using that. That's a great question. For write-back it's very similar. Reads are exactly the same, where we try both. On writes we will write it back to the device and we populate a table in the cache manager that says this is a dirty block, and we have sort of a fixed-size table here. So if we have too many dirty blocks, then we'll start cleaning blocks and writing them back. And in our experiments we set this to 20% of the total size of the cache. And then finally when the system crashes, we use the exists primitive to sort of go and re-read to find out what all the dirty blocks are. Okay. So that is sort of the essence of the design. Now, I want to talk about our evaluation. And our three goals for this are, you know, performance is obviously a good one, reliability, and then is there some efficiency savings in terms of memory consumption? So we implemented this. We took Facebook's Flashcache manager, which they use in production to cache data in front of their MySQL databases. We took a flash timing simulator that was published by Kim et al. We do trace-based simulation on a number of different traces from the Storage Networking Industry Association. And here are sort of the parameters for our device.
We more-or-less tried to set the parameters to be about like an Intel 300 series SSD, which is sort of a medium-grade, latest-generation SSD. So here are the traces. We have two traces at the top that are very, very write-intensive. They're 80 to 90% writes. And we have two traces that are very, very read-intensive, sort of 80 to 90% reads, and so this covers different aspects of the performance. Yes? >>: [inaudible] >>Mike Swift: The blocks are 4k blocks. So you can see this is a really small workload. And -- actually I should mention this -- we size the cache to be able to hold one-quarter of the total number of unique blocks. So we assume we're going to cache the top 25% of things. And, to be honest, we could set this to any size and get any performance we wanted. At one point the student made it so big that everything was in the cache and performance looked really good. And I had to say that doesn't actually prove anything when everything is in the cache and you're not running any of our code. So we figure 25% is reasonable. We get a miss rate that's measurable so that we can see the impact of misses. So here are the performance numbers; in red is the SSD. We then have the two variations, sort of the two garbage collection or silent eviction policies, SSC and SSC-V, with write-through in blue and then write-back in green. So here are the write-intensive workloads. And this is where performance really gets helped because of the silent eviction policy. So what we can see is that particularly for write-back, we get about a 168% performance improvement over the baseline system. And this is because we're doing a lot less garbage collection. So we're getting a lot of writes here, and then once it's been cleaned we can just delete the data instead of moving it around. >>: [inaudible] >>Mike Swift: Right. It's the baseline system that is currently in production use at Facebook. >>: [inaudible] >>Mike Swift: Yes. So we're still using a cache, and this is not compared to a disk; this is still with an SSD, sort of the same performance model of an SSD as a cache. And you can also see that for a write-heavy workload, write-back definitely out-performs write-through, because we're basically using it as a write buffer. So if you overwrite something it's sort of one less write that has to go to disk, which is pretty much what you'd expect. For read performance, you know, I'd say performance is really unchanged. The read path is basically the same. We're not going to make flash act any faster. We're still translating addresses, so that doesn't help. So what this really measures is, what is the impact of our really simple eviction policy, where we're willing to evict any clean block that we're storing? It's basically a random eviction policy, and we can see -- I think we have about a 10% miss rate -- that even then we still get a very good hit rate on the read-heavy workloads, even though it's so simple. Yes? >>: On the [inaudible] because flash is better for media, you probably want to have an [inaudible] policy where you enter the block in the flash cache only after it proves itself to be hot enough. >>Mike Swift: Right. I think that would be a good idea, because we could definitely bring it into the page cache in the operating system, and then if it's referenced a number of times, then write it back into the flash cache, instead of just putting it there on the first access. So I think that's a great idea.
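As a sketch of the admission idea just discussed (not something FlashTier implements), the cache manager could count references while a block sits in the DRAM page cache and only write it into the flash cache once it has proven hot; every name and threshold here is hypothetical.

```c
#include <stdint.h>

#define ADMIT_THRESHOLD 2          /* promote only after this many page-cache hits */

/* Hypothetical per-block bookkeeping kept by the cache manager. */
struct block_stats {
    uint64_t lba;                  /* disk logical block address */
    unsigned page_cache_hits;      /* hits seen while resident in DRAM */
};

/* Called when a block is about to be dropped from the DRAM page cache:
 * returns nonzero if it was referenced often enough to be worth writing
 * into the flash cache (e.g. with a write-clean command). */
static int should_admit_to_flash(const struct block_stats *s)
{
    return s->page_cache_hits >= ADMIT_THRESHOLD;
}
```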
>>: Didn't you also have a double translation? >>Mike Swift: We do get rid of the double translation, but that's not a big expense. In the host memory it's a simple hash table look-up that takes, like, 20 nanoseconds or something. And for our translation, we timed the speed of using a hash table instead of a page table, and we see that it adds like five nanoseconds to the time to do a translation. >>: If you knew that you had to put this in the firmware, on the flash, on whatever persistently [inaudible] -- would the performance trade-offs, like looking up the page table versus looking up the hash table, make a big difference? >>Mike Swift: Right. So we are currently implementing this there. And I agree that a low-end ARM processor -- I think the device we have is a low-end ARM processor -- so I do think that there might be a bigger difference. But, you know, even if we were 10 times slower, the cost of that translation is so much less than the time to actually access flash that I don't think that'll add a noticeable amount of latency. It might mean, though, that we do need a slightly beefier processor to keep up with the translation. >>: And do you know, I mean this is some sort of prototype which you're using; I don't know if the actual production model where [inaudible]. >>Mike Swift: This is a commercial SSD controller. So it's the Indilinx controller from, I forget. I think Indilinx is a company that makes controllers that OCZ and Intel use in their products. And so this is a commercial controller, but we can provide the firmware for it. Do you have a -- >>: I'm just curious. How often do these reads get [inaudible]? >>Mike Swift: So this is a trace replay. So this is basically taken beneath the buffer cache, so the traces are just the things that already went to disk. So when we have the real prototype working, we'll have a lot more information about how this interacts with the buffer cache. Yes? >>: [inaudible] >>Mike Swift: We are using some of those, I believe. >>: So those are unmodified applications? So they already have large RAM caches. So [inaudible]. >>Mike Swift: Right. That is totally true. And, you know, I don't think using a flash cache would necessarily change the fact that they're using a large RAM cache, because there would still be two orders of magnitude better performance from using a RAM cache instead. But it is true that if we did reduce the amount of caching in those applications we would see a different trade-off here. Definitely. >>: [inaudible] applications have to use a RAM cache because their only other alternative is hard disk. But with the flash cache maybe they don't need that big of a RAM cache. >>Mike Swift: Yeah. I think that's a great point, but it's not something we've investigated yet. But I do think the -- we looked at a work on using flash for virtual memory, and we found that when you're swapping to flash instead of disk you could dramatically reduce your memory usage, in some cases by 80%, for that exact reason: because the cost of a page fault was so much less, you could basically swap a lot of your data out instead of keeping it in DRAM. And I think definitely the same thing would apply here. >>: One more question. [inaudible] from flash to hard disk, you tried to take advantage of sequential [inaudible] to hard disk in terms of choosing which blocks to evict? >>Mike Swift: So we do not at this point. So basically we're using Facebook's Flashcache software to choose which blocks to evict.
And so my guess is they already tried to keep it sequential, but we didn't modify that. In our performance model we basically assume that all disk accesses take half a millisecond. So it's a very simple performance model; we're not sort of looking at the locality of the disk writes here. But that would definitely be an extension. Okay. So the second thing we looked at is endurance. Does silent eviction help with endurance, because we're not doing as much garbage collection? So here what we measure is the lifetime, in years, of the device. So this is a device sized to hold 25% of the unique blocks. We look at the duration of the trace to calculate the frequency of writes and therefore the frequency of erases, and we figure out, at this level of workload, how long would the device last, assuming that you could do 10,000 overwrites per bit or so in flash. And so what we see is that there is a reasonably good improvement in lifetime here because of silent eviction, because we're not doing garbage collection. And in the mail case I think it goes from a place where really one and a half years for a device is probably too short from a management perspective to want to use caching, up to 2.8 years, where maybe it would be worthwhile because you wouldn't be replacing the device. And I will note that this is a very small workload, so take it with a grain of salt. In the case of read caching, we see that the first, most important thing is that because this is a really big workload, endurance isn't a problem. There's just a lot of it to spread writes over and there aren't that many writes, so endurance is very high, 200 years or something like that, and the endurance does go down a little bit. And the reason for that is that because we're evicting data that gets [inaudible] referenced, we do have to read in data and do extra writes we wouldn't have had to do otherwise. But because it's so rare relative to the size of the device, it doesn't impact lifetime that much. >>: [inaudible] -- >>Mike Swift: Yes. So we do garbage collection sometimes. But if we go back here, this performance drop here is because of the hit ratio. So we're losing about 2% of our performance because we're not hitting in the cache as much. And I think if we had an eviction policy that was more sensitive to usage, that actually used recency of access instead of random, we would improve this. And so the final thing we look at is memory usage. And here for the write-intensive workloads we have in blue the amount of memory on the device used to hold the translation table, and then the amount of memory used in the host to hold whatever data it needs. And here we only look at write-back, because with write-through the cache manager didn't store anything, so the red bar would be zero. So the ratio of these things is the same across all workloads. So we can just take one for example, and what you can see is that with the normal SSD there's a small amount of memory in the device -- this is very compact, it's basically a page table containing every block on the device. But the host has a lot of memory because it has to store a hash table that keeps track of every block stored there. In the SSC case, the device memory goes up because we've gone from a page table structure to a hash table, which is slightly less dense, and so we increase memory usage in the device by about 10%, but the memory usage on the host has gone down by 80%, because we're only keeping track of the dirty blocks and not all the clean blocks also.
And that holds across all the different workloads. Andrew? >>: So if I'm going to build one of these things, how do I decide how much memory to put in it? If I'm going to buy one do I have to trade off [inaudible]? >>Mike Swift: So what you would say is when you build the device you know how many blocks it has physically, and you need a hash table that can hold mappings for that many blocks, and that's how you would size the memory on the device. Similarly, in the host you'd want to reserve enough memory for whatever your fraction of dirty blocks is. So remember -- >>: [inaudible] >>Mike Swift: So the reason is we size the cache differently for each workload according to the size of the workload. So this is sort of a one-gigabyte cache and a five-gigabyte cache and a 100-gigabyte cache or something like that. So that's why the amount of memory you need is different, because the number of blocks you're caching is different for each workload. That's a good question. Okay. So as I mentioned, this was all done in simulation. This is the board that we're currently using. It's the OpenSSD prototyping board, and it has this Indilinx [inaudible] SSD controller. It has some memory, some flash, and you can basically write your own FTL to go into here. It comes with three FTLs, and it performs relatively well, so I think it provides useful results. And so my student has implemented the stuff, and he says that it'll all work, but he hasn't actually tried it yet. So we'll see what happens. But it's nice, because I think until this year it was very difficult to actually build an FTL and do anything. I know Microsoft Research had a big effort to build something on the BEE board, the BEE3 board or something, and this is a lot easier: you just use a commercial controller. And I think it's probably more representative. You're limited in what you can do because everything is sort of commodity parts, but at least for FTL research this works pretty well. Okay. So in summary, the takeaway of this research is that making flash look like a disk is great for compatibility, great for getting devices out there, but there are a lot of neat things about flash that get hidden when you do this. And one thing we can look at is, are there better interfaces? So we've looked at one interface, which is really designed for caching, which makes it easier to write a cache, performs better, and in some cases reduces memory consumption. And, you know, I think there are other opportunities to look at other interfaces to flash tailored to specific applications, or even an interface to flash that sort of subsumes this and handles other applications as well. Yes? >>: [inaudible] do you use the non-volatile property to do fast recovery, like an instant-on cache? >>Mike Swift: Yes, we do. So we have a write-back and a write-through. In all cases the cache is actually non-volatile, and so as soon as you turn it on everything is accessible in the cache. And I have a separate slide, I don't know if I have it with me, that shows the recovery time is half a second or something like that, whereas the Facebook Flashcache, which had a non-volatile option, took about 30 seconds to a minute to recover, because it had to reload all the metadata into memory. So we do have that capability. Yes? >>: Earlier you talked about the limitations of flash because you plan to leave 20 to 30% empty. >>Mike Swift: To get high endurance. >>: To get high endurance, yes.
So have you done experiments where you could, or do you think with this system you could, keep the flash 90% full and not take the big [inaudible] hit, or is that just fundamental? >>Mike Swift: So I think it depends a lot on the workload. But what we can do, which I think is more interesting, is that we no longer have a fixed-capacity device. We can decide for a workload how much of it we can actually use for live data. If we have a very static, read-intensive workload where we're not doing any writes, we could use 100% of the capacity because we don't need to reserve anything for incoming writes. So there's one copy of every block and you can use all of it. If we have a very write-intensive workload, what we want to do is absorb writes as quickly as possible, like a write buffer, and coalesce things that are overwritten. And then we might use only 30% of the device or 50% of the device. So really what we want to do is size the device to optimize performance, rather than hit rate or something like that. >>: Facebook's cache, they do the same thing. They can use 100% [inaudible], right? >>Mike Swift: Well, the device internally actually reserves 7% for logging purposes and things like that. Yeah. But you can definitely do it statically -- I mean, Intel recommends if you're doing a cache you should statically reserve, create a partition that is 30% of your device and more-or-less not use it, to get optimal performance. And this would allow you to dynamically react to a workload. Okay. So I think I have about five or 10 more minutes or so, so I'd like to go on. I will try to not go through this too fast, but I'll try to cover the highlights in the time that I have remaining. So the second piece is looking at what we do with storage class memory. It's a really great technology that can perhaps be the dream of system architects: having persistent memory that's really fast. And the question is how should we expose it to applications? So look at what we did with disks. With disks we typically have a pretty deep stack of software that we have to go through before we access the disk. And if you think about it, there's actually a really good reason for this. So at the bottom we have devices that have no protection mechanism. The disk can't tell which processes are allowed to access which blocks. So we need to go through the operating system, which actually implements protection through file system [inaudible]. Similarly, we use DMA to access it. DMA operates on physical addresses. You don't want user mode code doing DMA; it's not protected. Again, we need the kernel to handle the data transfer. We have a device with incredibly variable latency, and so having a global scheduler that can reorder things can really optimize performance and get you ten times better performance if you do things the right way. Similarly, because it's very slow, having a global cache where we cache commonly used data across processes makes a big difference. So for disks this is a great architecture and I think it's worked really well over a lot of years. If we look at using storage class memory, we can see that things change a little bit. So at the bottom, if we map it as memory, we have hardware protection. We can use virtual memory protection bits, so we no longer need to go into the kernel to implement protection. Because we're accessing things directly with load and store instructions, we don't need a device driver that mediates all access, and so we can get rid of the need for sort of a global device driver.
Because we have more-or-less constant latency, we probably don't need an IO scheduler to reorder requests, and so we don't need another global piece. And then because the latency could be as good as DRAM, we may not need as much sort of shared caching across processes, because it's just not that expensive to go fetch things directly from SCM. And so what this means is that we think all these reasons that said file systems have to go into the kernel to access data may not be true with storage class memory anymore. And so what we set out to see was, could we build a way to expose SCM directly to applications as memory, using memory protection to control access? So to address this we built a system called Mnemosyne. The student who did this is Greek, and so we were told not to use his name because it was too hard to pronounce. But it turns out that the Latin word for Mnemosyne is Moneta. And Steve Swanson's group had a -- their PCM prototype disk was called Moneta. So that was already taken. Clearly we were consulting the same source of names. >>: There's another [inaudible] with this name from STI -- >>Mike Swift: Okay. Yeah, right. >>: [inaudible] >>Mike Swift: Right, right. Well, Mnemosyne is the personification of memory, which is a great -- we tried Indian words also, but we couldn't find one we were comfortable with. Anyway, so the goal here is that first we have a persistent region in your address space where you can put data that survives crashes. And then we also have a safety mechanism so that if you're doing an update in that region of memory and you crash, you don't get corruption. Kind of like file systems do journaling and shadow updates to prevent corruption. So the basic system model we have is the picture I showed before. We've got DRAM and SCM both attached to your system. In your address space we now have regions that are volatile and regions that are non-volatile. You can put data structures in them. If there's a crash and something bad happens, the volatile data goes away, but your non-volatile data survives. So this is the basic mechanism we want to give programmers. So one of the key challenges here is, what if you're in the middle of doing an update like inserting into a linked list, and you've got your update partially done and something crashes before you fix the data structure? When you restart the system you'll find that your data structure is inconsistent, and you either need to walk your data structure to figure this out, or you may suffer with incorrect data. So this is what we would like to help programmers with as part of this. Yes? >>: Is it still slower than DRAM? >>Mike Swift: It is a little bit slower than DRAM, and it's getting slower relative to DRAM every year. It's getting slower faster than DRAM. DRAM is getting faster, faster than it is getting faster, I guess. >>: [inaudible] >>Mike Swift: Yes. So actually I talked -- I agree there are definitely some system integration issues. I've heard that processor pipelines do really poorly with long, variable-latency events. So if you have a write that takes a microsecond, you know, your pipeline is not built to have things that are outstanding for a thousand cycles or two thousand cycles. So integrating this into the system is not -- you know, when we started this work PCM looked a lot more promising, and the speeds were going to be half the performance of DRAM instead of a quarter of the performance. And so it made more sense.
But we still -- you know, the read performance I think still looks pretty promising. It's the write performance that really would be an issue. And the way we handle writes, we actually encapsulate the writes separately, so we can do the writes off the processor pipeline -- I'll show you how we do that so they don't affect other instructions. But I agree that it's definitely an issue. What? >>: [inaudible] >>Mike Swift: Um -- >>: [inaudible] >>Mike Swift: Is there what? >>: The reads are comparable? >>Mike Swift: The reads are projected to be a lot closer to DRAM reads; it's the writes that suffer by comparison. Currently the projections for writes are around a microsecond and reads are around 200 nanoseconds, I think. Ed could probably tell you more about it if you have questions. Okay. So our goal is basically to make this consistent. The first thing we have is a programming abstraction, which is basically a persistent region where you can declare global variables to be pstatic variables. They're like static variables that survive multiple calls into a function, except these variables survive multiple invocations of the program. And this is really designed for single-instance programs like Internet Explorer or Word, where you're only allowed to run one copy at a time, and if you try to start a second copy it will kill itself and redirect you to the first copy. So this is a pretty standard model for applications that manage their own persistent state to begin with. We then also have a dynamic way to create this using pmalloc, where you can basically do heap allocation. Our goal here is that you can allocate some data with pmalloc, hook it into some kind of data structure, and then the pstatic global variables are how you name this data and find it again after a reboot. So that's your naming: this is how you name persistent data and get access to it again. So the key challenge here is how you update things without risking corruption after a failure. Here is an example of the kind of thing that can happen. Suppose you have a data structure that has two fields, a value field and a valid field, and the invariant is that the new value shouldn't be used unless the valid bit is set. So suppose you first set the new value. The old valid bit was zero, so we know this, for example, is invalid data. We then set the valid bit. So now in the cache we see we have a new value and it's valid. The challenge is that the data only survives a failure if it actually gets all the way back to SCM. If we crash and the data is in the cache, it gets lost. So this is a lot like your buffer cache of disk pages: if they're in memory you don't get to keep them; if you write them to disk, you're good. The challenge here, though, is that we don't get to control when things get written back. The cache itself decides when things get written back, and it can write them back in any order. So the processor cache may evict the valid bit before it evicts the value. When we have a crash failure now, what we see is that we have inconsistent memory in SCM and our invariant is broken. Andrew? >>: This sounds a lot like [inaudible] processor consistency. >>Mike Swift: It is. >>: Can you have things like [inaudible] -- >>Mike Swift: Yes, you can. It's the same basic ordering problem that you have to solve, except instead of being visible to other processors, it's about being written out to memory. Exactly.
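To make the write-back hazard just described concrete, here is a minimal C sketch. The pstatic qualifier and the field names are illustrative assumptions (the talk describes the construct but not its exact spelling); the point is only that two ordinary stores can reach SCM in either order, because the cache controls eviction.

```c
/* Sketch of the ordering hazard, assuming a hypothetical "pstatic"
 * qualifier that places data in a persistent (SCM-backed) region. */
typedef struct {
    long value;   /* the payload                                    */
    int  valid;   /* invariant: value may be used only if valid == 1 */
} record_t;

pstatic record_t rec;   /* persistent: survives restarts */

void update(long new_value)
{
    rec.value = new_value;  /* store #1: lands in the processor cache */
    rec.valid = 1;          /* store #2: also lands in the cache      */

    /* The cache may write store #2 back to SCM before store #1.
     * If the machine crashes in that window, SCM holds valid == 1
     * with a stale value, and the invariant is broken on restart. */
}
```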
So the basic thing that we need is a way to order writes, because ordering writes allows you to commit a change. If you have a log, you can say the log record is done; if you have a transaction, you can say the transaction is committed; with a shadow update, you can say here's the new pointer. So we need a way to order writes, and our goal is to do this as much as possible without modifying the processor architecture. There are things you could do; for example, there's some work that I think Ed was part of that looked at whether we can introduce epochs of writes into the cache that say this is the order in which things must be evicted, to preserve the ordering. Our approach instead is to say, well, fundamentally what we need is a flush operation, which says force something out to PCM, and then a fence operation, which says wait for everything that's been forced out to memory to actually complete. So it's a lot like asynchronous IO, where you're doing an fsync and then you're waiting for it to complete. And with this we can order any pair of operations and make sure that they actually become persistent in the right order. So if we do this, we can make sure that we force the value out before we actually force the valid bit out. If you're not familiar with these instructions, MOVNTQ is a non-temporal store. It's meant for streaming large amounts of data out, and it isn't ordered with respect to normal stores, so it doesn't obey TSO or sequential consistency, and so it works better, I believe, for long-latency operations. It's also write-combining, so it's the best way to push a lot of data out to memory. And then CLFLUSH just flushes a cache line. CLFLUSH is useful if you're updating a data structure in the cache and then you want to commit it at the end; the non-temporal store is good if you know that you're writing a log and you want to force your log out, and you're not writing it in pieces or something like that. Okay. So we have these primitives that allow us to do ordering. We think this is useful, but it's not what programmers want to use directly. So the first thing we built was a set of data structures. Basically we have a persistent heap, which we use for pmalloc, that allows you to allocate and free memory, and it keeps track persistently of what's been allocated and what's been freed. We also have a log structure that handles things like torn writes, so if you're in the middle of writing to the log and you crash, we can tell, at very high speed, which log records are complete and which ones are not. And then on top of this we have a general-purpose transaction mechanism where you can put arbitrary code within a transaction, and whatever persistent memory is referenced within that code will actually be made atomic, consistent, and durable if the transaction commits. So I'll show you a little bit about how this works. Here we actually just reuse existing software transactional memory technology. We use the Intel software transactional memory compiler, so you can put an atomic keyword into your program; here we can put our two updates in there. The compiler will generate instrumented code with calls out to a transactional runtime system to begin and commit the transaction, and then every store will also be instrumented with the fact that this data was updated and what value was written out there. And there's a runtime mechanism that implements a write-ahead redo log.
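As a rough illustration of the flush-and-fence idea, here is how the value/valid update from before might be ordered with x86 intrinsics (_mm_clflush for CLFLUSH and _mm_mfence as the fence). This is a sketch of the primitive, not the Mnemosyne code itself; it assumes that a flushed line actually reaches SCM rather than stopping in a controller buffer, and it ignores that the two fields may share a cache line in a real layout.

```c
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */

typedef struct { long value; int valid; } record_t;

/* Flush the cache line containing *p toward SCM and wait for ordering. */
static void flush_and_fence(const void *p)
{
    _mm_clflush(p);   /* evict the cache line holding *p                */
    _mm_mfence();     /* don't proceed until the flush is ordered/done  */
}

void update_ordered(record_t *rec, long new_value)
{
    rec->value = new_value;
    flush_and_fence(&rec->value);   /* value reaches SCM (assumed) first */

    rec->valid = 1;
    flush_and_fence(&rec->valid);   /* only then does the commit flag land */
}
```

The same pattern is what a redo log needs at commit: write the log records, flush and fence them, and only then write the "this record is complete" marker.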
So everything that gets stored is written to a redo log, and then at commit we force the log. And then we can lazily force out the data itself, because we have this log around. So by doing this we get full ACID transactions for memory. And the nice thing is, because we're using transactional memory, we avoid a standard problem in persistent memory: what if you have a lock in persistent memory, and you write out the fact that something was locked and then you crash? How do you figure out how to clear all those lock bits afterwards? Recoverable virtual memory from [inaudible] had a special mechanism to fix all this, but transactions avoid the need to keep any state about what's locked within the data structure itself. So here is a high-level picture of how you might use this to store a hash table. We have a pstatic global variable that we can use pmalloc to allocate space for. One thing to note is that pmalloc takes the address that you're allocating memory to as an out parameter. This is important because there's a race condition where you might allocate memory and then crash before you assign it to your pstatic variable. Making it an out parameter means that internally the allocator uses a little transaction to allocate the memory and atomically assign it to the pointer, so we can guarantee you won't leak data if you crash in the middle of this. To insert into the hash table you basically begin a transaction, allocate some data, put in the key and value, and insert it into the hash table; and if you crash at any point within this, the transaction system will wipe away everything and make sure that when you restart none of these changes show up, because you never committed the transaction. So that's the big picture of what we would like to enable, and our implementation is pretty much what you might expect. For evaluation, we did not have any SCM or PCM available to do this, so we have a very simple performance model. We're really focused on the latency of writes: does this make it faster to write data? Because all the writes go through the transaction mechanism, or through our primitives or our data structures, we basically instrument all of the writes to SCM with a delay, a [inaudible] delay, and we default to a very aggressive 150-nanosecond added latency on top of accessing DRAM. For comparison we compare against running ext2 on top of a RAM disk. ext2 is pretty lightweight, so it performs pretty well in this configuration. Also, here we're largely looking at data within a single file and not metadata operations, so it's really just moving data -- doing memory copies back and forth. It doesn't really stress the metadata, which is where a better file system might have an impact. Our microbenchmark is just looking at a hash table. This is a comparison of a hash table my student downloaded off the Internet and made persistent using transactions, against Berkeley DB, which is the thing that everybody in the open source community uses for comparison. What we compare here is Berkeley DB and our hash table at three different granularities of data -- an 8-byte value, a 256-byte value, and a 2 KB value -- and at three different latencies for PCM: 150 nanoseconds, 1 microsecond, and 2 microseconds. And what you can see overall is that for small data structures, or small updates, and for fast memory, we do really well.
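Going back to the hash-table usage just described, here is roughly what that code might look like. The atomic-block syntax, the pstatic qualifier, and the pmalloc signature (with the destination pointer as an out parameter, as described above) are hypothetical spellings of what the talk describes, not the actual Mnemosyne API.

```c
#include <stddef.h>

/* Hypothetical spellings of the constructs described in the talk. */
typedef struct node {
    long         key;
    long         value;
    struct node *next;
} node_t;

#define NBUCKETS 1024
pstatic node_t *buckets[NBUCKETS];   /* persistent root: how the data is found after a restart */

/* Assumed signature: pmalloc takes the destination as an out parameter so
 * allocation and publication happen in one small transaction -- no leak on a crash. */
void pmalloc(size_t size, void **out);

void hash_insert(long key, long value)
{
    atomic {                               /* STM-compiler atomic block (hypothetical keyword) */
        node_t *n;
        pmalloc(sizeof(*n), (void **)&n);  /* allocate persistent memory               */
        n->key   = key;
        n->value = value;
        n->next  = buckets[key % NBUCKETS];
        buckets[key % NBUCKETS] = n;       /* link into the persistent table           */
    }   /* commit: the redo log is forced to SCM; a crash before this point leaves nothing visible */
}
```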
And it's because we really get very low-overhead access to memory. As we get to longer latencies, all the optimizations that Berkeley DB does for disk around sequential writes actually help, because PCM looks a lot more like a really fast disk at that point. Similarly, for small data structures the transaction mechanism is pretty fast -- we're copying small amounts of memory -- but for larger data structures, copying things a byte at a time through the software transactional memory is less efficient than doing big memory copies like Berkeley DB does, and so we don't perform as well in that large case. So we're really optimized for fast memory and very fine-grained data structures. I will just show the highlights of the application workloads, and the net result is that for applications that are really storage bound we can be a lot faster. This is a program that uses msync to force an in-memory data structure out to disk; we replace the msync calls with transactions and it's a whole lot faster. This is a lightweight directory server, kind of like Active Directory, and here we have some performance benefit but not as much, mostly because the PCM disk is so fast that there's not that much to speed up in this case -- but it still does make a difference. So, any questions on Mnemosyne? Okay. So I want to say a little about where we're going with this. Mnemosyne is really focused on a single application with a single region of memory, or a couple of regions, and we want to see whether we can extend this to the whole file system. Ed did some great work building a kernel-mode file system optimized for storage class memory. And we want to say, well, if we can take storage class memory and map it into your address space, can you actually access the file system without calling into the kernel? Can you just have your user-mode programs directly read file system metadata out of memory and find data on their own? We'd like to be able to support sharing, where you have multiple programs sharing data, perhaps at the same time. We want protection, so we can control who can do this, and we want to rely on memory protection to implement it. And we also want naming, where you have a normal hierarchical name space to actually find things. The benefits we see are, first, performance, because we can avoid calling into the kernel, which has some overhead, and second, flexibility, where it could hopefully be a lot easier to design an application-specific file format, because you could potentially just be writing code in user mode, where hopefully it's an easier environment, and you can customize the interface or layout for your application. So we're working on a prototype, and this is what it looks like. In the kernel we have what we call the SCM manager. All it really does is allocate memory -- pages or regions of memory. It handles protection, in that it keeps track of which processes can access which data, and it handles addressing, by mapping this data into user-mode processes' address spaces. And that's all it does; there's nothing specific to a file system here. We then have a trusted server, which implements synchronization in sort of a lease manager. This is a lot like [inaudible] if you're aware of it. Yes? >>: [inaudible] >>Mike Swift: Yes, I'll get to that in a second. So we have a lease manager that implements sharing -- this is a distributed lock manager, pretty straightforward. We have a trusted server here, which handles metadata updates.
And this is the only thing we really centralize, because we can't trust individual applications to correctly write metadata -- they could corrupt the metadata and affect everybody else -- so we centralize those updates. The applications, though, have a shared name space here, and they access it by figuring out which page of memory holds the superblock, more or less, and then reading and parsing the file system metadata structures right out of memory and finding whatever data they want. So far we have no caching whatsoever here: every time you open a file, we go and more or less reread memory to figure out where to find it. The nice thing is that, basically, if you're doing a read-only workload, the application can run completely within its address space without any communication at all, except to acquire some locks initially to prevent anybody from modifying that data. So it has the potential to be a lot faster and to reduce communication in a multicore system. We have some preliminary results. Using FileBench and a file-server profile, we're about 80% faster than calling into a kernel-mode file system. We also implemented a customized key-value file system where the interface, instead of open/read/write/close, is really just get and put, so we're bypassing creating any transient data structures, and we also support fixed-size files -- suppose you're storing image thumbnails or something like that. Here we're about 400% faster on a web proxy profile. So this is very preliminary, and of course we're not doing really complicated access controls like [inaudible], but it kind of gives us a sense that there is something to be had here. So in conclusion, I think these new storage devices really have a lot of new capabilities, and making them look just like block devices conceals a lot of their interesting features. The goal of my work is to figure out whether we can find new ways of accessing those devices, or making them available, so that programs can actually get direct access to their features and benefits. So with that I will stop and take any remaining questions. And thank you all for coming. [applause] Anything left? >>: It appears not. >>Mike Swift: Okay. Great. Well, thank you all.