>> Chris Hawblitzel: All right, so it's my pleasure to welcome Adrian Caulfield. Adrian got his
bachelor's degree here in Seattle at the UW, and he's getting his PhD from UC San Diego,
advised by Steven Swanson, and he'll be talking about designing or redesigning storage systems
for fast non-volatile memories.
>> Adrian Caulfield: Thank you for the introduction, and thank you for inviting me up to
interview here at MSR. So over the last few years, in the Non-volatile Systems Lab, we've been
working on integrating emerging non-volatile memory technologies like phase-change and spin-torque transfer memories into systems.
And what we found is that as these technologies drive latencies from seven milliseconds or so,
like we would have with disk drives, down to a couple of microseconds, the software overheads
that we experience are actually skyrocketing. So we go from about 1% software overhead going
through the kernel stack all the way up to about 97% with these very fast non-volatile memories.
And so I'd like to first set the stage a little bit with some sort of background on where storage
systems are going, why this is an interesting problem, and then I'll walk through a couple of
different iterations of the prototype SSD storage system, Moneta, that we've created. And then
I'll talk a little bit about what direction I think some storage research should be going and some
future ideas that we could look at.
So we're really living in the data age now. The world's collecting data at an astounding rate. I'll
just give you an example of how this is going. In 2008, we processed about nine zettabytes of
data. So this is 10 to the 21st. It's a phenomenally large number, and we have huge scientific
applications that are generating data at very fast rates. The Large Hadron Collider, for example,
generates terabytes of data with every experiment they run. Large astronomical surveys are
doing nightly sky surveys that generate petabytes of data, and we have to be able to process this
information.
Websites like Bing, Google, YouTube, Facebook, are all collecting lots of user-generated
content, as well as large indexes of the web, and so we really need to be able to start extracting a
lot of knowledge from this data that we're collecting, and it turns out that storage performance is
one of the major bottlenecks holding us back from being able to do this. But new storage
technologies like phase-change memory and spin-torque transfer memories can actually help
solve this problem, as long as we're careful to not squander the performance that they offer.
So if we look at the trends in storage technologies over the last couple of years, starting with
hard disk drives -- these numbers are for an array of four hard disks, but we can get latencies of
around seven milliseconds. Random access bandwidth reading four kilobytes of data at a time
gives us bandwidth of around 2.5 megabytes a second. This was sort of the case for the last four
decades or so, up until about 2007, which saw the introduction of flash-based PCI Express SSDs.
And these devices significantly decreased latencies to around 58 microseconds. They've
increased bandwidth significantly, to about 250 megabytes a second, so this is about a 100X
improvement overnight from what we had with hard disk drives.
And if we continue down this road, devices like I'll talk about today, which might be
commercially available around 2016, have latencies of around 11 microseconds. Bandwidth
goes up to about 1.7 gigabytes a second, mostly constrained by the interconnect that we're using.
And so we can get 650X improvements for both of these, latency and bandwidth.
>>: Just a quick one. So PCIe flash, is that the same as an SSD?
>> Adrian Caulfield: Yes, so this is an SSD attached to the PCI Express bus, so think Fusion-io
or something like that.
>>: This feels a little misleading to me, because if you're really doing big data, you're not
reading it 4K at a time, and on sequential reads and writes, there's only about a 2X difference
between PCIe and a hard drive.
>> Adrian Caulfield: Sure. So for sequential accesses, the gap is smaller, but these new
technologies are still going to give us significant improvements. For random accesses, which a
lot of applications require, especially if you have very large data sets that you need to sort of
query and pull various bits of information out of, these numbers are certainly things that we've
measured in the labs with workloads that we have. So, depending on your workload, they're
going to change.
So if we do the math here, between 2007 and 2016, it works out to about 2X a year in terms of
performance improvements for both latency and bandwidth, and so this kind of scaling is
actually better than what we've seen with Moore's law with CPU performance improvements
during its peak. So the types of memories that I'm talking about are faster-than-flash non-volatile memories. These are devices that have interfaces that are as fast as DRAM, or nearly so,
maybe with a factor of two or three off.
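As a quick sanity check on that rate (my own back-of-the-envelope, not a figure from the talk): a 650X improvement spread over the nine years from 2007 to 2016 corresponds to $650^{1/9} \approx 2.05$, i.e. roughly a doubling every year.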
>>: Can you go back one slide? This all seems great. There must be some downside to all this,
right?
>> Adrian Caulfield: Yes.
>>: What is it?
>> Adrian Caulfield: So I'm going to get to that, but one of the big downsides here is that
software overheads are actually going to limit the performance that we can get from these
devices, unless we do something about it. And so I'm going to walk through some of the ways
that we've been able to tackle that problem with our prototype SSDs.
So the memory technologies that I'm looking at are things that are as fast as DRAM. They're as
dense as flash memory, or they will be soon. They're non-volatile, they're reliable, and they have
fairly simple management requirements. We don't need a large, thick management layer like you
would with flash memory or something like that. So phase-change memory, spin-torque
MRAMs and the memristor are all pretty good examples of the kinds of memory technologies
that we're looking at, and at least one of these is going to be commercially available within a few
years at the performance characteristics that we're looking at.
But the challenge here is that the relative cost of the software overheads that we have on top of
these devices are actually staying roughly constant, or at least they are at the moment. So this
graph on the y-axis shows you latency on a log scale in microseconds, and then for disks, flash,
and our fast non-volatile memories, we've broken down latencies into file system overheads,
operating system overheads. If we wanted to share these devices over a network with something
like iSCSI, that overhead's there. And then the red bar represents the actual hardware latency for
these devices.
And so as we go from disk drives, we have a situation where the hardware overheads are about
two orders of magnitude larger than the software overheads that we're experiencing, to
something like flash, where we're actually fairly well balanced. And when we get to fast non-volatile memories, the hardware latencies are actually about two orders of magnitude less than
the software overheads that we're experiencing. And so this works out to something like 4%
software overhead for disk, all the way up to about 97% software overhead for our fast non-volatile memories.
>>: That 97% has iSCSI in the denominator.
>> Adrian Caulfield: It does, but even if you remove the iSCSI, the software overheads are still
pretty high. And it's also a lot of the software limits the parallelism that we can get from these
devices, as well. The performance is more than just the latency numbers.
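To make those percentages concrete (my framing, not the speaker's): the software overhead fraction is just $t_{sw} / (t_{sw} + t_{hw})$. Holding the software time roughly constant while the hardware latency falls from milliseconds to a few microseconds is what drives that fraction from a few percent up toward the 97% quoted here.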
So to help us understand this problem a little bit better, we built a system called Moneta. The
sort of cartoon representation of the application stack looks like this. Applications run up at the
top in user space. We have a file system and the Linux IO stack in the middle, and the Moneta device driver
underneath that, which knows how to talk to our device and has some optimizations included. And then
this all runs on a sort of standard X86, 64-bit host machine, and we have a PCI Express
connection connecting this device to the host machine and several banks of non-volatile memory
technology inside Moneta itself.
>>: So Moneta is a piece of custom hardware you guys have?
>> Adrian Caulfield: Yes. It's all an FPGA-based prototype, and I'll go over what this looks
like. So the Moneta architecture runs on an FPGA board, and just to sort of explain this, I'll walk
through a read request as it sort of travels through the hardware. So, once it's issued by the
driver, the request is going to show up as a PIO write to a PCI Express register. It goes through a
virtualization component, which allows us to have the appearance of many independent
channels, so the applications can each talk to them. Once it's gone through that, we have a
permissions-checking block so that we can actually verify that applications that are issuing
requests are allowed to make the requests that they're generating. And from there, the request
will get placed into a queue. Once the space is available in our scoreboard, we'll allocate space
there and allocate space in our transfer buffers for the request, as well. We can track 64 in-flight
operations in the scoreboard at the same time. Since this is a read request, we'll send out a
message across our ring network to one or more of the memory controllers, and they'll send the
data back to the transfer buffers.
Once it's there, we'll issue a DMA request out to the host machine and the data will show up in
the DMA buffer allocated by the operating system and we can complete the request there. We'll
set a bit in the status registers, and we can issue an interrupt to notify the operating system that
the request has been completed. So it's a fairly straightforward architecture. We're trying to just
move requests through this as fast as possible and keep as many of the memory controllers as
busy as we can.
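As a rough host-side picture of that flow, here is a hypothetical sketch in C; it is not the actual Moneta driver, and the register layout, command encoding, and names are invented for illustration:

    /* Hypothetical sketch of issuing a read to a Moneta-like device.
     * Register offsets, command packing, and names are invented. */
    #include <stdint.h>

    #define MONETA_REG_COMMAND  0x00   /* PIO write here starts a request */
    #define MONETA_REG_STATUS   0x08   /* per-tag completion bits         */

    struct moneta_cmd {
        uint8_t  tag;        /* one of the 64 scoreboard slots            */
        uint64_t lba;        /* block address inside the device           */
        uint64_t dma_addr;   /* host DMA buffer allocated by the OS       */
        uint32_t len;        /* transfer length in bytes                  */
    };

    static void moneta_issue_read(volatile uint64_t *bar, struct moneta_cmd *c)
    {
        /* 1. PIO write of the command: the virtualization and
         *    permission-check blocks see it, scoreboard and transfer-buffer
         *    space is allocated, and the request fans out over the ring to
         *    the memory controllers. */
        bar[MONETA_REG_COMMAND / 8] = ((uint64_t)c->tag << 56) | c->lba;

        /* 2. The device DMAs the data into c->dma_addr, sets this tag's bit
         *    in the status register, and raises an interrupt; the interrupt
         *    handler (not shown) then completes the request. */
    }

The point of the sketch is only that, from the host's side, the whole hardware pipeline hides behind one register write and one interrupt.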
This whole design runs on top of the BEE3 FPGA board, and so this is actually designed in part by
Microsoft Research. It uses a PCI Express 1.1 8X connection to the host machine. This gives us
two gigabytes a second of bandwidth in each direction. The whole thing runs at 250 megahertz,
and we're actually using DDR2 memory to emulate the non-volatile memories that we're looking
at, and we can adjust the RAS and CAS delays and the precharge latencies that the memory
controllers are inserting to match the PCM latency projections that we're targeting. So we're
using projections from ISCA 2009, which said PCM latencies will be around 48 nanoseconds
for reads and 150 nanoseconds for writes. And this board actually has 10-gigabit Ethernet
connectivity, as well, and we'll use that to sort of look at how we can extend storage out onto the
network, as well, a little bit later.
So, as we've been developing Moneta, we've come up with sort of three principles to help us
guide the development of these hardware devices and keep software overheads manageable. So
the first is to reduce software IO overheads, we want to get rid of as much of the existing IO
stack as we can that's been optimized heavily for disk drives over the last four decades. And
then the second principle is to refactor critical software across the hardware and the operating
system and user space. So things like the file system that we can't get rid of, we want to split those
across all of the layers of our IO stack and put them where they can be executed most efficiently.
We also want to recycle existing components so that we can reduce the engineering cost, as well
as make it easier to adopt the systems that we're going to develop.
So this graph on the right gives you some idea of the latency reductions that we've been able to
achieve. I'll go over a few of these in a little bit more detail, starting with the reducing of
software IO overheads. So the first optimization that we made is actually to completely remove
the Linux IO scheduler component. So in Linux, you can have basically pluggable IO schedulers
that will do things like reordering requests as they're issued to disk drives to make accesses more
sequential. It turns out if you have very fast random-access storage, that's not really beneficial,
so you should remove that. So what we did first is set it to the NOOP scheduler, and this
essentially takes in a request and immediately issues it again.
But the problem is, even the NOOP scheduler puts all of these requests into a single queue, and
then they get issued by one thread sequentially to the driver. So you've got a huge roadblock to
parallelism that exists here. So, if we look at the graph on the right, you'll see a lot of graphs that
look quite similar to this throughout the talk. The y-axis is bandwidth in either megabytes a
second -- sometimes, it’s gigabytes a second, later. The x-axis is transfer size in kilobytes from
512 bytes up to about 512 kilobytes, and these are all random accesses. So in this case, we're
doing reads. The blue line represents the performance of our system with the NOOP scheduler
in place, and once we remove that IO scheduler and allow much greater levels of parallelism
going into our driver and issuing multiple requests at the same time from a number of threads,
our bandwidth obviously increases significantly, and that's what the red line represents. So we
can actually chop off about 10% of the actual latency of the IO access and increase our
bandwidth substantially with this optimization.
>>: What's actually -- what's underneath what's happening there that gives you that increase? Is
it because the device is capable of servicing multiple requests in parallel, or is it [inaudible]?
>> Adrian Caulfield: Yes, so that's part of it. So we can get increased parallelism at the device
level, and we're also able to essentially reduce the number of context switches that are happening
as we're issuing requests. So if I have eight threads running in my application and they're all
issuing IO requests, with the scheduler in place, each of those requests ends up inserting an
element into a queue, and then a single kernel-level thread will pull items off of that and issue
them through the driver. If you remove the scheduler, what ends up happening is that thread will
actually go into the kernel, and then the thread itself will call a function in the device driver and
hand off the request at that point. So we actually get a lot more threads in the kernel talking to
the device driver, able to actually issue these requests. So you need the parallelism at the device
level, but we also need to be able to get that somehow by allowing multiple application-level
threads to talk to it.
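In older, bio-based Linux drivers, one way to get this effect was to register a make_request function so that submitted bios bypass the request queue and scheduler and are handed to the driver in the context of the submitting thread. The sketch below is only illustrative; the exact kernel API has changed across versions, and the moneta_* names are hypothetical:

    /* Illustrative bio-based block driver that bypasses the Linux IO
     * scheduler, in the style of older (pre-blk-mq) kernels.
     * moneta_* names are hypothetical. */
    #include <linux/blkdev.h>
    #include <linux/bio.h>

    static void moneta_make_request(struct request_queue *q, struct bio *bio)
    {
        /* Runs in the context of the application thread that submitted the
         * IO: no scheduler queue, no single kernel-thread hand-off, so many
         * threads can be issuing requests to the device concurrently. */
        moneta_issue_bio(bio);   /* hypothetical: tag it, PIO write to device */
    }

    static int moneta_init_queue(struct moneta_dev *dev)
    {
        dev->queue = blk_alloc_queue(GFP_KERNEL);
        if (!dev->queue)
            return -ENOMEM;
        /* Route submitted bios straight to our function. */
        blk_queue_make_request(dev->queue, moneta_make_request);
        return 0;
    }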
>>: Is the dynamic if you have multiple cores, you're parallelizing the work of talking to the
device driver over cores?
>> Adrian Caulfield: Yes. That's certainly...
>>: In other words, if you had a single-core machine, would you see the same speedup?
>> Adrian Caulfield: You would see some of it. Certainly, the latency reduction is still going to
be there. When we do all of these latency measurements, it's obviously with one thread issuing a
single stream of accesses, and that's because we're not context switching between threads to issue
the request, so the application thread is actually the one that talks to the driver and issues the
request down to the hardware.
>>: How many concurrent access transactions can the device support?
>> Adrian Caulfield: In this part of the talk, there's basically 64 threads, so we issue tags. We
have 64 tags available. They're assigned in the driver to each request as it's issued, and the
hardware can actually track 64 outstanding requests at the same time, as well. Later on, when we
have a lot of applications talking to it independently, they each have their own set of 64 tags, and
so there's a queue in the hardware that will basically get filled up by those, and then we can track
64 in-flight ones at the same time.
Okay, so this is the first optimization that we can make. The next is actually selectively using
spin waiting for smaller requests. So what we've found is that with very fast storage devices, it
actually makes sense to hijack the thread that's issuing that request and sit in the kernel, spinning
and waiting for notification that the request is completed. For our device, this works out to be
about four kilobytes in size, so anything smaller than that, it's actually more expensive to switch
to another thread, try and start doing some work and the have to switch back to complete the
request at a later time. And so we've set it up so that, for small requests, we'll just spin in the
kernel. For larger requests, we'll actually allow context switches to happen so that more work
can get done.
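A minimal sketch of that completion policy, with hypothetical helper names (the 4 KB threshold is the one from the talk, and as discussed below the real driver also bounds the spin so it eventually gives up and sleeps):

    /* Spin for small requests, where the device answers faster than a
     * context switch costs; sleep for large ones. Hypothetical sketch. */
    #define MONETA_SPIN_THRESHOLD (4 * 1024)

    static int moneta_wait_for_completion(struct moneta_request *req)
    {
        if (req->len <= MONETA_SPIN_THRESHOLD) {
            /* Hijack the issuing thread and poll the per-tag status bit. */
            while (!moneta_tag_done(req->tag))
                cpu_relax();
            return 0;
        }
        /* Large request: allow a context switch, wake on the interrupt. */
        return wait_event_interruptible(req->waitq, moneta_tag_done(req->tag));
    }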
And so this change actually saves about five microseconds of latency, and there's a little bit of an
efficiency tradeoff here, because we are actually wasting CPU cycles waiting for a request to
complete. But we're also getting much better performance, and so maybe if you have one
application that's performance critical, you'd rather have the five microseconds than an extra
thread or something like that.
>>: I sort of lost track -- where does this five microseconds fit in? You said there was 97%
overhead from software. How much of that percentage does this account for?
>> Adrian Caulfield: I don't actually have it -- I don't have that broken down.
>>: Well, there's a picture of it.
>> Adrian Caulfield: Well, yes.
>>: In the base column, the software is only about 50%. The 97% was with a much bigger
denominator, and the dark blue bar between 15 and 20 is the wait bar, and it disappears. In the
highlighted column, the blue bar disappears.
>> Adrian Caulfield: Yes, so we've gone from about 18 microseconds or so of software latency
down to 10 or something like that, with all three of these optimizations together.
>>: How bad is the hit to the CPU?
>> Adrian Caulfield: So a lot of the time that we're spinning, we would have been just context
switching between different threads, so you sort of weren't going to get that much useful work
done, anyway. Certainly, for the very small requests, that's true, but as the requests get larger,
you sort of start to see this tradeoff. So the cutoff is around four kilobytes.
>>: You chose that cutoff to optimize for the latency, right? Not to optimize for the amount of
CPU overhead? Presumably the cutoff for CPU overhead would be smaller.
>> Adrian Caulfield: Yes. I mean, it depends on whether you're actually -- it takes about two
microseconds to do a context switch, so if you do that switch, you might get a microsecond of
useful work done in your application before you have to switch back. Maybe 20%.
>>: Okay.
>>: So in that wait block that disappeared, what are we waiting for there?
>> Adrian Caulfield: Basically, that's the amount of time that we're spending in the kernel
waiting for a request to finish.
>>: Okay, so the application thread waiting. So the idea is that instead of the application going
to sleep, if the request is small, the application thread comes down, issues it and expects that the
results will be back so quickly there's no point in even giving up the CPU.
>> Adrian Caulfield: Yes.
>>: Do you disable this optimization when the request queue is full? Because if you're going to
have to wait for not only the four kilobyte read, but also a whole bunch of...
>> Adrian Caulfield: So the actual implementation is not quite as efficient, but we'll basically
allow it to wait for as long as a four-kilobyte request would generally take, and then, at that
point, it'll actually go to sleep.
>>: So you do the rent-to-own kind of thing.
>> Adrian Caulfield: Yes, basically, it optimistically thinks it's going to be very quick, and then,
if it turns out that will not be the case, we'll go back and let some work happen.
>>: That's sort of the worst of both worlds though, right, because you've spun the CPU for a
while and then still did a double context switch.
>> Adrian Caulfield: Yes. Like I said, it's not the ideal case, and obviously we would...
>>: It would be better to try to model how long it's going to take based on how many requests
are queued and then make a decision.
>> Adrian Caulfield: Yes. So one of the challenges there is that you would actually have to sort
of keep a count of how many outstanding requests there are, and we've actually worked really
hard to make it so that we don't need any locks in the driver. We don't have state that's shared
across a bunch of threads because the cache misses actually take a significant amount of time.
Yes.
>>: I'm curious what this does to overall throughput.
>> Adrian Caulfield: Well, since you asked.
>>: I'm curious what happens when you push this slide at [inaudible].
>> Adrian Caulfield: That's right. So this is another graph with bandwidth on the y-axis and
access size on the x-axis. So the turquoise line at the top is our maximum PCI Express
throughput in one direction. Clearly, we're never going to quite hit two gigabytes because there's
overhead for PCI Express. But the blue line represents the baseline implementation, so we start
off with something about like 20 megabytes a second for very small accesses, and eventually, at
around 64-kilobyte requests, we can saturate our PCI Express bus. When we remove the IO
scheduler, we shift this curve to the left fairly significantly. And as we continue adding our
optimizations, this is removing a bunch of locking that happens in the kernel. So for requests
that are about four kilobytes or bigger, this has a significant impact, but for much smaller
requests, there's still a lot of overhead going through the kernel for accessing these data
structures, mostly because we're context switching. When we add in the spin waits, we can
actually recover a lot of that performance that we were losing.
And so we've gone from about 20 megabytes a second to 500 megabytes a second for very
small requests, and we get pretty good improvements for larger requests, as well, up to the point
where we saturate the PCI Express bus. So if we had a faster connection, we could actually sort
of continue these curves out to the right farther.
>>: So the cap there, your asymptote and your ideal, is the PCI Express.
>> Adrian Caulfield: Yes. That's PCI Express overhead, the cost of issuing DMA requests and
waiting for the responses from the memory controllers and things like that. There is a certain
amount of overhead with PCI Express connections anyway that you just can't get rid of. Okay,
so all of these accesses, we've done a pretty good job of removing a lot of the latency. We've
significantly increased our concurrency so we can get better performance, but all of these
accesses are to a raw device, so basically there's no file system sitting on top of a lot of these
accesses, and it turns out that file system performance is actually fairly critical. So we have a
system that looks like this, a bunch of applications running at the top, our kernel-level interfaces,
Moneta underneath. The file system actually hurts our performance quite significantly,
especially for writes, and this is because we're spending a lot of time updating metadata, and so
we go from something about 1.5 gigabytes a second down to about 200 megabytes a second for
write access, and this is with XFS on the Linux kernel.
And so we need to find a way of actually addressing this performance discrepancy. How can we
make our file systems as fast as the accesses that we can get to the raw device? So this is where
the next couple of principles that we have come in, in refactoring critical software across our
hardware, our operating system and the user space environment. And so in this case, we're going
to refactor the file system and the sharing and protection information that needs to exist to make
that work. We'll try really hard to not break compatibility with our applications, as well, so that
we don't have to go and rewrite a bunch of software to take advantage of these systems.
>>: Could you go back a step?
>> Adrian Caulfield: Yes.
>>: So your y-axis there, megabytes per second, is that megabytes per second of user data being
written into the file system, or is that megabytes per second of actual data going out to the storage
device, including the overhead that the file system is adding?
>> Adrian Caulfield: So this is user data. Yes. So we have a micro benchmark that essentially
is using the write system call to update file data. Okay, so let's look at refactoring and recycling.
We'll eliminate the file system and the operating system overheads. So we want to take a system
like this. We have applications, kernel level in the middle. The red box denotes trusted code, so
this is stuff that normally runs in the operating system and we're willing to say, okay, yes, it's
right. It has access to everything. And then Moneta lives at the bottom. And it turns out that
41% of the total latency and 73% of the remaining software latency goes to supporting
protection and sharing in our devices. So this is like the file system cost with the operating
system context switches and things like this.
So we're going to take this picture, and we're going to split it into a couple of pieces. We'll
maintain the application level at the top. We're going to shift some of the driver interfaces up
into the application. We'll split the trusted code into two pieces, so that the file system and the
IO stack live on the side, and the permissions checking block we're going to move down into the
hardware. And the key here is that we're separating the protection mechanism from the policy.
So on the left side, the kernel does both. It sets the policy and then it enforces it. On the right
side, the kernel is still going to maintain the policy and decide who has access to what, but the
permissions-checking block in the hardware is going to end up actually enforcing that policy.
This allows us to actually give the applications direct access to the hardware so that we can
eliminate operating system context switches and all of those costs. So to be able to do this, we
need four pieces. The first is a virtualized interface to Moneta so that every application can sort
of talk to their own device without having to compete with the other ones.
>>: I have a question.
>> Adrian Caulfield: Yes.
>>: This seems really orthogonal to non-real-time memories, right? This thing could have been
done even with classical storage.
>> Adrian Caulfield: Yes, it could, but the performance gains are basically not going to be as
useful. There's no -- if you're waiting seven milliseconds anyway, it doesn't really matter if you
spend some time in the front.
>>: Okay, got it.
>> Adrian Caulfield: So we need the user space library that's going to sort of take over some of
the functionality that we had in the kernel. We need our protection and enforcement block, and
then there are some operating system and application-level implications as to what changes and
what we need to be able to do to be able to actually make this system work.
So the first piece here is Moneta Direct virtualized interface, and the key here is that we're
virtualizing the interface and not the device. So all of the applications are seeing the same set of
blocks, but they each have a separate set of registers and tags and their own communications
channel with the device, essentially. And that channel contains a unique section of the address
space exposed by the device, which contains a set of control registers. They have their own set of
tags. I said earlier that we could track 64 tags per application. We need some way of signaling
interrupts back up to user space, and they have to have their own set of DMA buffers for
transferring data back and forth between the applications, as well. So Moneta Direct actually
supports 1,000 channels, and the point here is that we actually want all of the applications
running on your system to be able to take advantage of this, not just one or two specialized
applications. So it's really not a boutique interface.
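A hypothetical shape for the per-channel state, just to make the pieces concrete (only the 64 tags and the roughly 1,000 channels are numbers from the talk; everything else is invented):

    #define MONETA_TAGS_PER_CHANNEL 64
    #define MONETA_NUM_CHANNELS     1000

    struct moneta_channel {
        volatile void *regs;        /* private slice of the device's address
                                       space holding this channel's control
                                       registers                            */
        uint64_t       free_tags;   /* bitmap of the channel's 64 tags       */
        void          *dma_bufs;    /* DMA buffers pinned for this channel   */
        int            notify_fd;   /* user-space completion signaling, e.g.
                                       something eventfd-like (assumption)   */
    };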
So the second piece is adding this user space library on top of -- or underneath our applications,
and we're going to transparently intercept our file system calls, so if you would normally issue
read and write calls to a file descriptor, we’re going to intercept those and translate them into
direct writes to the registers in the device and do all of the copying of data in this user space
driver. So this library has to provide a couple of things that the operating system normally would
have. So from the file system, we need to be able to translate file offsets into actual physical
addresses that we can ask for data from the device, and we do that by retrieving and caching this
information from the operating system. So the first time you access a block of data, you still
have to make a system call, and during that time, we essentially say, okay, I want to access offset
100 in some file. It's going to return which physical blocks hold that data, and it's also going to
send a message down to the hardware that says this application has permission to access this set
of blocks.
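A rough user-space sketch of that path (all of the names here are illustrative helpers, not libMoneta's real API):

    #include <stdint.h>
    #include <sys/types.h>

    struct extent { uint64_t file_off, phys_block, len; };

    static ssize_t lib_pread(int fd, void *buf, size_t len, uint64_t off)
    {
        struct extent *e = extent_cache_lookup(fd, off);   /* hypothetical */
        if (!e) {
            /* One-time system call: translate this offset into physical
             * blocks and have the kernel install a permission entry for
             * that extent in the hardware. */
            e = moneta_map_extent(fd, off);                /* hypothetical */
            extent_cache_insert(fd, e);
        }
        /* Direct access: write this channel's command registers and DMA
         * into buf, without entering the kernel again. */
        return moneta_channel_read(e->phys_block + (off - e->file_off),
                                   buf, len);
    }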
>>: So can files be shared across applications, or can directories be shared across applications in
this mode?
>> Adrian Caulfield: Yes, so they can. It gets a little tricky if you have applications that are
using -- one application using this interface and one application using the operating system
interface, just because the block cache tends to annoyingly get in the way, and so we have some
coherency problems that you have to deal with, because writes will sit in that cache for 30
seconds before they actually show up on the device. Whereas applications using libMoneta see
only the data that's in the hardware, because there's no cache.
>>: Even if they're both using the same interface, if one of them writes, does the other get
notified somehow?
>> Adrian Caulfield: But it's not caching any of the actual file data, so if one of them writes, and
the other one issues a read immediately after it, it's going to see the new data.
>>: I can see that the file data is listed [at the] device, but it seems like the metadata -- like you
extend the file or something, that there might be an inode that changes, so how does that get
shared across applications?
>> Adrian Caulfield: So, for example, if you're extending a file, those applications, one
application is going to extend that file. The metadata is actually still managed by the kernel, so
you're going to make a system call to do that extension, and the other applications don't know
about whatever region was allocated, so when they asked for information about that file, they
were given information about what currently existed. And so if they want to then read that new
data, they have to ask the operating system to update the permissions entry and make that data
available to them.
So it does actually work, but if you're doing a lot of metadata-intensive updates, this is probably
not the right interface for that kind of workload. This works really well if you have files that
you're updating in place a lot or that have a lot of read and write traffic that update the file data
but not the metadata.
So from the file system, I said we need to be able to translate file offsets. We need to be able to
essentially implement POSIX compatibility as best as we can. We get pretty close to full POSIX
compatibility, but there are a bunch of interesting sort of rarely used features, like synchronizing
file pointers between processes, that we don't necessarily implement. And then we also need
some aspects of the driver to be able to talk to the device itself, and issuing complete requests.
So the third piece that we need to update here is to be able to actually enforce protection in the
hardware itself, and the key here is the file system is still setting the policy. Whenever an
application wants to access a file, it asks the operating system to update a permissions entry and
install it in the hardware so that that application will have permission to access it. The hardware
is really just caching all of these protection entries that the kernel is generating, and we do that
using a permissions table. This is an extents-based system. Every channel has its own set of
mappings, and at the moment, we share 16,000 entries across all the channels. We found this is
pretty good if you have one or more applications running. If you start running huge numbers of
applications, you might actually start running out of entries. If this was an ASIC, we could make
this memory a lot larger than we're able to with the FPGA-based design.
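A hypothetical layout for one entry in that table, to make the extent-based idea concrete (field names and widths are invented; only the roughly 16,000 shared entries figure is from the talk):

    struct moneta_perm_entry {
        uint16_t channel;      /* which application channel holds the grant */
        uint64_t start_block;  /* first block of the extent                 */
        uint64_t num_blocks;   /* length of the extent                      */
        uint8_t  rights;       /* e.g. read/write permission bits           */
    };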
Okay, so there's also a couple of operating system-level implications. With this system, we
haven't had to make any changes to the file system. We don't have any changes to the
applications, because we can just dynamically insert ourselves under the applications, but there
are some open questions. Your question hinted at what happens if you have a bunch of
applications running at the same time. It's fine as long as one of them isn't using the old interface
and one is using the new interface. And file fragmentation can actually be a bit of an issue for
this system, as well. Because we're using extents-based records, if you have a bunch of disjoint
sections of a file, they're going to use a lot more entries in our permissions table than one
contiguous region of memory would.
>>: It's fairly dense here. How is it that you have no changes to the file system, given that
you've taken the permission check out of the file system and put it into the hardware?
>> Adrian Caulfield: So what we've done is essentially -- there is an interface that most file
systems implement -- I mean, Linux, you can use it on any file system -- that will basically allow
you to query and translate a file offset into a region of physical addresses in the storage device.
And so what we've done is, basically, when you open a file through libMoneta, we query that
interface. We can get the range of blocks that are represented or that are storing that data, and
then you can just go and read and write that data without needing to communicate with the file
system anymore. So we've already done the permissions check to make sure that you have
access to open this file and to read and write from it, and then we've told the hardware that this
application now has access to this region of data.
>>: So there's a known interface by which you can query the state of the file system's permission
table, essentially?
>> Adrian Caulfield: Yes. It's more the actual layout of the data on the disk that we're interested
in. The permissions table is also part of that, but it's sort of the smaller piece.
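The interface being described sounds like Linux's block-mapping ioctls. As an illustration of that kind of query (not necessarily what libMoneta actually calls; FIEMAP is the newer, extent-based variant), the classic FIBMAP ioctl maps one logical file block to its on-device block number:

    /* Query a file's on-device layout with FIBMAP (typically needs root). */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    int main(int argc, char **argv)
    {
        int fd = open(argv[1], O_RDONLY);
        int blksz = 0, blk = 0;              /* logical block 0 of the file */

        ioctl(fd, FIGETBSZ, &blksz);         /* file system block size      */
        ioctl(fd, FIBMAP, &blk);             /* in: logical, out: physical  */
        printf("block size %d, logical block 0 -> physical block %d\n",
               blksz, blk);
        close(fd);
        return 0;
    }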
>>: The open itself is what checks the permissions, and then once the open has passed, then the
fact that you can query and get those extents back is what is authorizing [inaudible]. And then
this is coming back to what you said about mixing interfaces. You can't mix interfaces, because
if somebody is going through the file system interface, they might be changing the set of extents
that are involved in the file. That's just summarizing the previous.
>> Adrian Caulfield: Yes, essentially. I mean, the big problem with mixing the interface, it
actually varies depending on which file system you're using and whether it's actually relocating
data or something like that. But the big issue is that if I issue a write through the normal
interface, it's actually going to sit in the cache for a little while. And once it's been read once, it's
not actually going to go back to the disk and ask again, so you could have stale data sitting in the
cache.
>>: That's an issue with respect to consistency, but you also have an issue with respect to
security, which is a little scarier, right? I mean, if the other interface frees some blocks, then the
interface that's connected to libMoneta can now look at this extent that might get filled with new
data from another application that's not using the Moneta interface, and it shouldn't be seeing that
data.
>> Adrian Caulfield: Yes.
>>: Does the current system -- is that just a security flaw in the current design?
>> Adrian Caulfield: A little bit, and we actually have some future work that is looking at
actually using some caching, using the same interfaces for caching data and using Moneta as a
cache. So with that, we have a way of shooting down the permissions entries. I mean, certainly,
we could insert hooks in the kernel to remove the entries in the hardware before you move data
around, and that would solve this problem. It's not in the actual implementation for this work,
but it's certainly solvable. Yes.
>>: Would this work for something like a journaling file system, where the mapping of file
offset to a visible block is not consistent?
>> Adrian Caulfield: So as long as you're not moving those extents around underneath, yes. So
if another application is updating the data and moving it, writing into the journal, then those
extents are going to be moving and we're going to have to keep refreshing what permissions
entries are installed for the other applications. But once the data is on disk, it's not going to
move until the file system gets another update or you sort of write at the end of the log with the
same file offsets.
>>: Sorry. I wasn't talking about the permissions. I can see how this works for reading, that for
every offset there is a block dependent on this [inaudible]. But for writing, the translation of
where you're trying to write to the physical block is not something that you can predict in
advance, and so it doesn't seem like this would work with a journal.
>> Adrian Caulfield: So it doesn't maintain the journaling aspect of the file system, but you can
go and update the data in place, right? If you've already written that block of data, it has a
physical location on the disk, right?
>>: But the file system isn't going to expect you to be writing it there, though. It's going to be
expecting you to write in the journal.
>>: On the journal, are you talking about a log-structured file system, actually?
>>: Sorry, yes.
>>: So he's not talking about journaling in a conventional file system but a log-structured file
system. Communication is actually [inaudible].
>>: Would you be allowed to do that versus just the [inaudible].
>>: That's right, and where the data is located. So what Jeremy is saying is that in a log-structured file system, the data moves.
>> Adrian Caulfield: And so what I'm saying is, as long as -- right. We're basically breaking the
log-structured aspect of the file system. If you update the data through libMoneta, we can find
where that data is currently living in the device, and we can update it.
>>: I guess maybe I missed what your model is. In other words, are you creating a new file
system or are you creating a new interface to existing file systems?
>> Adrian Caulfield: We're creating a new interface to existing file systems.
>>: But if you're updating -- if you have a log-structured file system that expects a write to
happen at the end, if you just substitute it randomly into the middle, then if you later try to read it
using the traditional file system code that's sitting in the kernel, it's not going to know where to
find it, it seems.
>> Adrian Caulfield: No. It's going to be exactly where it would have looked for it before. As
long as it's not in the buffer cache already, it's going to go and read whatever the latest copy of
the data is. If you do a read in a log-structured file system...
>>: Can I give you an alternative answer?
>>: Yes.
>>: Log-structured file systems are there to reduce to seeks and write time.
>>: I understand that they're not just there -- we're trying to figure out if they're correct.
>>: Yes.
>>: If even in SSDs you have a reason to use log-structured file systems, can you avoid
overwrite in there? I mean, that's the portion I have been waiting to hear, if you are doing
anything on the [inaudible] mapping.
>> Adrian Caulfield: I mean, devices like Moneta don't actually need an FTL. The memories
have very simple wear management requirements, so we can actually get away without needing
to store a map for the whole device.
>>: Or is that because you assume that type of memory you are using is PCM, basically these
types of devices?
>> Adrian Caulfield: Yes. And so a system like this would work for flash, but the latency
difference is perhaps not enough to sort of require that we start moving towards different
interfaces to the storage.
>>: And I have other questions.
>> Adrian Caulfield: Yes.
>>: If you are talking about basically a PCM-based system -- but let's continue this.
>>: Before John pointed out that the question was moot because you wouldn't use a log-structured file system here, you were about to answer the question anyway, and I was about to
gain some understanding about how your system works. Can you just finish answering the
question of what would happen if you wrapped a log-structured file system with this?
>> Adrian Caulfield: Right, and so the short answer is, it will work, but you're sort of losing the
log-structured semantic of the file system. If you write data in a log-structured file system,
normally, it's going to change where that block is located on the disk and it's going to write it at
the end of the log and update a map, where it says, here is where the latest copy of this data is in
my journal, or in the log. And so when we do a read from that, you ask the file system, hey,
where's this data? It's going to tell you that it lives at the end of the journal, at location five, or
something like that. And so we can get that same information, and when we do a write, we can
go and update the journal itself -- perhaps not the last entry in the journal, but somewhere in the
journal this data exists, and it's the most recent copy, so we can go and update that data in place.
>>: Oh, okay. What I wanted to find out was what happens during a cleaning operation, but I
guess it just stops working.
>> Adrian Caulfield: Yes. So if you move the block of data, for those file systems...
>>: If you open a file and then the cleaner comes along and moves the blocks, you're hosed,
right?
>> Adrian Caulfield: Yes, so we don't cache the whole set of extents for a file as soon as we
open it. It's sort of on demand, so we ask for the extent that contains whatever offset you are
trying to access. And so what ends up happening in that case is you'll remove the permissions
entry for that extent, because the data isn't there anymore. The user space library will try and
perform that access and then it'll have to ask for permission again, and it will tell you where the
data is located. It could be made a little bit more efficient by having the right hooks in the file
system to essentially preemptively do that, but we don't do that at the moment.
Okay, so let's look at what the performance difference is, or what performance impacts we can
have from a system like Moneta Direct. Again, this is the same kind of graph, bandwidth on the y-axis, access size on the x-axis. The blue line represents the file system interface performance, so
if we're accessing data through the file system, performing a bunch of writes, then we get that.
The green line represents the raw device-level performance. This purple line represents the
performance of accessing our device -- a raw device -- through the user space level interface. So
you can think of this as a file system that has one extent that covers the entire device. And so
basically, we open the device and we can read and write to it. You have full access to it.
And what we'd like to see when we add the file system back on top of this is that the
performance will be as close to this purple line as possible. Yes.
>>: The user space one is doing better because the kernel space one is one where the user has
some data, gives it to the kernel and then the kernel gives it to the device?
>> Adrian Caulfield: Yes.
>>: And the user space one is just more directly through this?
>> Adrian Caulfield: Yes. So, basically, this is the cost of the context switch and the minimal
check of whether you have access to this raw block device. So there's a bit of performance improvement there.
We go from about 500 to 850 or 900 megabytes a second, and so we'd like to see the file system
level performance be as close to this as possible, as well. And so that's actually what we get. So,
basically, we've completely eliminated the file system overhead. There's a one-time cost when
you open a file to sort of read in the extents the first time you access that block of data, but after
that, we can just read and write to the file in place, and we don't have to go through the operating
system or the file system at all.
And so we've gone from the 1 million IOPS we had before for small accesses up to about 1.7, so
this is a pretty significant increase in performance. So another thing, we've worked pretty hard
to try and maintain the same interface to our file system so that you don't have to change your
applications at all, but sometimes it can actually be a little bit beneficial to go and rewrite your
applications to take advantage of new features that your storage devices have to offer. And so
libMoneta actually provides an asynchronous interface to the device, as well, so instead of
capturing read and write calls and intercepting them and translating them into accesses, we can
go and modify the applications to actually perform asynchronous operations. We'll start a
request, we'll go do some useful work, and then we can wait for it to finish later.
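A hypothetical shape for that asynchronous interface (the lm_aio_* names are illustrative, not libMoneta's actual API):

    struct lm_aio;   /* opaque handle for an in-flight request */

    /* Start an IO and return immediately; the data moves while the caller
     * keeps computing. */
    struct lm_aio *lm_aio_pread(int fd, void *buf, size_t len, uint64_t off);
    struct lm_aio *lm_aio_pwrite(int fd, const void *buf, size_t len,
                                 uint64_t off);

    /* Block until the given request has completed. */
    int lm_aio_wait(struct lm_aio *io);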
So this graph at the top, the red line represents the synchronous one-thread performance. So this
is one thread doing random accesses. The dark-blue line is one thread using our asynchronous
interface, doing random accesses, and the turquoise line is eight threads using the asynchronous
interface. And so we can get about a 3X performance improvement for 32-kilobyte requests
using our asynchronous interface. And we also changed the ADPCM decode stage from
MediaBench that's basically doing some audio processing to use this asynchronous interface so
we can be loading in the next block, processing it and writing out the previously processed block
at the same time. And we can get about a 1.4X gain doing that, as well, versus using our user
interface directly.
>>: What's it stand for, for ADPCM?
>> Adrian Caulfield: Sorry, what's the scale?
>>: What does it stand for, for the ADPCM? Basically, it's a DPCM recorder?
>> Adrian Caulfield: Yes. We're basically decoding PCM audio, I think.
>>: But why are you showing it here, basically? You are showing audio files for the ADPCM
format on the files, and then you try to read and decode it, send it to audio interface?
>> Adrian Caulfield: So we're basically converting it from the PCM encoding to another format
and writing it back to disk.
>>: And you are showing the speedup in terms of you [ph] can move this data.
>> Adrian Caulfield: Yes. Basically, so if you have a fixed-size file, say it's 100 megabytes,
you process it. It takes a certain amount of time with the synchronous interface, and with the
asynchronous one, we can do it 1.4 times faster. So, basically, we have one interface and we -- okay, sorry. We have this benchmark written using read and write system calls to essentially
load in a buffer, process it and then write that out and start processing the next one, and so that's
the baseline. And then the improved version does the reading of the next piece asynchronously,
and then it processes the currently loaded chunk and it will write out the previously processed
one at the same time, using asynchronous calls, as well.
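In code, the pipelined version of that loop might look like the following sketch, reusing the hypothetical lm_aio_* calls from above (adpcm_decode stands in for the real compute stage and is assumed to work in place):

    #include <stdlib.h>

    void decode_file(int in_fd, int out_fd, size_t chunk, size_t nchunks)
    {
        char *bufs[3];                        /* rotating buffers           */
        for (int i = 0; i < 3; i++)
            bufs[i] = malloc(chunk);

        struct lm_aio *rd = lm_aio_pread(in_fd, bufs[0], chunk, 0);
        struct lm_aio *wr = NULL;

        for (size_t i = 0; i < nchunks; i++) {
            lm_aio_wait(rd);                  /* chunk i is now in memory   */
            char *cur = bufs[i % 3];
            if (i + 1 < nchunks)              /* start loading chunk i+1    */
                rd = lm_aio_pread(in_fd, bufs[(i + 1) % 3], chunk,
                                  (i + 1) * chunk);
            adpcm_decode(cur, chunk);         /* process chunk i            */
            if (wr)
                lm_aio_wait(wr);              /* chunk i-1 finished writing */
            wr = lm_aio_pwrite(out_fd, cur, chunk, i * chunk);
        }
        if (wr)
            lm_aio_wait(wr);
        for (int i = 0; i < 3; i++)
            free(bufs[i]);
    }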
>>: What's the purpose here? You are demonstrating this benchmark. You are showing the
improvement using the asynchronous interface that you can basically issue a request and wait for
it -- basically just ping and wait for the data to come in?
>> Adrian Caulfield: Well, so basically, I'm trying to show that if we're willing to change the
interfaces that we write our applications with, we can actually get some fairly significant
performance improvements, as well. So libMoneta does a really good job of speeding up normal
read and write calls by translating those into accesses directly to the device, but if we're willing
to use an asynchronous interface, we can actually sort of start pre-fetching data before we need
to actually use it, and we can then do useful work while the data transfers are actually happening,
versus the sort of normal, synchronous interface.
>>: I saw it probably for your basically demo products, a database-like application for a
[inaudible] probably is most interesting. The main reason is this. All media applications, media
encoded and decoded applications, can be heavily pipelined. For instance, ADPCM audio
decoding can basically just read a bunch of ADPCM modules, decode, play it out. I mean, there
is no reason basically just as a small block in the makeup.
>> Adrian Caulfield: Sure. I mean, it's just a little example, basically, to say -- the existing
benchmark in MediaBench has done this as efficiently as possible using a synchronous interface.
It's already pipelined and it's doing the right things to get good performance, but if we change the
interface, we can actually do slightly better. Certainly, other applications are going to have
different benefits and costs and things like that.
>>: Is the performance improvement you're showing in your graph application specific?
>> Adrian Caulfield: So this is just -- again, it's a little micro-benchmark, so it probably depends
a lot on what processing you're actually doing, but basically, we can actually issue more IO
requests with one thread. And so if you have a thread that can generate a lot of IO before you
need it, then you can get this benefit.
>>: I'm a little confused how to square this result with the one from 12 slides ago, where you
said that waiting is a bad idea because the overhead of setting up the wait is longer than we
expect to wait for the request to come back. That suggested that we want to be synchronous and
not asynchronous, but here you're suggesting the opposite.
>> Adrian Caulfield: I don't think that they're incompatible. Basically, if we have a large
number of threads, it makes sense to allow some of those to wait for IO requests, but again, that's
in the synchronous, I'm going into the kernel and I'm paying a pretty heavy context switch cost to
essentially get into the kernel and issue my request. And so the penalty there is when you have
to context switch back to other threads, right, there's a large overhead.
In this case, we have one user space-level thread that's going to issue a bunch of requests and
potentially be able to do useful work, without ever paying any of those context switch costs. So
we're getting microseconds of time back here that we would have just otherwise been spending
switching into the kernel and coming back. It depends on what work you're trying to accomplish
while your accesses are happening, as well. Certainly, if you're reading tons and tons of data and
you know that you're going to need it a second before you actually do, you should do that
asynchronously and then do your work, and then the data will be ready by the time you're
actually ready to access it.
>>: One of those lines is [inaudible].
>> Adrian Caulfield: It's A cores [ph].
>>: A cores [ph], okay. I mean, you could also have workflows that use multiple threads instead
of asynchronous operations. If you did that, would you want to turn off the optimization from
the previous slide?
>> Adrian Caulfield: Of?
>>: Of waiting rather than [inaudible]?
>> Adrian Caulfield: It depends. A lot of these things depend on exactly what workload you're
running. So if you have a large number of threads, if each of those threads could actually be
issuing a number of IO requests at the same time, it makes sense to have as much work as
possible for the storage device to do in the queue, ready, already on the device. So if you can
actually generate those requests and then continue doing useful work, asynchronous interfaces
make sense. If you can't make any progress until the data comes back, you really want the
lowest latency possible, and at that point, spinning and saying -- so that you're ready as soon as
the data comes back is probably the better choice. I don't think it actually really matters how
many threads you have. For each thread, you sort of have to make that decision of do I want this
request to process as soon as the data is done, or am I okay waiting a little bit of extra time but
having 10 requests complete and the data available for those. That's kind of the tradeoff.
Okay, so this is sort of the optimizations for local storage. We've also looked at how we can
extend Moneta-like devices out onto the network and attach multiple devices together and form
distributed storage solutions. And so we've applied our three principles to distributed storage, as
well, and essentially we're looking at how we can reduce block transfer costs, how much extra
time does it take to actually go from one device to another one, things like Fibre Channel and
iSCSI are sort of the examples of what I'm talking about here.
We can also refactor other features into hardware, if you wanted to, like replication, things like
that, and we want to be able to make sure that we can continue using the existing shared-disk file
systems that are out there, continue allowing use of our user space access libraries and things like
that. Sorry, so basically we go from a situation like this, where we've done a very good job of
decreasing the latency for our local storage requests, but then we add in some sort of software
stack for sharing devices out over the network, like iSCSI, and we've basically gone back to
squandering our performance gains that our memory technologies have to offer, and so we want
to be able to do something about this. And so to study this, we essentially took a bunch of
Moneta Direct devices, connected them all to a network, and now we have a situation where there's
a large amount of storage distributed throughout our network, with a little bit of storage attached
to each node. So we have the opportunity to take advantage of data locality, as well, so if the
applications are aware of what data is stored on what nodes, we can take advantage of that at the
same time.
To build this device, we take our existing Moneta architecture and add a network interface to it.
And so now whenever data is in the transfer buffers, we can actually just generate a packet and
send that to another device or vice versa. Requests can come in over the network and we can
process them locally. And so we have a very low-latency interface to the network from within
our storage device. And we also get to take advantage of the permissions checking blocks and
the virtualization components that already exist, so we can actually do the permissions checks
before the requests go out over the network.
So the first thing we looked at here was reducing the block transport costs, and so these are sort
of the protocol and physical layer costs for accessing storage remotely, so this is independent of
what it costs to actually access the memory. It's independent of the operating system and the file
system overheads. It's essentially you take a disk and you attach a network interface to it, how
much extra time does it take to use that network interface versus attaching the disk locally? So
with QuickSAN, we are able to use raw Ethernet frames. We have extremely minimal protocol
overhead. We're essentially just sending the destination addresses and about eight bytes of data
with each request to signal where it comes from, and we're taking advantage of flow control at
the Ethernet level to make sure that we have a reliable network, so we don't have to deal with
retransmitting packets and things like this.
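As a rough picture of how small that protocol can be, here is a hypothetical frame layout; the talk only says the header is essentially a destination address plus about eight bytes, so everything else is invented for illustration:

    /* Hypothetical on-the-wire layout for a QuickSAN-style request carried
     * in a raw Ethernet frame. Not the actual QuickSAN format. */
    #include <stdint.h>

    struct quicksan_frame {
        uint8_t  dst_mac[6];   /* Ethernet destination: the remote SSD     */
        uint8_t  src_mac[6];   /* Ethernet source, for the reply           */
        uint16_t ethertype;    /* some private EtherType (assumption)      */
        uint64_t req_id;       /* ~8 bytes identifying the request/source  */
        uint8_t  opcode;       /* read or write                            */
        uint64_t block;        /* starting block on the remote device      */
        uint32_t len;          /* transfer length                          */
        uint8_t  payload[];    /* write data, or read data in the reply    */
    } __attribute__((packed));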
And so we can go from something like iSCSI, which has a latency of about 280 microseconds,
down to about 17 with QuickSAN. And for comparison, Fibre Channel adds about 86
microseconds of latency overhead, and this is doing all of the packet generation and processing
in hardware.
>>: Which side is running the permissions check?
>> Adrian Caulfield: So the local storage side in this case is actually handling the permissions.
So you have to trust the network in that you're basically on a...
>>: So local, you mean the guy who is issuing the read-write requests, not where the bits are
located.
>> Adrian Caulfield: Correct. And you could do it both ways. This way, if you don't have
permission, you find out much sooner, and if you have a shared disk anyway, all of the nodes see the
same set of permissions and metadata.
>>: I'm confused by the Ethernet flow control thing, because I remember, like, there's the
physical there, that's Ethernet flow control. It's typically [inaudible]. What's Ethernet flow
control?
>> Adrian Caulfield: So it's part of the 802.3 standard. Basically, there's a mechanism where NICs
and switches can send each other packets, like pause frames, that say don't send me
anything else for some number of microseconds or so. And so it's actually -- it's not as
commonly used, but other storage area network technologies do essentially the same thing. They
require reliable networks, and so they have some form of flow control built in.
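For reference, here is a rough sketch of the fields in an 802.3x pause frame. This is illustrative only, not code from the QuickSAN hardware; the layout reflects the standard as commonly described (MAC Control EtherType 0x8808, PAUSE opcode 0x0001, pause time in quanta of 512 bit times).

    #include <stdint.h>

    struct pause_frame {
        uint8_t  dst_mac[6];     /* reserved multicast address 01:80:C2:00:00:01 */
        uint8_t  src_mac[6];
        uint16_t ethertype;      /* 0x8808 = MAC Control */
        uint16_t opcode;         /* 0x0001 = PAUSE */
        uint16_t pause_time;     /* in quanta of 512 bit times; 0 resumes transmission */
        uint8_t  padding[42];    /* pad to the 64-byte minimum frame size (before FCS) */
    } __attribute__((packed));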
>>: There are reasons to drop packets other than full queues, but you said you weren't doing any
packet retransmission, so is that up at a higher layer, or does it happen there?
>> Adrian Caulfield: So in the systems that we've built, basically, we have a dedicated network,
and so we have not had to deal with any sort of packet loss in our layers. But, basically, we
could deal with it by essentially detecting if the request doesn't actually complete and reissuing it
at the software layer or something like that. But because we have sort of guarantees of how
much buffer space is available...
>>: But I was talking about packet losses not related to full queues at all, like a checksum that
fails or something.
>> Adrian Caulfield: Yes, like I say, we haven't run into that particular problem. I think the
right way to do it with this solution would be to actually have timeouts that say if I don't get a
response back in a certain amount of time, issue it again. But we were trying to minimize the ACKs
that we have to send around and the data we have to keep around, so that we can reuse the buffer
space for other requests.
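A minimal sketch of that timeout-and-reissue idea at the software layer follows; everything here is hypothetical (QuickSAN itself relies on link-level flow control and did not need this), including the timeout value, which is simply chosen to sit well above the measured 17-microsecond request latency.

    #include <stdint.h>
    #include <stdbool.h>
    #include <time.h>

    #define REQUEST_TIMEOUT_NS  (2 * 1000 * 1000)   /* ~2 ms, far above a ~17 us request */
    #define MAX_RETRIES         3

    struct pending_req {
        uint64_t issued_ns;       /* when the request last went out */
        int      retries;
        bool     completed;
    };

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    /* Called periodically; reissues any request that has gone quiet for too long. */
    void check_timeouts(struct pending_req *reqs, int n,
                        void (*reissue)(struct pending_req *))
    {
        uint64_t t = now_ns();
        for (int i = 0; i < n; i++) {
            if (reqs[i].completed)
                continue;
            if (t - reqs[i].issued_ns > REQUEST_TIMEOUT_NS &&
                reqs[i].retries < MAX_RETRIES) {
                reqs[i].retries++;
                reqs[i].issued_ns = t;
                reissue(&reqs[i]);     /* resend the original read or write */
            }
        }
    }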
>>: How do the other systems, the iSCSI and Fibre Channel, deal with that? Do they do
retransmission?
>> Adrian Caulfield: I'm not 100% sure on Fibre Channel, but iSCSI runs on top of TCP/IP, so
they've sort of got the reliable communications channel underneath.
>>: Can you comment on the work of [inaudible] RDMA or converged Ethernet, and basically
how IO works [inaudible]. Does that share some similarity with that work?
>> Adrian Caulfield: Yes. There are some similarities with RDMA. We're doing it a little bit
differently. It's remote direct storage access instead of remote direct memory access, so we're
not actually putting data into RAM on another machine. It's going through the SSDs and things.
The permissions-checking pieces are certainly not there in RDMA and things like that. And as
far as converged Ethernet goes, this is sort of moving in the same direction, right? We want to
be able to use networks for both normal Internet traffic, as well as storage traffic. One of the
pieces of that is having the ability to support flow control in part of that channel and things like
that. But because we have dedicated network interfaces to the storage, we're not really dealing
with other forms of traffic at the same time.
Okay, so let's see what the performance improvements we can get with devices like this are. So
we're comparing against an iSCSI stack here, and essentially, iSCSI performance is pretty poor
because of those really high software overheads that I talked about. Fibre Channel would be
somewhere in between these two numbers. Basically, because QuickSAN has this much lower
protocol overhead, we can get fairly significant performance improvements, as well. And we
also compare this against a centralized version of this configuration. So this is a more typical
SAN setup, where you have one large server in the middle that has a bunch of storage attached,
and then you have a number of clients that are sharing that storage. So we can model that by
having a bunch of Moneta boxes or QuickSAN SSDs connected to a single storage server, and
they're all connected to the network, as well. And then we have one or more client interfaces that
are using the same device, but we're not touching or sharing the storage attached to
that device.
And so if we look at this, we get about a 27X improvement compared to the 32 or so that we
were getting with the distributed case. And this is because some of the data was available locally
in the other configuration, and so we actually get a locality benefit. Some of the accesses that
you're performing don't actually have to go out over the network. iSCSI is pretty bad in both
cases, and that's because of the software overheads. There's not really a huge benefit to moving
the data out into the network, at least in terms of raw throughput for completely random
accesses.
So if we look at a workload running on top of this, we've implemented a distributed sort, an
external sort, so this is often used by systems like MapReduce that process large data sets, and this
is based on TritonSort, which was developed at UCSD by George Porter. And, basically,
distributed sorts work in two phases. The first step partitions your data, which starts out
completely random on all of your nodes. We're going to partition that into ranges of
key values stored on each node. So if you have four nodes, we split the key space into four
ranges and move the relevant pieces onto the storage devices
that handle that part of the key space.
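As a sketch of what that partitioning step looks like, here is a minimal version in C; the record format and the write_to_node() helper are hypothetical stand-ins for a QuickSAN-style local or remote write, and the key-to-node mapping simply splits a uniform 64-bit key space into equal ranges.

    #include <stdint.h>
    #include <stddef.h>

    #define NUM_NODES 4            /* the four-node example from above */

    struct record {
        uint64_t key;
        uint8_t  payload[92];      /* e.g. 100-byte records, common in sort benchmarks */
    };

    /* With a uniform 64-bit key space split into equal ranges, the owning node
     * is determined by which slice of the key space the key falls into. */
    static int owner_of(uint64_t key)
    {
        return (int)(key / (UINT64_MAX / NUM_NODES + 1));
    }

    /* Streams each record to the storage that owns its key range; writes whose
     * destination is the local node never need to cross the network. */
    void partition(const struct record *recs, size_t n,
                   void (*write_to_node)(int node, const struct record *r))
    {
        for (size_t i = 0; i < n; i++)
            write_to_node(owner_of(recs[i].key), &recs[i]);
    }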
So that part uses the storage area network significantly, because we're essentially writing data
from every node to every other node. Once the partitioning is done, we can sort the data, and
this happens locally on each of the nodes, so wherever that range of the key space is, we end up
with a sorted range of keys. So QuickSAN can actually leverage the nonuniform access
latencies, because we have some of this storage distributed out into the network, and this gives us
some pretty significant performance improvements. So comparing iSCSI to QuickSAN and the
centralized and distributed versions, so this is the time to sort 102 gigabytes of data spread
across, I believe, eight nodes. So it takes about 850 seconds to do this sort with iSCSI in the
centralized form. We get about a 2X speedup from using the distributed version, and so in this case iSCSI is
actually benefiting from having a lot of the data locally, as well. And QuickSAN does about
three times better than the iSCSI implementation on this particular workload, and again there's about a 2X
difference between the centralized and the distributed cases.
>>: How much purchase [ph] are you using and what are the Ethernet interfaces in these?
>> Adrian Caulfield: Yes, so the storage, the Ethernet interface is 10-gig Ethernet. They're
Nehalem machines, I believe, eight Nehalem machines.
>>: It seems -- I guess this is a follow up to Jim's question about your audio encoder. This
seems like a really bad application benchmark, because this is a classic big data application
where you know exactly what data you're going to read. You don't have to issue small writes or
reads. You can issue huge ones and you can do them asynchronously all the way out to the end of
the file the moment the app starts. There's no -- in other words, chasing after latency and trying
to reduce the latency of a request only matters if you're blocked until the request finishes, but
that's not the case at all with storage. With storage, you know exactly what you're going to be
reading for quite some time, so there's no point in doing small reads and writes. You may as
well do the large ones.
>> Adrian Caulfield: That's true, and that's one of the reasons that the iSCSI implementation is
actually not that much slower than the QuickSAN one, right, because we're actually able to do these
large data transfers. We've looked at other applications, as well, and clearly, if you're doing
large random accesses or small random accesses all the time, a device like QuickSAN is going to
be much better than the equivalent iSCSI.
>>: Right. I'm surprised that iSCSI is slower at all, because as long as you have enough requests
outstanding, the network is always busy, or the underlying device is always busy, whichever is
slower, and it doesn't really matter what the latency of the underlying requests are. It just matters
-- all you care about is the utilization, the throughput, and not the latency in the underlying
requests.
>> Adrian Caulfield: Yes.
>>: So is it that iSCSI for some reason is underutilizing the channel there, or what's happening?
>> Adrian Caulfield: It's possible. We're reading at least a megabyte at a time in this workload,
and so the overhead that iSCSI has is actually just a lot of the software stack that exists. So even
though we're using it in an ideal case for this workload, there's still...
>>: When the overhead is high, are you saying then that the whole thing is CPU bound, or that
you're spending much of your time processing iSCSI overhead on a CPU?
>> Adrian Caulfield: I'm not actually sure what the breakdown was in CPU usage for this. I
imagine that there is a fair amount of overhead, but basically, because you're sort of doing a lot
of operating system context switches -- the iSCSI stack is actually pretty involved. If you look at
how these devices work, if you trace the flow through...
>>: But that shouldn't matter unless the CPU is the bottleneck resource. In other words, you can
do as many context switches as you like if what you're waiting for is for big, giant, bulk transfers
to come over the network or something. So I'm trying to understand if you're making an
argument here that the bottleneck resource was the CPU, and so reducing CPU overhead sped it
up, or is something else happening?
>> Adrian Caulfield: I think we had available CPU cores, so I don't know that the actual amount of
processing power was a problem. It's possible that the one thread or several threads that are
doing these transactions are actually not able to get through all of those layers of protocol
overhead fast enough. I mean, I imagine that's probably the case, but basically, when you issue
an IO request or an iSCSI request, it goes down into the kernel. It's like a virtual block device,
and then there's a user space iSCSI daemon that runs, so you end up going back into user space,
and then you go back down.
>>: So let's say that takes a second, so just do a lot of them.
>> Adrian Caulfield: Sure. But in that case, then, we clearly don't have enough CPUs to have a
second of latency absorbed by that many threads. So that's probably where...
>>: So if it's literally a second a core, I agree, but you said you're not CPU bound, which
suggests you can keep launching threads and do more in parallel.
>>: I think the short version of Jeremy's comment is that sort is a heavily parallelizable application.
Considering the time that...
>> Adrian Caulfield: Okay. So let me conclude the Moneta section here. Essentially, emerging
non-volatile memory technologies offer huge potential for extracting more knowledge out of the
data that we've been able to acquire over the last few years, but it's going to require better system
design to be able to do this effectively. And the key here is we have to avoid squandering a lot
of the performance that these technologies offer. And software overheads are
becoming the critical piece in that puzzle at the moment. So I've shown you these three
principles, reducing software overheads, refactoring pieces across all of the system stack, and
recycling pieces of that stack that we can, like not rewriting our file systems and things like that.
So we've been able to reduce overheads by about 84%. We can get 30X higher 512-byte IOPS.
We're going from 20 megabytes a second to 500 megabytes a second for those transfers. And I've
shown you that distributing storage out into the SAN is perhaps a good idea, and that we can
build these very low-latency interfaces to network storage, as well. So let me just give you a
quick overview of the timeline of the Moneta work. So in 2010, we had our first prototypes. We
had papers in Supercomputing and Micro. I was the lead student on this project and basically
designed all of the infrastructure, built about 60% of the hardware and all of the software stack to
make the system work.
In 2011, we released Onyx, which is an actual phase-change memory SSD prototype built on top
of Moneta, so one of the students in our lab built a DIMM with a bunch of phase-change memory
on it and we dropped those into the BEE3 boards. Performance is not that close
to the projected PCM latencies, but it's first-generation PCM. 2012 was the Moneta Direct work,
so we're doing direct user space access at that point, and then this year, at ISCA, we'll be
showing off QuickSAN.
I've also worked on a couple of other projects at UCSD, so we had one of the first sort of wimpy
data-centric computing nodes in Gordon. This was actually published at ASPLOS in 2009 and
selected for Micro Top Picks, as well.
I've looked a little bit at NV-Heaps, which is looking at how we can attach non-volatile
memories to the DRAM bus, and essentially how we implement a transactional memory
interface to sort of protect those technologies and make them accessible and easier to program.
And I've also built several hardware platforms for doing flash memory characterization and SSD
research, as well. But that was back in 2010 or so. So I'd also like to look a little bit at a
couple of future directions in which we can move storage research. So one of these is whole-system
optimization, the Moneta-style work. I'd also like to look at how we can make compilers smarter
at actually interfacing with storage and what we can do to get more optimizations automatically
by adding features to the compilers. Also, I think there's a lot more to be done looking at storage
and networking.
The first piece here is system-wide optimization. So Moneta-style research, where we're
building hardware and tuning all of the layers of the software stack on top of that, is the kind of
thing that I like to work on. And I think as fast storage systems become more available, it's
going to drive changes in other areas of the system stack as well, things other than storage.
And I think this style of work is a good fit for looking at a number of problems.
I'd also like to look at storage-aware compilers. So I think that this cartoon at the bottom gives
you an idea of what I'm talking about. At the moment, when you write an application and
compile it, you use read and write system calls, and those are essentially black boxes that the
compiler can't reason about and can't do anything with. And storage actually has some pretty
well-defined semantics, and so we should be able to promote storage up into the realm of things
that compilers can reason about. If we have a well-defined interface like libMoneta, we can
actually start having compilers automatically convert calls from synchronous interfaces to
asynchronous ones or to eliminate redundant read and write calls to the operating system and
things like that.
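To give a flavor of the kind of transformation that means, here is a sketch of a blocking read being turned into an issue-then-wait pattern. The moneta_read_async() and moneta_wait() calls are hypothetical stand-ins for a libMoneta-style asynchronous interface, not a real API, and the helper functions are declared only so the sketch is self-contained.

    #include <stddef.h>
    #include <unistd.h>                    /* read() */

    /* Hypothetical libMoneta-style asynchronous interface (illustrative only): */
    typedef struct moneta_req *moneta_req_t;
    extern moneta_req_t moneta_read_async(int fd, void *buf, size_t len);
    extern void         moneta_wait(moneta_req_t req);

    /* Application-side work, declared just to keep the sketch self-contained: */
    extern void do_independent_work(void);          /* does not touch buf */
    extern void consume(const char *buf, size_t len);

    /* What the programmer writes: the read blocks for the full storage latency. */
    void process_sync(int fd, char *buf, size_t len)
    {
        (void)read(fd, buf, len);
        do_independent_work();
        consume(buf, len);
    }

    /* What a storage-aware compiler could emit, once it can see that
     * do_independent_work() does not depend on buf: issue the read early and
     * block only at the first real use of the data. */
    void process_async(int fd, char *buf, size_t len)
    {
        moneta_req_t req = moneta_read_async(fd, buf, len);
        do_independent_work();              /* overlaps with the storage access */
        moneta_wait(req);
        consume(buf, len);
    }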
And I'd also like to look more at storage and networking. So QuickSAN is really taking the first
couple of steps at how do we attach really low-latency storage to networks and what are the right
interfaces for those? But I think there are a lot of broader system design questions. We're not
really looking at what synchronization primitives we should be supporting for distributed file systems
to keep things in sync. We're still using the same locks that we had before. Things like
replication consistency concerns are problems, as well, and I think we need better interfaces and
protocols between nodes to really be able to take advantage of these technologies. So thank you,
and I'd be happy to take any more questions that you guys have.
[applause]
>>: Any quick questions?
>>: Where does the name come from?
>> Adrian Caulfield: Moneta? So it's actually the Roman god of storage or something, or
goddess of storage or something. If you look on the NVSL webpage, there's a picture of a coin or
something that somebody has that has Moneta on it.
[applause]