>> Chris Hawblitzel: All right, so it's my pleasure to welcome Adrian Caulfield. Adrian got his
bachelor's degree here in Seattle at the UW, and he's getting his PhD from UC San Diego,
advised by Steven Swanson, and he'll be talking about designing or redesigning storage systems
for fast non-volatile memories.
>> Adrian Caulfield: Thank you for the introduction, and thank you for inviting me up to
interview here at MSR. So over the last few years, in the Non-volatile Systems Lab, we've been
working on integrating emerging non-volatile memory technologies like phase-change and spin-torque transfer memories into systems.
And what we found is that as these technologies drive latencies from seven milliseconds or so,
like we would have with disk drives, down to a couple of microseconds, the software overheads
that we experience are actually skyrocketing. So we go from about 1% software overhead going
through the kernel stack all the way up to about 97% with these very fast non-volatile memories.
And so I'd like to first set the stage a little bit with some sort of background on where storage
systems are going, why this is an interesting problem, and then I'll walk through a couple of
different iterations of the prototype SSD storage system, Moneta, that we've created. And then
I'll talk a little bit about what direction I think some storage research should be going and some
future ideas that we could look at.
So we're really living in the data age now. The world's collecting data at an astounding rate. I'll
just give you an example of how this is going. In 2008, we processed about nine zettabytes of
data. So this is 10 to the 21st. It's a phenomenally large number, and we have huge scientific
applications that are generating data at very fast rates. The Large Hadron Collider, for example,
generates terabytes of data with every experiment they run. Large astronomical surveys are
doing nightly sky surveys that generate petabytes of data, and we have to be able to process this
information.
Websites like Bing, Google, YouTube, Facebook, are all collecting lots of user-generated
content, as well as large indexes of the web, and so we really need to be able to start extracting a
lot of knowledge from this data that we're collecting, and it turns out that storage performance is
one of the major bottlenecks holding us back from being able to do this. But new storage
technologies like phase-change memory and spin-torque transfer memories can actually help
solve this problem, as long as we're careful to not squander the performance that they offer.
So if we look at the trends in storage technologies over the last couple of years, starting with
hard disk drives -- these numbers are for an array of four hard disks, but we can get latencies of
around seven milliseconds. Random access bandwidth reading four kilobytes of data at a time
gives us bandwidth of around 2.5 megabytes a second. This was sort of the case for the last four
decades or so, up until about 2007, which saw the introduction of flash-based PCI Express SSDs.
And these devices significantly decreased latencies to around 58 microseconds. They've
increased bandwidth significantly, to about 250 megabytes a second, so this is about a 100X
improvement overnight from what we had with hard disk drives.
And if we continue down this road, devices like I'll talk about today, which might be
commercially available around 2016, have latencies of around 11 microseconds. Bandwidth
goes up to about 1.7 gigabytes a second, mostly constrained by the interconnect that we're using.
And so we can get 650X improvements for both of these, latency and bandwidth.
>>: Just a quick one. So PCIe flash, is that the same as an SSD?
>> Adrian Caulfield: Yes, so this is an SSD attached to the PCI Express bus, so think Fusion-io
or something like that.
>>: This feels a little misleading to me, because if you're really doing big data, you're not
reading it 4K at a time, and on sequential reads and writes, there's only about a 2X difference
between PCIe and a hard drive.
>> Adrian Caulfield: Sure. So for sequential accesses, the gap is smaller, but these new
technologies are still going to give us significant improvements. For random accesses, which a
lot of applications require, especially if you have very large data sets that you need to sort of
query and pull various bits of information out of, these numbers are certainly things that we've
measured in the labs with workloads that we have. So, depending on your workload, they're
going to change.
So if we do the math here, between 2007 and 2016, it works out to about 2X a year in terms of
performance improvements for both latency and bandwidth, and so this kind of scaling is
actually better than what we've seen with Moore's law with CPU performance improvements
during its peak. So the types of memories that I'm talking about are faster-than-flash non-volatile memories. These are devices that have interfaces that are as fast as DRAM, or nearly so,
maybe with a factor of two or three off.
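As a quick sanity check on that rate (my own back-of-the-envelope, not a figure from the talk): a 650X improvement spread over the nine years from 2007 to 2016 corresponds to $650^{1/9} \approx 2.05$, i.e. roughly a doubling every year.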
>>: Can you go back one slide? This all seems great. There must be some downside to all this,
right?
>> Adrian Caulfield: Yes.
>>: What is it?
>> Adrian Caulfield: So I'm going to get to that, but one of the big downsides here is that
software overheads are actually going to limit the performance that we can get from these
devices, unless we do something about it. And so I'm going to walk through some of the ways
that we've been able to tackle that problem with our prototype SSDs.
So the memory technologies that I'm looking at are things that are as fast as DRAM. They're as
dense as flash memory, or they will be soon. They're non-volatile, they're reliable, and they have
fairly simple management requirements. We don't need a large, thick management layer like you
would with flash memory or something like that. So phase-change memory, spin-torque
MRAMs and the memristor are all pretty good examples of the kinds of memory technologies
that we're looking at, and at least one of these is going to be commercially available within a few
years at the performance characteristics that we're looking at.
But the challenge here is that the relative cost of the software overheads that we have on top of
these devices are actually staying roughly constant, or at least they are at the moment. So this
graph on the y-axis shows you latency on a log scale in microseconds, and then for disks, flash,
and our fast non-volatile memories, we've broken down latencies into file system overheads,
operating system overheads. If we wanted to share these devices over a network with something
like iSCSI, that overhead's there. And then the red bar represents the actual hardware latency for
these devices.
And so as we go from disk drives, we have a situation where the hardware overheads are about
two orders of magnitude larger than the software overheads that we're experiencing, to
something like flash, where we're actually fairly well balanced. And when we get to fast non-volatile memories, the hardware latencies are actually about two orders of magnitude less than
the software overheads that we're experiencing. And so this works out to something like 4%
software overhead for disk, all the way up to about 97% software overhead for our fast non-volatile memories.
>>: That 97% has iSCSI in the denominator.
>> Adrian Caulfield: It does, but even if you remove the iSCSI, the software overheads are still
pretty high. And it's also a lot of the software limits the parallelism that we can get from these
devices, as well. The performance is more than just the latency numbers.
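To make those percentages concrete (my framing, not the speaker's): the software overhead fraction is just $t_{sw} / (t_{sw} + t_{hw})$. Holding the software time roughly constant while the hardware latency falls from milliseconds to a few microseconds is what drives that fraction from a few percent up toward the 97% quoted here.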
So to help us understand this problem a little bit better, we built a system called Moneta. The
sort of cartoon representation of the application stack looks like this. Applications run up at the
top in user space. We have a file system and the Linux IO stack in the middle, and the Moneta device driver
underneath that, which knows how to talk to our device and has some optimizations included. And then
this all runs on a sort of standard X86, 64-bit host machine, and we have a PCI Express
connection connecting this device to the host machine and several banks of non-volatile memory
technology inside Moneta itself.
>>: So Moneta is a piece of custom hardware you guys have?
>> Adrian Caulfield: Yes. It's all an FPGA-based prototype, and I'll go over what this looks
like. So the Moneta architecture runs on an FPGA board, and just to sort of explain this, I'll walk
through a read request as it sort of travels through the hardware. So, once it's issued by the
driver, the request is going to show up as a PIO write to a PCI Express register. It goes through a
virtualization component, which allows us to have the appearance of many independent
channels, so the applications can each talk to them. Once it's gone through that, we have a
permissions-checking block so that we can actually verify that applications that are issuing
requests are allowed to make the requests that they're generating. And from there, the request
will get placed into a queue. Once the space is available in our scoreboard, we'll allocate space
there and allocate space in our transfer buffers for the request, as well. We can track 64 in-flight
operations in the scoreboard at the same time. Since this is a read request, we'll send out a
message across our ring network to one or more of the memory controllers, and they'll send the
data back to the transfer buffers.
Once it's there, we'll issue a DMA request out to the host machine and the data will show up in
the DMA buffer allocated by the operating system and we can complete the request there. We'll
set a bit in the status registers, and we can issue an interrupt to notify the operating system that
the request has been completed. So it's a fairly straightforward architecture. We're trying to just
move requests through this as fast as possible and keep as many of the memory controllers as
busy as we can.
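As a rough host-side picture of that flow, here is a hypothetical sketch in C; it is not the actual Moneta driver, and the register layout, command encoding, and names are invented for illustration:

    /* Hypothetical sketch of issuing a read to a Moneta-like device.
     * Register offsets, command packing, and names are invented. */
    #include <stdint.h>

    #define MONETA_REG_COMMAND  0x00   /* PIO write here starts a request */
    #define MONETA_REG_STATUS   0x08   /* per-tag completion bits         */

    struct moneta_cmd {
        uint8_t  tag;        /* one of the 64 scoreboard slots            */
        uint64_t lba;        /* block address inside the device           */
        uint64_t dma_addr;   /* host DMA buffer allocated by the OS       */
        uint32_t len;        /* transfer length in bytes                  */
    };

    static void moneta_issue_read(volatile uint64_t *bar, struct moneta_cmd *c)
    {
        /* 1. PIO write of the command: the virtualization and
         *    permission-check blocks see it, scoreboard and transfer-buffer
         *    space is allocated, and the request fans out over the ring to
         *    the memory controllers. */
        bar[MONETA_REG_COMMAND / 8] = ((uint64_t)c->tag << 56) | c->lba;

        /* 2. The device DMAs the data into c->dma_addr, sets this tag's bit
         *    in the status register, and raises an interrupt; the interrupt
         *    handler (not shown) then completes the request. */
    }

The point of the sketch is only that, from the host's side, the whole hardware pipeline hides behind one register write and one interrupt.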
This whole design runs on top of the BEE3 FPGA board, and so this is actually designed in part by
Microsoft Research. It uses a PCI Express 1.1 8X connection to the host machine. This gives us
two gigabytes a second of bandwidth in each direction. The whole thing runs at 250 megahertz,
and we're actually using DDR2 memory to emulate the non-volatile memories that we're looking
at, and we can adjust the RAS and CAS delays and the precharge latencies that the memory
controllers are inserting to match the PCM latency projections that we're targeting. So we're
using projections from ISCA 2009, which said PCM latencies will be around 48 nanoseconds
for reads and 150 nanoseconds for writes. And this board actually has 10-gigabit Ethernet
connectivity, as well, and we'll use that to sort of look at how we can extend storage out onto the
network, as well, a little bit later.
So, as we've been developing Moneta, we've come up with sort of three principles to help us
guide the development of these hardware devices and keep software overheads manageable. So
the first is to reduce software IO overheads, we want to get rid of as much of the existing IO
stack as we can that's been optimized heavily for disk drives over the last four decades. And
then the second principle is to refactor critical software across the hardware and the operating
system and user space. So things like the file system that we can't get rid of, we want to split those
across all of the layers of our IO stack and put them where they can be executed most efficiently.
We also want to recycle existing components so that we can reduce the engineering cost, as well
as make it easier to adopt the systems that we're going to develop.
So this graph on the right gives you some idea of the latency reductions that we've been able to
achieve. I'll go over a few of these in a little bit more detail, starting with the reducing of
software IO overheads. So the first optimization that we made is actually to completely remove
the Linux IO scheduler component. So in Linux, you can have basically pluggable IO schedulers
that will do things like reordering requests as they're issued to disk drives to make accesses more
sequential. It turns out if you have very fast random-access storage, that's not really beneficial,
so you should remove that. So what we did first is set it to the NOOP scheduler, and this
essentially takes in a request and immediately issues it again.
But the problem is, even the NOOP scheduler puts all of these requests into a single queue, and
then they get issued by one thread sequentially to the driver. So you've got a huge roadblock to
parallelism that exists here. So, if we look at the graph on the right, you'll see a lot of graphs that
look quite similar to this throughout the talk. The y-axis is bandwidth in either megabytes a
second -- sometimes, it’s gigabytes a second, later. The x-axis is transfer size in kilobytes from
512 bytes up to about 512 kilobytes, and these are all random accesses. So in this case, we're
doing reads. The blue line represents the performance of our system with the NOOP scheduler
in place, and once we remove that IO scheduler and allow much greater levels of parallelism
going into our driver and issuing multiple requests at the same time from a number of threads,
our bandwidth obviously increases significantly, and that's what the red line represents. So we
can actually chop off about 10% of the actual latency of the IO access and increase our
bandwidth substantially with this optimization.
>>: What's actually -- what's underneath what's happening there that gives you that increase? Is
it because the device is capable of servicing multiple requests in parallel, or is it [inaudible]?
>> Adrian Caulfield: Yes, so that's part of it. So we can get increased parallelism at the device
level, and we're also able to essentially reduce the number of context switches that are happening
as we're issuing requests. So if I have eight threads running in my application and they're all
issuing IO requests, with the scheduler in place, each of those requests ends up inserting an
element into a queue, and then a single kernel-level thread will pull items off of that and issue
them through the driver. If you remove the scheduler, what ends up happening is that thread will
actually go into the kernel, and then the thread itself will call a function in the device driver and
hand off the request at that point. So we actually get a lot more threads in the kernel talking to
the device driver, able to actually issue these requests. So you need the parallelism at the device
level, but we also need to be able to get that somehow by allowing multiple application-level
threads to talk to it.
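In older, bio-based Linux drivers, one way to get this effect was to register a make_request function so that submitted bios bypass the request queue and scheduler and are handed to the driver in the context of the submitting thread. The sketch below is only illustrative; the exact kernel API has changed across versions, and the moneta_* names are hypothetical:

    /* Illustrative bio-based block driver that bypasses the Linux IO
     * scheduler, in the style of older (pre-blk-mq) kernels.
     * moneta_* names are hypothetical. */
    #include <linux/blkdev.h>
    #include <linux/bio.h>

    static void moneta_make_request(struct request_queue *q, struct bio *bio)
    {
        /* Runs in the context of the application thread that submitted the
         * IO: no scheduler queue, no single kernel-thread hand-off, so many
         * threads can be issuing requests to the device concurrently. */
        moneta_issue_bio(bio);   /* hypothetical: tag it, PIO write to device */
    }

    static int moneta_init_queue(struct moneta_dev *dev)
    {
        dev->queue = blk_alloc_queue(GFP_KERNEL);
        if (!dev->queue)
            return -ENOMEM;
        /* Route submitted bios straight to our function. */
        blk_queue_make_request(dev->queue, moneta_make_request);
        return 0;
    }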
>>: Is the dynamic if you have multiple cores, you're parallelizing the work of talking to the
device driver over cores?
>> Adrian Caulfield: Yes. That's certainly...
>>: In other words, if you had a single-core machine, would you see the same speedup?
>> Adrian Caulfield: You would see some of it. Certainly, the latency reduction is still going to
be there. When we do all of these latency measurements, it's obviously with one thread issuing a
single stream of accesses, and that's because we're not context switching between threads to issue
the request, so the application thread is actually the one that talks to the driver and issues the
request down to the hardware.
>>: How many concurrent access transactions can the device support?
>> Adrian Caulfield: In this part of the talk, there's basically 64 threads, so we issue tags. We
have 64 tags available. They're assigned in the driver to each request as it's issued, and the
hardware can actually track 64 outstanding requests at the same time, as well. Later on, when we
have a lot of applications talking to it independently, they each have their own set of 64 tags, and
so there's a queue in the hardware that will basically get filled up by those, and then we can track
64 in-flight ones at the same time.
Okay, so this is the first optimization that we can make. The next is actually selectively using
spin waiting for smaller requests. So what we've found is that with very fast storage devices, it
actually makes sense to hijack the thread that's issuing that request and sit in the kernel, spinning
and waiting for notification that the request is completed. For our device, this works out to be
about four kilobytes in size, so anything smaller than that, it's actually more expensive to switch
to another thread, try and start doing some work and the have to switch back to complete the
request at a later time. And so we've set it up so that, for small requests, we'll just spin in the
kernel. For larger requests, we'll actually allow context switches to happen so that more work
can get done.
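A minimal sketch of that completion policy, with hypothetical helper names (the 4 KB threshold is the one from the talk, and as discussed below the real driver also bounds the spin so it eventually gives up and sleeps):

    /* Spin for small requests, where the device answers faster than a
     * context switch costs; sleep for large ones. Hypothetical sketch. */
    #define MONETA_SPIN_THRESHOLD (4 * 1024)

    static int moneta_wait_for_completion(struct moneta_request *req)
    {
        if (req->len <= MONETA_SPIN_THRESHOLD) {
            /* Hijack the issuing thread and poll the per-tag status bit. */
            while (!moneta_tag_done(req->tag))
                cpu_relax();
            return 0;
        }
        /* Large request: allow a context switch, wake on the interrupt. */
        return wait_event_interruptible(req->waitq, moneta_tag_done(req->tag));
    }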
And so this change actually saves about five microseconds of latency, and there's a little bit of an
efficiency tradeoff here, because we are actually wasting CPU cycles waiting for a request to
complete. But we're also getting much better performance, and so maybe if you have one
application that's performance critical, you'd rather have the five microseconds than an extra
thread or something like that.
>>: I sort of lost track -- where does this five microseconds fit in? You said there was 97%
overhead from software. How much of that percentage does this account for?
>> Adrian Caulfield: I don't actually have it -- I don't have that broken down.
>>: Well, there's a picture of it.
>> Adrian Caulfield: Well, yes.
>>: In the base column, the software is only about 50%. The 97% was with a much bigger
denominator, and the dark blue bar between 15 and 20 is the wait bar, and it disappears. In the
highlighted column, the blue bar disappears.
>> Adrian Caulfield: Yes, so we've gone from about 18 microseconds or so of software latency
down to 10 or something like that, with all three of these optimizations together.
>>: How bad is the hit to the CPU?
>> Adrian Caulfield: So a lot of the time that we're spinning, we would have been just context
switching between different threads, so you sort of weren't going to get that much useful work
done, anyway. Certainly, for the very small requests, that's true, but as the requests get larger,
you sort of start to see this tradeoff. So the cutoff is around four kilobytes.
>>: You chose that cutoff to optimize for the latency, right? Not to optimize for the amount of
CPU overhead? Presumably the cutoff for CPU overhead would be smaller.
>> Adrian Caulfield: Yes. I mean, it depends on whether you're actually -- it takes about two
microseconds to do a context switch, so if you do that switch, you might get a microsecond of
useful work done in your application before you have to switch back. Maybe 20%.
>>: Okay.
>>: So in that wait block that disappeared, what are we waiting for there?
>> Adrian Caulfield: Basically, that's the amount of time that we're spending in the kernel
waiting for a request to finish.
>>: Okay, so the application thread waiting. So the idea is that instead of the application going
to sleep, if the request is small, the application thread comes down, issues it and expects that the
results will be back so quickly there's no point in even giving up the CPU.
>> Adrian Caulfield: Yes.
>>: Do you disable this optimization when the request queue is full? Because if you're going to
have to wait for not only the four kilobyte read, but also a whole bunch of...
>> Adrian Caulfield: So the actual implementation is not quite as efficient, but we'll basically
allow it to wait for as long as a four-kilobyte request would generally take, and then, at that
point, it'll actually go to sleep.
>>: So you do the rent-to-own kind of thing.
>> Adrian Caulfield: Yes, basically, it optimistically thinks it's going to be very quick, and then,
if it turns out that will not be the case, we'll go back and let some work happen.
>>: That's sort of the worst of both worlds though, right, because you've spun the CPU for a
while and then still did a double context switch.
>> Adrian Caulfield: Yes. Like I said, it's not the ideal case, and obviously we would...
>>: It would be better to try to model how long it's going to take based on how many requests
are queued and then make a decision.
>> Adrian Caulfield: Yes. So one of the challenges there is that you would actually have to sort
of keep a count of how many outstanding requests there are, and we've actually worked really
hard to make it so that we don't need any locks in the driver. We don't have state that's shared
across a bunch of threads because the cache misses actually take a significant amount of time.
Yes.
>>: I'm curious what this does to overall throughput.
>> Adrian Caulfield: Well, since you asked.
>>: I'm curious what happens when you push this slide at [inaudible].
>> Adrian Caulfield: That's right. So this is another graph with bandwidth on the y-axis and
access size on the x-axis. So the turquoise line at the top is our maximum PCI Express
throughput in one direction. Clearly, we're never going to quite hit two gigabytes because there's
overhead for PCI Express. But the blue line represents the baseline implementation, so we start
off with something about like 20 megabytes a second for very small accesses, and eventually, at
around 64-kilobyte requests, we can saturate our PCI Express bus. When we remove the IO
scheduler, we shift this curve to the left fairly significantly. And as we continue adding our
optimizations, this is removing a bunch of locking that happens in the kernel. So for requests
that are about four kilobytes or bigger, this has a significant impact, but for much smaller
requests, there's still a lot of overhead going through the kernel for accessing these data
structures, mostly because we're context switching. When we add in the spin waits, we can
actually recover a lot of that performance that we were losing.
And so we've gone from about 20 megabytes a second to 500 megabytes a second for very
small requests, and we get pretty good improvements for larger requests, as well, up to the point
where we saturate the PCI Express bus. So if we had a faster connection, we could actually sort
of continue these curves out to the right farther.
>>: So the cap there, your asymptote and your ideal, is the PCI Express.
>> Adrian Caulfield: Yes. That's PCI Express overhead, the cost of issuing DMA requests and
waiting for the responses from the memory controllers and things like that. There is a certain
amount of overhead with PCI Express connections anyway that you just can't get rid of. Okay,
so all of these accesses, we've done a pretty good job of removing a lot of the latency. We've
significantly increased our concurrency so we can get better performance, but all of these
accesses are to a raw device, so basically there's no file system sitting on top of a lot of these
accesses, and it turns out that file system performance is actually fairly critical. So we have a
system that looks like this, a bunch of applications running at the top, our kernel-level interfaces,
Moneta underneath. The file system actually hurts our performance quite significantly,
especially for writes, and this is because we're spending a lot of time updating metadata, and so
we go from something about 1.5 gigabytes a second down to about 200 megabytes a second for
write access, and this is with XFS on the Linux kernel.
And so we need to find a way of actually addressing this performance discrepancy. How can we
make our file systems as fast as the accesses that we can get to the raw device? So this is where
the next couple of principles that we have come in, in refactoring critical software across our
hardware, our operating system and the user space environment. And so in this case, we're going
to refactor the file system and the sharing and protection information that needs to exist to make
that work. We'll try really hard to not break compatibility with our applications, as well, so that
we don't have to go and rewrite a bunch of software to take advantage of these systems.
>>: Could you go back a step?
>> Adrian Caulfield: Yes.
>>: So your y-axis there, megabytes per second, is that megabytes per second of user data being
written into the file system, or is that megabytes per second of actual data going out to the storage
device, including the overhead that the file system is adding?
>> Adrian Caulfield: So this is user data. Yes. So we have a micro benchmark that essentially
is using the write system call to update file data. Okay, so let's look at refactoring and recycling.
We'll eliminate the file system and the operating system overheads. So we want to take a system
like this. We have applications, kernel level in the middle. The red box denotes trusted code, so
this is stuff that normally runs in the operating system and we're willing to say, okay, yes, it's
right. It has access to everything. And then Moneta lives at the bottom. And it turns out that
41% of the total latency and 73% of the remaining software latency goes to supporting
protection and sharing in our devices. So this is like the file system cost with the operating
system context switches and things like this.
So we're going to take this picture, and we're going to split it into a couple of pieces. We'll
maintain the application level at the top. We're going to shift some of the driver interfaces up
into the application. We'll split the trusted code into two pieces, so that the file system and the
IO stack live on the side, and the permissions checking block we're going to move down into the
hardware. And the key here is that we're separating the protection mechanism from the policy.
So on the left side, the kernel does both. It sets the policy and then it enforces it. On the right
side, the kernel is still going to maintain the policy and decide who has access to what, but the
permissions-checking block in the hardware is going to end up actually enforcing that policy.
This allows us to actually give the applications direct access to the hardware so that we can
eliminate operating system context switches and all of those costs. So to be able to do this, we
need four pieces. The first is a virtualized interface to Moneta so that every application can sort
of talk to their own device without having to compete with the other ones.
>>: I have a question.
>> Adrian Caulfield: Yes.
>>: This seems really orthogonal to non-real-time memories, right? This thing could have been
done even with classical storage.
>> Adrian Caulfield: Yes, it could, but the performance gains are basically not going to be as
useful. There's no -- if you're waiting seven milliseconds anyway, it doesn't really matter if you
spend some time in the front.
>>: Okay, got it.
>> Adrian Caulfield: So we need the user space library that's going to sort of take over some of
the functionality that we had in the kernel. We need our protection and enforcement block, and
then there are some operating system and application-level implications as to what changes and
what we need to be able to do to be able to actually make this system work.
So the first piece here is Moneta Direct virtualized interface, and the key here is that we're
virtualizing the interface and not the device. So all of the applications are seeing the same set of
blocks, but they each have a separate set of registers and tags and their own communications
channel with the device, essentially. And that channel contains a unique section of the address
space exposed by the device, which contains a set of control registers. They have their own set of
tags. I said earlier that we could track 64 tags per application. We need some way of signaling
interrupts back up to user space, and they have to have their own set of DMA buffers for
transferring data back and forth between the applications, as well. So Moneta Direct actually
supports 1,000 channels, and the point here is that we actually want all of the applications
running on your system to be able to take advantage of this, not just one or two specialized
applications. So it's really not a boutique interface.
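A hypothetical shape for the per-channel state, just to make the pieces concrete (only the 64 tags and the roughly 1,000 channels are numbers from the talk; everything else is invented):

    #define MONETA_TAGS_PER_CHANNEL 64
    #define MONETA_NUM_CHANNELS     1000

    struct moneta_channel {
        volatile void *regs;        /* private slice of the device's address
                                       space holding this channel's control
                                       registers                            */
        uint64_t       free_tags;   /* bitmap of the channel's 64 tags       */
        void          *dma_bufs;    /* DMA buffers pinned for this channel   */
        int            notify_fd;   /* user-space completion signaling, e.g.
                                       something eventfd-like (assumption)   */
    };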
So the second piece is adding this user space library on top of -- or underneath our applications,
and we're going to transparently intercept our file system calls, so if you would normally issue
read and write calls to a file descriptor, we’re going to intercept those and translate them into
direct writes to the registers in the device and do all of the copying of data in this user space
driver. So this library has to provide a couple of things that the operating system normally would
have. So from the file system, we need to be able to translate file offsets into actual physical
addresses that we can ask for data from the device, and we do that by retrieving and caching this
information from the operating system. So the first time you access a block of data, you still
have to make a system call, and during that time, we essentially say, okay, I want to access offset
100 in some file. It's going to return which physical blocks hold that data, and it's also going to
send a message down to the hardware that says this application has permission to access this set
of blocks.
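A rough user-space sketch of that path (all of the names here are illustrative helpers, not libMoneta's real API):

    #include <stdint.h>
    #include <sys/types.h>

    struct extent { uint64_t file_off, phys_block, len; };

    static ssize_t lib_pread(int fd, void *buf, size_t len, uint64_t off)
    {
        struct extent *e = extent_cache_lookup(fd, off);   /* hypothetical */
        if (!e) {
            /* One-time system call: translate this offset into physical
             * blocks and have the kernel install a permission entry for
             * that extent in the hardware. */
            e = moneta_map_extent(fd, off);                /* hypothetical */
            extent_cache_insert(fd, e);
        }
        /* Direct access: write this channel's command registers and DMA
         * into buf, without entering the kernel again. */
        return moneta_channel_read(e->phys_block + (off - e->file_off),
                                   buf, len);
    }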
>>: So can files be shared across applications, or can directories be shared across applications in
this mode?
>> Adrian Caulfield: Yes, so they can. It gets a little tricky if you have applications that are
using -- one application using this interface and one application using the operating system
interface, just because the block cache tends to annoyingly get in the way, and so we have some
coherency problems that you have to deal with, because writes will sit in that cache for 30
seconds before they actually show up on the device. Whereas applications using libMoneta see
only the data that's in the hardware, because there's no cache.
>>: Even if they're both using the same interface, if one of them writes, does the other get
notified somehow?
>> Adrian Caulfield: But it's not caching any of the actual file data, so if one of them writes, and
the other one issues a read immediately after it, it's going to see the new data.
>>: I can see that the file data is listed [at the] device, but it seems like the metadata -- like you
extend the file or something, that there might be an inode that changes, so how does that get
shared across applications?
>> Adrian Caulfield: So, for example, if you're extending a file, those applications, one
application is going to extend that file. The metadata is actually still managed by the kernel, so
you're going to make a system call to do that extension, and the other applications don't know
about whatever region was allocated, so when they asked for information about that file, they
were given information about what currently existed. And so if they want to then read that new
data, they have to ask the operating system to update the permissions entry and make that data
available to them.
So it does actually work, but if you're doing a lot of metadata-intensive updates, this is probably
not the right interface for that kind of workload. This works really well if you have files that
you're updating in place a lot or that have a lot of read and write traffic that update the file data
but not the metadata.
So from the file system, I said we need to be able to translate file offsets. We need to be able to
essentially implement POSIX compatibility as best as we can. We get pretty close to full POSIX
compatibility, but there are a bunch of interesting sort of rarely used features, like synchronizing
file pointers between processes, that we don't necessarily implement. And then we also need
some aspects of the driver to be able to talk to the device itself, and issuing complete requests.
So the third piece that we need to update here is to be able to actually enforce protection in the
hardware itself, and the key here is the file system is still setting the policy. Whenever an
application wants to access a file, it asks the operating system to update a permissions entry and
install it in the hardware so that that application will have permission to access it. The hardware
is really just caching all of these protection entries that the kernel is generating, and we do that
using a permissions table. This is an extents-based system. Every channel has its own set of
mappings, and at the moment, we share 16,000 entries across all the channels. We found this is
pretty good if you have one or more applications running. If you start running huge numbers of
applications, you might actually start running out of entries. If this was an ASIC, we could make
this memory a lot larger than we're able to with the FPGA-based design.
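A hypothetical layout for one entry in that table, to make the extent-based idea concrete (field names and widths are invented; only the roughly 16,000 shared entries figure is from the talk):

    struct moneta_perm_entry {
        uint16_t channel;      /* which application channel holds the grant */
        uint64_t start_block;  /* first block of the extent                 */
        uint64_t num_blocks;   /* length of the extent                      */
        uint8_t  rights;       /* e.g. read/write permission bits           */
    };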
Okay, so there's also a couple of operating system-level implications. With this system, we
haven't had to make any changes to the file system. We don't have any changes to the
applications, because we can just dynamically insert ourselves under the applications, but there
are some open questions. Your question hinted at what happens if you have a bunch of
applications running at the same time. It's fine as long as one of them isn't using the old interface
and one is using the new interface. And file fragmentation can actually be a bit of an issue for
this system, as well. Because we're using extents-based records, if you have a bunch of disjoint
sections of a file, they're going to use a lot more entries in our permissions table than one
contiguous region of memory would.
>>: It's fairly dense here. How is it that you have no changes to the file system, given that
you've taken the permission check out of the file system and put it into the hardware?
>> Adrian Caulfield: So what we've done is essentially -- there is an interface that most file
systems implement -- I mean, Linux, you can use it on any file system -- that will basically allow
you to query and translate a file offset into a region of physical addresses in the storage device.
And so what we've done is, basically, when you open a file through libMoneta, we query that
interface. We can get the range of blocks that are represented or that are storing that data, and
then you can just go and read and write that data without needing to communicate with the file
system anymore. So we've already done the permissions check to make sure that you have
access to open this file and to read and write from it, and then we've told the hardware that this
application now has access to this region of data.
>>: So there's a known interface by which you can query the state of the file system's permission
table, essentially?
>> Adrian Caulfield: Yes. It's more the actual layout of the data on the disk that we're interested
in. The permissions table is also part of that, but it's sort of the smaller piece.
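The interface being described sounds like Linux's block-mapping ioctls. As an illustration of that kind of query (not necessarily what libMoneta actually calls; FIEMAP is the newer, extent-based variant), the classic FIBMAP ioctl maps one logical file block to its on-device block number:

    /* Query a file's on-device layout with FIBMAP (typically needs root). */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    int main(int argc, char **argv)
    {
        int fd = open(argv[1], O_RDONLY);
        int blksz = 0, blk = 0;              /* logical block 0 of the file */

        ioctl(fd, FIGETBSZ, &blksz);         /* file system block size      */
        ioctl(fd, FIBMAP, &blk);             /* in: logical, out: physical  */
        printf("block size %d, logical block 0 -> physical block %d\n",
               blksz, blk);
        close(fd);
        return 0;
    }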
>>: The open itself is what checks the permissions, and then once the open has passed, then the
fact that you can query and get those extents back is what is authorizing [inaudible]. And then
this is coming back to what you said about mixing interfaces. You can't mix interfaces, because
if somebody is going through the file system interface, they might be changing the set of extents
that are involved in the file. That's just summarizing the previous.
>> Adrian Caulfield: Yes, essentially. I mean, the big problem with mixing the interface, it
actually varies depending on which file system you're using and whether it's actually relocating
data or something like that. But the big issue is that if I issue a write through the normal
interface, it's actually going to sit in the cache for a little while. And once it's been read once, it's
not actually going to go back to the disk and ask again, so you could have stale data sitting in the
cache.
>>: That's an issue with respect to consistency, but you also have an issue with respect to
security, which is a little scarier, right? I mean, if the other interface frees some blocks, then the
interface that's connected to libMoneta can now look at this extent that might get filled with new
data from another application that's not using the Moneta interface, and it shouldn't be seeing that
data.
>> Adrian Caulfield: Yes.
>>: Does the current system -- is that just a security flaw in the current design?
>> Adrian Caulfield: A little bit, and we actually have some future work that is looking at
actually using some caching, using the same interfaces for caching data and using Moneta as a
cache. So with that, we have a way of shooting down the permissions entries. I mean, certainly,
we could insert hooks in the kernel to remove the entries in the hardware before you move data
around, and that would solve this problem. It's not in the actual implementation for this work,
but it's certainly solvable. Yes.
>>: Would this work for something like a journaling file system, where the mapping of file
offset to a visible block is not consistent?
>> Adrian Caulfield: So as long as you're not moving those extents around underneath, yes. So
if another application is updating the data and moving it, writing into the journal, then those
extents are going to be moving and we're going to have to keep refreshing what permissions
entries are installed for the other applications. But once the data is on disk, it's not going to
move until the file system gets another update or you sort of write at the end of the log with the
same file offsets.
>>: Sorry. I wasn't talking about the permissions. I can see how this works for reading, that for
every offset there is a block dependent on this [inaudible]. But for writing, the translation of
where you're trying to write to the physical block is not something that you can predict in
advance, and so it doesn't seem like this would work with a journal.
>> Adrian Caulfield: So it doesn't maintain the journaling aspect of the file system, but you can
go and update the data in place, right? If you've already written that block of data, it has a
physical location on the disk, right?
>>: But the file system isn't going to expect you to be writing it there, though. It's going to be
expecting you to write in the journal.
>>: On the journal, are you talking about a log-structured file system, actually?
>>: Sorry, yes.
>>: So he's not talking about journaling in a conventional file system but a log-structured file
system. Communication is actually [inaudible].
>>: Would you be allowed to do that versus just the [inaudible].
>>: That's right, and where the data is located. So what Jeremy is saying is that in a log-structured file system, the data moves.
>> Adrian Caulfield: And so what I'm saying is, as long as -- right. We're basically breaking the
log-structured aspect of the file system. If you update the data through libMoneta, we can find
where that data is currently living in the device, and we can update it.
>>: I guess maybe I missed what your model is. In other words, are you creating a new file
system or are you creating a new interface to existing file systems?
>> Adrian Caulfield: We're creating a new interface to existing file systems.
>>: But if you're updating -- if you have a log-structured file system that expects a write to
happen at the end, if you just substitute it randomly into the middle, then if you later try to read it
using the traditional file system code that's sitting in the kernel, it's not going to know where to
find it, it seems.
>> Adrian Caulfield: No. It's going to be exactly where it would have looked for it before. As
long as it's not in the buffer cache already, it's going to go and read whatever the latest copy of
the data is. If you do a read in a log-structured file system...
>>: Can I give you an alternative answer?
>>: Yes.
>>: Log-structured file systems are there to reduce to seeks and write time.
>>: I understand that they're not just there -- we're trying to figure out if they're correct.
>>: Yes.
>>: If even in SSDs you have a reason to use log-structured file systems, can you avoid
overwrite in there? I mean, that's the portion I have been waiting to hear, if you are doing
anything on the [inaudible] mapping.
>> Adrian Caulfield: I mean, devices like Moneta don't actually need an FTL. The memories
have very simple wear management requirements, so we can actually get away without needing
to store a map for the whole device.
>>: Or is that because you assume that type of memory you are using is PCM, basically these
types of devices?
>> Adrian Caulfield: Yes. And so a system like this would work for flash, but the latency
difference is perhaps not enough to sort of require that we start moving towards different
interfaces to the storage.
>>: And I have other questions.
>> Adrian Caulfield: Yes.
>>: If you are talking about basically a PCM-based system -- but let's continue this.
>>: Before John pointed out that the question was moot because you wouldn't use a log-structured file system here, you were about to answer the question anyway, and I was about to
gain some understanding about how your system works. Can you just finish answering the
question of what would happen if you wrapped a log-structured file system with this?
>> Adrian Caulfield: Right, and so the short answer is, it will work, but you're sort of losing the
log-structured semantic of the file system. If you write data in a log-structured file system,
normally, it's going to change where that block is located on the disk and it's going to write it at
the end of the log and update a map, where it says, here is where the latest copy of this data is in
my journal, or in the log. And so when we do a read from that, you ask the file system, hey,
where's this data? It's going to tell you that it lives at the end of the journal, at location five, or
something like that. And so we can get that same information, and when we do a write, we can
go and update the journal itself -- perhaps not the last entry in the journal, but somewhere in the
journal this data exists, and it's the most recent copy, so we can go and update that data in place.
>>: Oh, okay. What I wanted to find out was what happens during a cleaning operation, but I
guess it just stops working.
>> Adrian Caulfield: Yes. So if you move the block of data, for those file systems...
>>: If you open a file and then the cleaner comes along and moves the blocks, you're hosed,
right?
>> Adrian Caulfield: Yes, so we don't cache the whole set of extents for a file as soon as we
open it. It's sort of on demand, so we ask for the extent that contains whatever offset you are
trying to access. And so what ends up happening in that case is you'll remove the permissions
entry for that extent, because the data isn't there anymore. The user space library will try and
perform that access and then it'll have to ask for permission again, and it will tell you where the
data is located. It could be made a little bit more efficient by having the right hooks in the file
system to essentially preemptively do that, but we don't do that at the moment.
Okay, so let's look at what the performance difference is, or what performance impacts we can
have from a system like Moneta Direct. Again, this is the same kind of graph, bandwidth on the y-axis, access size on the x-axis. The blue line represents the file system interface performance, so
if we're accessing data through the file system, performing a bunch of writes, then we get that.
The green line represents the raw device-level performance. This purple line represents the
performance of accessing our device -- a raw device -- through the user space level interface. So
you can think of this as a file system that has one extent that covers the entire device. And so
basically, we open the device and we can read and write to it. You have full access to it.
And what we'd like to see when we add the file system back on top of this is that the
performance will be as close to this purple line as possible. Yes.
>>: The user space one is doing better because the kernel space one is one where the user has
some data, gives it to the kernel and then the kernel gives it to the device?
>> Adrian Caulfield: Yes.
>>: And the user space one is just more directly through this?
>> Adrian Caulfield: Yes. So, basically, this is the cost of the context switch and the minimal
check of whether you have access to this raw block device. So there's a bit of performance improvement there.
We go from about 500 to 850 or 900 megabytes a second, and so we'd like to see the file system
level performance be as close to this as possible, as well. And so that's actually what we get. So,
basically, we've completely eliminated the file system overhead. There's a one-time cost when
you open a file to sort of read in the extents the first time you access that block of data, but after
that, we can just read and write to the file in place, and we don't have to go through the operating
system or the file system at all.
And so we've gone from the 1 million IOPS we had before for small accesses up to about 1.7, so
this is a pretty significant increase in performance. So another thing, we've worked pretty hard
to try and maintain the same interface to our file system so that you don't have to change your
applications at all, but sometimes it can actually be a little bit beneficial to go and rewrite your
applications to take advantage of new features that your storage devices have to offer. And so
libMoneta actually provides an asynchronous interface to the device, as well, so instead of
capturing read and write calls and intercepting them and translating them into accesses, we can
go and modify the applications to actually perform asynchronous operations. We'll start a
request, we'll go do some useful work, and then we can wait for it to finish later.
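A hypothetical shape for that asynchronous interface (the lm_aio_* names are illustrative, not libMoneta's actual API):

    struct lm_aio;   /* opaque handle for an in-flight request */

    /* Start an IO and return immediately; the data moves while the caller
     * keeps computing. */
    struct lm_aio *lm_aio_pread(int fd, void *buf, size_t len, uint64_t off);
    struct lm_aio *lm_aio_pwrite(int fd, const void *buf, size_t len,
                                 uint64_t off);

    /* Block until the given request has completed. */
    int lm_aio_wait(struct lm_aio *io);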
So this graph at the top, the red line represents the synchronous one-thread performance. So this
is one thread doing random accesses. The dark-blue line is one thread using our asynchronous
interface, doing random accesses, and the turquoise line is eight threads using the asynchronous
interface. And so we can get about a 3X performance improvement for 32-kilobyte requests
using our asynchronous interface. And we also changed the ADPCM decode stage from
MediaBench that's basically doing some audio processing to use this asynchronous interface so
we can be loading in the next block, processing it and writing out the previously processed block
at the same time. And we can get about a 1.4X gain doing that, as well, versus using our user
interface directly.
>>: What's it stand for, for ADPCM?
>> Adrian Caulfield: Sorry, what's the scale?
>>: What does it stand for, for the ADPCM? Basically, it's a DPCM recorder?
>> Adrian Caulfield: Yes. We're basically decoding PCM audio, I think.
>>: But why are you showing it here, basically? You are showing audio files for the ADPCM
format on the files, and then you try to read and decode it, send it to audio interface?
>> Adrian Caulfield: So we're basically converting it from the PCM encoding to another format
and writing it back to disk.
>>: And you are showing the speedup in terms of you [ph] can move this data.
>> Adrian Caulfield: Yes. Basically, so if you have a fixed-size file, say it's 100 megabytes,
you process it. It takes a certain amount of time with the synchronous interface, and with the
asynchronous one, we can do it 1.4 times faster. So, basically, we have one interface and we -- okay, sorry. We have this benchmark written using read and write system calls to essentially
load in a buffer, process it and then write that out and start processing the next one, and so that's
the baseline. And then the improved version does the reading of the next piece asynchronously,
and then it processes the currently loaded chunk and it will write out the previously processed
one at the same time, using asynchronous calls, as well.
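In code, the pipelined version of that loop might look like the following sketch, reusing the hypothetical lm_aio_* calls from above (adpcm_decode stands in for the real compute stage and is assumed to work in place):

    #include <stdlib.h>

    void decode_file(int in_fd, int out_fd, size_t chunk, size_t nchunks)
    {
        char *bufs[3];                        /* rotating buffers           */
        for (int i = 0; i < 3; i++)
            bufs[i] = malloc(chunk);

        struct lm_aio *rd = lm_aio_pread(in_fd, bufs[0], chunk, 0);
        struct lm_aio *wr = NULL;

        for (size_t i = 0; i < nchunks; i++) {
            lm_aio_wait(rd);                  /* chunk i is now in memory   */
            char *cur = bufs[i % 3];
            if (i + 1 < nchunks)              /* start loading chunk i+1    */
                rd = lm_aio_pread(in_fd, bufs[(i + 1) % 3], chunk,
                                  (i + 1) * chunk);
            adpcm_decode(cur, chunk);         /* process chunk i            */
            if (wr)
                lm_aio_wait(wr);              /* chunk i-1 finished writing */
            wr = lm_aio_pwrite(out_fd, cur, chunk, i * chunk);
        }
        if (wr)
            lm_aio_wait(wr);
        for (int i = 0; i < 3; i++)
            free(bufs[i]);
    }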
>>: What's the purpose here? You are demonstrating this benchmark. You are showing the
improvement using the asynchronous interface that you can basically issue a request and wait for
it -- basically just ping and wait for the data to come in?
>> Adrian Caulfield: Well, so basically, I'm trying to show that if we're willing to change the
interfaces that we write our applications with, we can actually get some fairly significant
performance improvements, as well. So libMoneta does a really good job of speeding up normal
read and write calls by translating those into accesses directly to the device, but if we're willing
to use an asynchronous interface, we can actually sort of start pre-fetching data before we need
to actually use it, and we can then do useful work while the data transfers are actually happening,
versus the sort of normal, synchronous interface.
>>: I saw it probably for your basically demo products, a database-like application for a
[inaudible] probably is most interesting. The main reason is this. All media applications, media
encoded and decoded applications, can be heavily pipelined. For instance, ADPCM audio
decoding can basically just read a bunch of ADPCM modules, decode, play it out. I mean, there
is no reason basically just as a small block in the makeup.
>> Adrian Caulfield: Sure. I mean, it's just a little example, basically, to say -- the existing
benchmark in MediaBench has done this as efficiently as possible using a synchronous interface.
It's already pipelined and it's doing the right things to get good performance, but if we change the
interface, we can actually do slightly better. Certainly, other applications are going to have
different benefits and costs and things like that.
>>: Is the performance improvement you're showing in your graph application specific?
>> Adrian Caulfield: So this is just -- again, it's a little micro-benchmark, so it probably depends
a lot on what processing you're actually doing, but basically, we can actually issue more IO
requests with one thread. And so if you have a thread that can generate a lot of IO before you
need it, then you can get this benefit.
>>: I'm a little confused how to square this result with the one from 12 slides ago, where you
said that waiting is a bad idea because the overhead of setting up the wait is longer than we
expect to wait for the request to come back. That suggested that we want to be synchronous and
not asynchronous, but here you're suggesting the opposite.
>> Adrian Caulfield: I don't think that they're incompatible. Basically, if we have a large
number of threads, it makes sense to allow some of those to wait for IO requests, but again, that's
in the synchronous, I'm going into the kernel and I'm paying a pretty heavy context switch cost to
essentially get into the kernel and issue my request. And so the penalty there is when you have
to context switch back to other threads, right, there's a large overhead.
In this case, we have one user space-level thread that's going to issue a bunch of requests and
potentially be able to do useful work, without ever paying any of those context switch costs. So
we're getting microseconds of time back here that we would have just otherwise been spending
switching into the kernel and coming back. It depends on what work you're trying to accomplish
while your accesses are happening, as well. Certainly, if you're reading tons and tons of data and
you know that you're going to need it a second before you actually do, you should do that
asynchronously and then do your work, and then the data will be ready by the time you're
actually ready to access it.
>>: One of those lines is [inaudible].
>> Adrian Caulfield: It's A cores [ph].
>>: A cores [ph], okay. I mean, you could also have workflows that use multiple threads instead
of asynchronous operations. If you did that, would you want to turn off the optimization from
the previous slide?
>> Adrian Caulfield: Of?
>>: Of waiting rather than [inaudible]?
>> Adrian Caulfield: It depends. A lot of these things depend on exactly what workload you're
running. So if you have a large number of threads, if each of those threads could actually be
issuing a number of IO requests at the same time, it makes sense to have as much work as
possible for the storage device to do in the queue, ready, already on the device. So if you can
actually generate those requests and then continue doing useful work, asynchronous interfaces
make sense. If you can't make any progress until the data comes back, you really want the
lowest latency possible, and at that point, spinning and saying -- so that you're ready as soon as
the data comes back is probably the better choice. I don't think it actually really matters how
many threads you have. For each thread, you sort of have to make that decision of do I want this
request to process as soon as the data is done, or am I okay waiting a little bit of extra time but
having 10 requests complete and the data available for those. That's kind of the tradeoff.
Okay, so this is sort of the optimizations for local storage. We've also looked at how we can
extend Moneta-like devices out onto the network and attach multiple devices together and form
distributed storage solutions. And so we've applied our three principles to distributed storage, as
well, and essentially we're looking at how we can reduce block transfer costs, how much extra
time does it take to actually go from one device to another one, things like Fibre Channel and
iSCSI are sort of the examples of what I'm talking about here.
We can also refactor other features into hardware, if you wanted to, like replication, things like
that, and we want to be able to make sure that we can continue using the existing shared-disk file
systems that are out there, continue allowing use of our user space access libraries and things like
that. Sorry, so basically we go from a situation like this, where we've done a very good job of
decreasing the latency for our local storage requests, but then we add in some sort of software
stack for sharing devices out over the network, like iSCSI, and we've basically gone back to
squandering our performance gains that our memory technologies have to offer, and so we want
to be able to do something about this. And so to study this, we essentially took a bunch of
Moneta Direct devices, connected them all to a network, and now we have a situation where there's
a large amount of storage distributed throughout our network, with a little bit of storage attached
to each node. So we have the opportunity to take advantage of data locality, as well, so if the
applications are aware of what data is stored on what nodes, we can take advantage of that at the
same time.
To build this device, we take our existing Moneta architecture and add a network interface to it.
And so now whenever data is in the transfer buffers, we can actually just generate a packet and
send that to another device or vice versa. Requests can come in over the network and we can
process them locally. And so we have a very low-latency interface to the network from within
our storage device. And we also get to take advantage of the permissions checking blocks and
the virtualization components that already exist, so we can actually do the permissions checks
before the requests go out over the network.
So the first thing we looked at here was reducing the block transport costs, and so these are sort
of the protocol and physical layer costs for accessing storage remotely, so this is independent of
what it costs to actually access the memory. It's independent of the operating system and the file
system overheads. It's essentially you take a disk and you attach a network interface to it, how
much extra time does it take to use that network interface versus attaching the disk locally? So
with QuickSAN, we are able to use raw Ethernet frames. We have extremely minimal protocol
overhead. We're essentially just sending the destination addresses and about eight bytes of data
with each request to signal where it comes from, and we're taking advantage of flow control at
the Ethernet level to make sure that we have a reliable network, so we don't have to deal with
retransmitting packets and things like this.
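As a rough picture of how small that protocol can be, here is a hypothetical frame layout; the talk only says the header is essentially a destination address plus about eight bytes, so everything else is invented for illustration:

    /* Hypothetical on-the-wire layout for a QuickSAN-style request carried
     * in a raw Ethernet frame. Not the actual QuickSAN format. */
    #include <stdint.h>

    struct quicksan_frame {
        uint8_t  dst_mac[6];   /* Ethernet destination: the remote SSD     */
        uint8_t  src_mac[6];   /* Ethernet source, for the reply           */
        uint16_t ethertype;    /* some private EtherType (assumption)      */
        uint64_t req_id;       /* ~8 bytes identifying the request/source  */
        uint8_t  opcode;       /* read or write                            */
        uint64_t block;        /* starting block on the remote device      */
        uint32_t len;          /* transfer length                          */
        uint8_t  payload[];    /* write data, or read data in the reply    */
    } __attribute__((packed));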
And so we can go from something like iSCSI, which has a latency of about 280 microseconds,
down to about 17 with QuickSAN. And for comparison, Fibre Channel adds about 86
microseconds of latency overhead, and this is doing all of the packet generation and processing
in hardware.
>>: Which side is running the permissions check?
>> Adrian Caulfield: So the local storage side in this case is actually handling the permissions.
So you have to trust the network in that you're basically on a...
>>: So local, you mean the guy who is issuing the read-write requests, not where the bits are
located.
>> Adrian Caulfield: Correct. And you could do it both ways. This way, if you don't have
permission, you find out much sooner, and if you have a shared disk anyway, all of the nodes see the
same set of permissions and metadata.
>>: I'm confused by the Ethernet flow control thing, because I remember, like, there's the
physical there, that's Ethernet flow control. It's typically [inaudible]. What's Ethernet flow
control?
>> Adrian Caulfield: So it's part of the 802.3 standard. Basically, there's a mechanism where NICs
and switches can send each other packets, like pause frames, that say don't send me
anything else for some number of microseconds or so. And so it's actually -- it's not as
commonly used, but other storage area network technologies do essentially the same thing. They
require reliable networks, and so they have some form of flow control built in.
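For reference, here is a rough sketch of the fields in an 802.3x pause frame. This is illustrative only, not code from the QuickSAN hardware; the layout reflects the standard as commonly described (MAC Control EtherType 0x8808, PAUSE opcode 0x0001, pause time in quanta of 512 bit times).

    #include <stdint.h>

    struct pause_frame {
        uint8_t  dst_mac[6];     /* reserved multicast address 01:80:C2:00:00:01 */
        uint8_t  src_mac[6];
        uint16_t ethertype;      /* 0x8808 = MAC Control */
        uint16_t opcode;         /* 0x0001 = PAUSE */
        uint16_t pause_time;     /* in quanta of 512 bit times; 0 resumes transmission */
        uint8_t  padding[42];    /* pad to the 64-byte minimum frame size (before FCS) */
    } __attribute__((packed));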
>>: There are reasons to drop packets other than full queues, but you said you weren't doing any
packet retransmission, so is that up at a higher layer, or does it happen there?
>> Adrian Caulfield: So in the systems that we've built, basically, we have a dedicated network,
and so we have not had to deal with any sort of packet loss in our layers. But, basically, we
could deal with it by essentially detecting if the request doesn't actually complete and reissuing it
at the software layer or something like that. But because we have sort of guarantees of how
much buffer space is available...
>>: But I was talking about packet losses not related to full queues at all, like a checksum that
fails or something.
>> Adrian Caulfield: Yes, like I say, we haven't run into that particular problem. I think the
right way to do it with this solution would be to actually have timeouts that say if I don't get a
response back in a certain amount of time, issue it again. But we were trying to minimize the ACKs
that we have to send around and the data we have to keep around, so that we can reuse the buffer
space for other requests.
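A minimal sketch of that timeout-and-reissue idea at the software layer follows; everything here is hypothetical (QuickSAN itself relies on link-level flow control and did not need this), including the timeout value, which is simply chosen to sit well above the measured 17-microsecond request latency.

    #include <stdint.h>
    #include <stdbool.h>
    #include <time.h>

    #define REQUEST_TIMEOUT_NS  (2 * 1000 * 1000)   /* ~2 ms, far above a ~17 us request */
    #define MAX_RETRIES         3

    struct pending_req {
        uint64_t issued_ns;       /* when the request last went out */
        int      retries;
        bool     completed;
    };

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    /* Called periodically; reissues any request that has gone quiet for too long. */
    void check_timeouts(struct pending_req *reqs, int n,
                        void (*reissue)(struct pending_req *))
    {
        uint64_t t = now_ns();
        for (int i = 0; i < n; i++) {
            if (reqs[i].completed)
                continue;
            if (t - reqs[i].issued_ns > REQUEST_TIMEOUT_NS &&
                reqs[i].retries < MAX_RETRIES) {
                reqs[i].retries++;
                reqs[i].issued_ns = t;
                reissue(&reqs[i]);     /* resend the original read or write */
            }
        }
    }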
>>: How do the other systems, the iSCSI and Fibre Channel, deal with that? Do they do
retransmission?
>> Adrian Caulfield: I'm not 100% sure on Fibre Channel, but iSCSI runs on top of TCP/IP, so
they've sort of got the reliable communications channel underneath.
>>: Can you comment on the work of [inaudible] RDMA or converged Ethernet, and basically
how IO works [inaudible]. Does that share some similarity with that work?
>> Adrian Caulfield: Yes. There are some similarities with RDMA. We're doing it a little bit
differently. It's remote direct storage access instead of remote direct memory access, so we're
not actually putting data into RAM on another machine. It's going through the SSDs and things.
The permissions-checking pieces are certainly not there in RDMA and things like that. And as
far as converged Ethernet goes, this is sort of moving in the same direction, right? We want to
be able to use networks for both normal Internet traffic, as well as storage traffic. One of the
pieces of that is having the ability to support flow control in part of that channel and things like
that. But because we have dedicated network interfaces to the storage, we're not really dealing
with other forms of traffic at the same time.
Okay, so let's see what the performance improvements we can get with devices like this are. So
we're comparing against an iSCSI stack here, and essentially, iSCSI performance is pretty poor
because of those really high software overheads that I talked about. Fibre Channel would be
somewhere in between these two numbers. Basically, because QuickSAN has this much lower
protocol overhead, we can get fairly significant performance improvements, as well. And we
also compare this against a centralized version of this configuration. So this is a more typical
SAN setup, where you have one large server in the middle that has a bunch of storage attached,
and then you have a number of clients that are sharing that storage. So we can model that by
having a bunch of Moneta boxes or QuickSAN SSDs connected to a single storage server, and
they're all connected to the network, as well. And then we have one or more client interfaces that
are using the same device, but we're not touching or sharing the storage attached to
that device.
And so if we look at this, we get about a 27X improvement compared to the 32 or so that we
were getting with the distributed case. And this is because some of the data was available locally
in the other configuration, and so we actually get a locality benefit. Some of the accesses that
you're performing don't actually have to go out over the network. iSCSI is pretty bad in both
cases, and that's because of the software overheads. There's not really a huge benefit to moving
the data out into the network, at least in terms of raw throughput for completely random
accesses.
So if we look at a workload running on top of this, we've implemented a distributed sort, an
external sort, so this is often used by systems like MapReduce that process large data sets, and this
is based on TritonSort, which was developed at UCSD by George Porter. And, basically,
distributed sorts work in two phases. The first step partitions your data, which starts out
completely random on all of your nodes. We're going to partition that into ranges of
key values stored on each node. So if you have four nodes, we split the key space into four
ranges and move the relevant pieces onto the storage devices
that handle that part of the key space.
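As a sketch of what that partitioning step looks like, here is a minimal version in C; the record format and the write_to_node() helper are hypothetical stand-ins for a QuickSAN-style local or remote write, and the key-to-node mapping simply splits a uniform 64-bit key space into equal ranges.

    #include <stdint.h>
    #include <stddef.h>

    #define NUM_NODES 4            /* the four-node example from above */

    struct record {
        uint64_t key;
        uint8_t  payload[92];      /* e.g. 100-byte records, common in sort benchmarks */
    };

    /* With a uniform 64-bit key space split into equal ranges, the owning node
     * is determined by which slice of the key space the key falls into. */
    static int owner_of(uint64_t key)
    {
        return (int)(key / (UINT64_MAX / NUM_NODES + 1));
    }

    /* Streams each record to the storage that owns its key range; writes whose
     * destination is the local node never need to cross the network. */
    void partition(const struct record *recs, size_t n,
                   void (*write_to_node)(int node, const struct record *r))
    {
        for (size_t i = 0; i < n; i++)
            write_to_node(owner_of(recs[i].key), &recs[i]);
    }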
So that part uses the storage area network significantly, because we're essentially writing data
from every node to every other node. Once the partitioning is done, we can sort the data, and
this happens locally on each of the nodes, so wherever that range of the key space is, we end up
with a sorted range of keys. So QuickSAN can actually leverage the nonuniform access
latencies, because we have some of this storage distributed out into the network, and this gives us
some pretty significant performance improvements. So comparing iSCSI to QuickSAN and the
centralized and distributed versions, so this is the time to sort 102 gigabytes of data spread
across, I believe, eight nodes. So it takes about 850 seconds to do this sort with iSCSI in the
centralized form. We get about a 2X speedup from using the distributed version, and so in this case iSCSI is
actually benefiting from having a lot of the data locally, as well. And QuickSAN does about
three times better than the iSCSI implementation on this particular workload, and again there's about a 2X
difference between the centralized and the distributed cases.
>>: How much purchase [ph] are you using and what are the Ethernet interfaces in these?
>> Adrian Caulfield: Yes, so the storage, the Ethernet interface is 10-gig Ethernet. They're
Nehalem machines, I believe, eight Nehalem machines.
>>: It seems -- I guess this is a follow up to Jim's question about your audio encoder. This
seems like a really bad application benchmark, because this is a classic big data application
where you know exactly what data you're going to read. You don't have to issue small writes or
reads. You can issue huge ones and you can do them asynchronously all the way out to the end of
the file the moment the app starts. There's no -- in other words, chasing after latency and trying
to reduce the latency of a request only matters if you're blocked until the request finishes, but
that's not the case at all with storage. With storage, you know exactly what you're going to be
reading for quite some time, so there's no point in doing small reads and writes. You may as
well do the large ones.
>> Adrian Caulfield: That's true, and that's one of the reasons that the iSCSI implementation is
actually not that much slower than the QuickSAN one, right, because we're actually able to do these
large data transfers. We've looked at other applications, as well, and clearly, if you're doing
large random accesses or small random accesses all the time, a device like QuickSAN is going to
be much better than the equivalent iSCSI.
>>: Right. I'm surprised that iSCSI is slower at all, because as long as you have enough requests
outstanding, the network is always busy, or the underlying device is always busy, whichever is
slower, and it doesn't really matter what the latency of the underlying requests are. It just matters
-- all you care about is the utilization, the throughput, and not the latency in the underlying
requests.
>> Adrian Caulfield: Yes.
>>: So is it that iSCSI for some reason is underutilizing the channel there, or what's happening?
>> Adrian Caulfield: It's possible. We're reading at least a megabyte at a time in this workload,
and so the overhead that iSCSI has is actually just a lot of the software stack that exists. So even
though we're using it in an ideal case for this workload, there's still...
>>: When the overhead is high, are you saying then that the whole thing is CPU bound, or that
you're spending much of your time processing iSCSI overhead on a CPU?
>> Adrian Caulfield: I'm not actually sure what the breakdown was in CPU usage for this. I
imagine that there is a fair amount of overhead, but basically, because you're sort of doing a lot
of operating system context switches -- the iSCSI stack is actually pretty involved. If you look at
how these devices work, if you trace the flow through...
>>: But that shouldn't matter unless the CPU is the bottleneck resource. In other words, you can
do as many context switches as you like if what you're waiting for is for big, giant, bulk transfers
to come over the network or something. So I'm trying to understand if you're making an
argument here that the bottleneck resource was the CPU, and so reducing CPU overhead sped it
up, or is something else happening?
>> Adrian Caulfield: I think we had available CPU cores, so I don't know that the actual amount of
processing power was a problem. It's possible that the one thread or several threads that are
doing these transactions are actually not able to get through all of those layers of protocol
overhead fast enough. I mean, I imagine that's probably the case, but basically, when you issue
an IO request or an iSCSI request, it goes down into the kernel. It's like a virtual block device,
and then there's a user space iSCSI daemon that runs, so you end up going back into user space,
and then you go back down.
>>: So let's say that takes a second, so just do a lot of them.
>> Adrian Caulfield: Sure. But in that case, then, we clearly don't have enough CPUs to have a
second of latency absorbed by that many threads. So that's probably where...
>>: So if it's literally a second a core, I agree, but you said you're not CPU bound, which
suggests you can keep launching threads and do more in parallel.
>>: I think the short version of Jeremy's comment is that sort is a heavily parallelizable application.
Considering the time that...
>> Adrian Caulfield: Okay. So let me conclude the Moneta section here. Essentially, emerging
non-volatile memory technologies offer huge potential for extracting more knowledge out of the
data that we've been able to acquire over the last few years, but it's going to require better system
design to be able to do this effectively. And the key here is we have to avoid squandering a lot
of the performance that these technologies offer. And software overheads are
becoming the critical piece in that puzzle at the moment. So I've shown you these three
principles, reducing software overheads, refactoring pieces across all of the system stack, and
recycling pieces of that stack that we can, like not rewriting our file systems and things like that.
So we've been able to reduce overheads by about 84%. We can get 30X higher 512-byte IOPS.
We're going from 20 megabytes a second to 500 megabytes a second for those transfers. And I've
shown you that distributing storage out into the SAN is perhaps a good idea, and that we can
build these very low-latency interfaces to network storage, as well. So let me just give you a
quick overview of the timeline of the Moneta work. So in 2010, we had our first prototypes. We
had papers in Supercomputing and Micro. I was the lead student on this project and basically
designed all of the infrastructure, built about 60% of the hardware and all of the software stack to
make the system work.
In 2011, we released Onyx, which is an actual phase-change memory SSD prototype built on top
of Moneta, so one of the students in our lab built a DIMM with a bunch of phase-change memory
on it and we dropped those into the BEE3 boards. Performance is not that close
to the projected PCM latencies, but it's first-generation PCM. 2012 was the Moneta Direct work,
so we're doing direct user space access at that point, and then this year, at ISCA, we'll be
showing off QuickSAN.
I've also worked on a couple of other projects at UCSD, so we had one of the first sort of wimpy
data-centric computing nodes in Gordon. This was actually published at ASPLOS in 2009 and
selected for Micro Top Picks, as well.
I've looked a little bit at NV-Heaps, which is looking at how we can attach non-volatile
memories to the DRAM bus, and essentially how we implement a transactional memory
interface to sort of protect those technologies and make them accessible and easier to program.
And I've also built several hardware platforms for doing flash memory characterization and SSD
research, as well. But that was back in 2010 or so. So I'd also like to look a little bit at a
couple of future directions in which we can move storage research. So one of these is whole-system
optimization, the Moneta-style work. I'd also like to look at how we can make compilers smarter
at actually interfacing with storage and what we can do to get more optimizations automatically
by adding features to the compilers. Also, I think there's a lot more to be done looking at storage
and networking.
The first piece here is system-wide optimization. So Moneta-style research, where we're
building hardware and tuning all of the layers of the software stack on top of that, is the kind of
thing that I like to work on. And I think as fast storage systems become more available, it's
going to drive changes in other areas of the system stack as well, things other than storage.
And I think this style of work is a good fit for looking at a number of problems.
I'd also like to look at storage-aware compilers. So I think that this cartoon at the bottom gives
you an idea of what I'm talking about. At the moment, when you write an application and
compile it, you use read and write system calls, and those are essentially black boxes that the
compiler can't reason about and can't do anything with. And storage actually has some pretty
well-defined semantics, and so we should be able to promote storage up into the realm of things
that compilers can reason about. If we have a well-defined interface like libMoneta, we can
actually start having compilers automatically convert calls from synchronous interfaces to
asynchronous ones or to eliminate redundant read and write calls to the operating system and
things like that.
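To give a flavor of the kind of transformation that means, here is a sketch of a blocking read being turned into an issue-then-wait pattern. The moneta_read_async() and moneta_wait() calls are hypothetical stand-ins for a libMoneta-style asynchronous interface, not a real API, and the helper functions are declared only so the sketch is self-contained.

    #include <stddef.h>
    #include <unistd.h>                    /* read() */

    /* Hypothetical libMoneta-style asynchronous interface (illustrative only): */
    typedef struct moneta_req *moneta_req_t;
    extern moneta_req_t moneta_read_async(int fd, void *buf, size_t len);
    extern void         moneta_wait(moneta_req_t req);

    /* Application-side work, declared just to keep the sketch self-contained: */
    extern void do_independent_work(void);          /* does not touch buf */
    extern void consume(const char *buf, size_t len);

    /* What the programmer writes: the read blocks for the full storage latency. */
    void process_sync(int fd, char *buf, size_t len)
    {
        (void)read(fd, buf, len);
        do_independent_work();
        consume(buf, len);
    }

    /* What a storage-aware compiler could emit, once it can see that
     * do_independent_work() does not depend on buf: issue the read early and
     * block only at the first real use of the data. */
    void process_async(int fd, char *buf, size_t len)
    {
        moneta_req_t req = moneta_read_async(fd, buf, len);
        do_independent_work();              /* overlaps with the storage access */
        moneta_wait(req);
        consume(buf, len);
    }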
And I'd also like to look more at storage and networking. So QuickSAN is really taking the first
couple of steps at how do we attach really low-latency storage to networks and what are the right
interfaces for those? But I think there are a lot of broader system design questions. We're not
really looking at what synchronization primitives we should be supporting for distributed file systems
to keep things in sync. We're still using the same locks that we had before. Things like
replication consistency concerns are problems, as well, and I think we need better interfaces and
protocols between nodes to really be able to take advantage of these technologies. So thank you,
and I'd be happy to take any more questions that you guys have.
[applause]
>>: Any quick questions?
>>: Where does the name come from?
>> Adrian Caulfield: Moneta? So it's actually the Roman god of storage or something, or
goddess of storage or something. If you look on the NVSL webpage, there's a picture of a coin or
something that somebody has that has Moneta on it.
[applause]