>>Galen Hunt: All right. It is my pleasure to introduce Mike Swift. Mike and I have
known each other since Mike was a graduate student at UW and I used to go admire
his Nooks papers and then go up to him afterward and go, wait a minute, this has got
to be a lot harder than this. What are you going to do about [inaudible] and things
like that? At which point, Mike would have the discussion of the difference between
product code and research code, things like that.
And, oh, Edsy [phonetic], you’re in good shape! Edsy showing up. And let’s see, I
guess the other thing. So we actually did try to hire Mike, but he had a life-long goal
of being a professor.
>>Mike Swift: That’s true. My father is a professor and I had to live up to his
aspirations.
>>Galen Hunt: At the point in the selling conversation where Mike said, look, I saw
my dad teach when I was twelve and I’ve had a dream to be a professor ever since
then. So he had to go try.
>>Mike Swift: You know, at this point I can’t remember if that’s really true or not.
[inaudible] since I told you it was true. Whether it was in a dream or just an
assumption --
>>Galen Hunt: It made me feel better in letting him go to Wisconsin. So anyway,
we’re glad to have him back to tell us about the work he’s been doing.
>>Mike Swift: Okay, thanks. Anyway, thanks y’all for coming. I’m going to be
talking about the work that I’ve been doing on looking at the interface of solid-state storage devices like flash or phase-change memory. So to put this work in context, I
think I’m probably most known for work on device drivers and I’ve sort of expanded
since then. And my real interest is kind of the bottom half of the operating system
where you’re really focusing on interactions with the hardware and what that looks
like.
So with device drivers I first looked at reliability and then I’ve been doing work on sort
of different models on how you program drivers, or where the code runs and
running it in languages other than C. In terms of memory I spend a lot of time
looking at transactional memory and how that relates to the operating system. How
do you virtualize it, handle page faults and context switching? Recently I have some work on dynamic processors, where the processor can reconfigure, and again, what does
this mean to the operating system?
Most recently I’ve been looking at solid-state storage. And my interest here really is
saying if we have these new devices, what does it mean to the operating system?
Are they really just faster, smaller, lower power disks? Or are they something
different that we should treat differently and think about new ways of interacting
with them? So if we look at storage trends over the last ten years, I think as all of you know, there’s been this huge trend from sort of spinning [inaudible] disks to solid-state disks, and then hopefully we’ll get to this. This is the Onyx prototype from UCSD, which is built on phase change memory, which is sort of again an order of
magnitude or two faster than flash storage.
And so what’s interesting about this is as you look at these devices, these newer
devices are not just smaller faster disks, but they have dramatically different
properties and they behave very differently. And I think that’s what makes it
interesting to look at how we interact with them. So first looking at flash drives, the
key property of flash drives, as I look at it, is that [inaudible] the storage medium is
something you can’t directly overwrite. You have to erase large blocks of flash, which is pretty slow, on the order of milliseconds, before you can write to it.
And what this means is that inside most flash drives there’s a piece of software
called the flash translation layer that will take incoming addresses and then route them
through some kind of mapping table to find where the data actually lives within the
device. In the device, data is written usually in some kind of log structured format
where whenever you write new data it writes it to a new location and then you can
asynchronously garbage collect the old data. So this is very different than
traditional disks where you just overwrite data directly.
In addition, a key feature to flash is this limited endurance where you can only
overwrite flash a limited number of times, which means that you have to spend a lot
of effort in terms of where you do these things to sort of level your writes out across
the whole device. And the net result is you’ve got this fairly sophisticated piece of
software on a processor inside your flash device doing all this interesting work.
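As a rough illustration of that indirection, here is a minimal FTL-style sketch with the flash simulated by an in-memory array; the names, sizes, and the omission of erases, garbage collection, and wear leveling are all simplifications, not any vendor’s firmware.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096
#define NUM_PAGES 4096              /* 16 MB of simulated flash */

/* Simulated flash plus the FTL's mapping table. A real FTL also tracks
 * erase blocks, garbage-collection state, and per-block wear counts. */
static uint8_t  flash[NUM_PAGES][PAGE_SIZE];
static uint32_t map[NUM_PAGES];     /* logical page -> physical page */
static uint32_t log_head;           /* next free physical page in the log */

/* Writes never overwrite in place: append at the log head, update the
 * mapping, and leave the old physical page as garbage to collect later. */
void ftl_write(uint32_t lpage, const void *data)
{
    uint32_t ppage = log_head++;            /* assume space; GC not shown */
    memcpy(flash[ppage], data, PAGE_SIZE);
    map[lpage] = ppage;
}

void ftl_read(uint32_t lpage, void *buf)
{
    memcpy(buf, flash[map[lpage]], PAGE_SIZE);
}
```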
So the second new piece of storage technology is storage class memory, which I
know there’s been a lot of work at Microsoft. And the key features that I’m
interested in are first that you can sort of attach it to your processor in a way that
makes it act a lot like DRAM. And this means that instead of using a device driver
with a block interface, you can potentially just use normal load and store
instructions within the processor to access memory. Furthermore, it’s very fast, so
it’s faster than flash, and this means that software overhead matters a lot because all
that software overhead really sort of hides the short access time.
So the things that this potentially enables are things like having very fine-grained data structures that we make persistent, because we’re not limited to just writing out data 4k at a time; we could write out four bytes at a time if we wanted to. And furthermore, it allows perhaps direct access from user-mode programs because if
we can just make it look like memory, programs can do load-and-store instructions
and they can potentially just do load-and-store instructions against this storage
class memory to access persistent data.
So, again, this is very different than disks where it is very slow and you need a
device driver. So if we look within the operating system, though, and we see how
does Linux, for example, abstract these kinds of devices? What we see is that really
not a lot has changed. That if you use flash as a file system, which is what most
people do, you still have your standard read and write for reading and writing data
and then sync for forcing data out and making it persistent. You can do mmap and msync if you want, but it still is more or less just like a file in a file system.
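Concretely, the file interface being described comes down to a handful of standard POSIX calls; a minimal sketch (the path and sizes are arbitrary, error handling omitted):

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    char buf[4096] = "hello";
    int fd = open("/tmp/example.dat", O_RDWR | O_CREAT, 0644);

    /* Block-style access: read/write plus an explicit sync for durability. */
    pwrite(fd, buf, sizeof buf, 0);
    fsync(fd);                      /* force the data out to the device */
    pread(fd, buf, sizeof buf, 0);

    /* Memory-style access: map the file, store through the mapping,
     * then msync to make the update persistent. */
    char *p = mmap(NULL, sizeof buf, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    p[0] = 'H';
    msync(p, sizeof buf, MS_SYNC);

    munmap(p, sizeof buf);
    close(fd);
    return 0;
}
```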
Lower level within the operating system there still is standard block interface,
which has been around for a while, which is basically reading a block with bread. And then you can submit an asynchronous I/O request using submit_bio to read or
write a block. But it’s still just transferring a block of data in or out at a given
address. No acknowledgement whatsoever that these things are more
sophisticated.
So the reason this is important is that these devices really are fundamentally
different and they have really interesting features that using the old interface
conceals. So if we look at flash SSDs, they have address translation built in. And I
think every operating system person knows that address translation is a great
feature that we’ve been using for all kinds of nifty things like copy-on-write, and
shared memory, and protection and things like that. And this great feature is built
into devices you can buy. But we can’t get at it to do anything interesting with it.
[inaudible] garbage collection, which again is a very powerful general-purpose
technique for managing data structures that is being done inside the device, but it is
not accessible outside the device to do anything interesting.
And it would be nice if there were a way to make these devices available to
applications. So one thing you can look at is anybody who is building high
performance applications that run on flash typically tries to do their own garbage collection on top of it, where they’ll sort of write things out sequentially, because sequential performance is so much better, and then they’ll do their own user-level garbage collection where they compact and copy data. And that’s the
same thing the device is doing internally and there’s sort of a duplication of effort
here.
[inaudible] for storage class memory we’ve got this really low latency access, we’ve got byte addressability, but if you go through a file system you lose all that. So
applications don’t get sort of the raw latency that this can offer you. So to that end,
what I’ve been trying to figure out is how can we expose these features to the
operating system and applications better by looking at the interface that they
present. Instead of saying everything must be a file, is there something else that we
should be doing here because files may not be the end-all be-all of data access.
So specifically for any kind of device that has internal management I’d like to expose
those internal management features to applications or to system software so they
can take advantage of them, at least coordinate their activity. And for anything that
has really sort of a low latency, I would like as much as possible to sort of bypass all
the software layers on top of that and really get that low latency right in front of
applications so they can use it.
So with this in mind, I’d like to talk today about two projects that we’re doing. One
is a paper that we presented at [inaudible] called FlashTier, and it’s one
approach at sort of rethinking the interface to solid-state disks by looking at what
would we do if we were building a solid-state cache instead? If we knew we were
building something just for caching. And then second I want to talk about a paper
we had at ASPLOS last year on a system called Mnemosyne,
which is how do you abstract storage class memory and make it available to user
mode applications?
So the reason I’m interested in Flash as a cache is it’s a really natural use for flash.
So first of all, flash is a lot more expensive than disk. So if you’ve got a lot of data it
could be too expensive to move everything into flash. It’s maybe 10 times more
expensive. But at the same time it’s so much faster, maybe 100 times faster or more
that you really want to get the advantage of that performance. And so a natural
thing to do is to stage all your hot data into a flash device, and then hopefully most
of what you need to access -- I’m jumping ahead of myself -- most of what you need
to access is on the flash device and you can get to it very quickly.
So this is such an obviously good idea that pretty much every operating system,
except maybe Mac OS, has this built in at this point. Pretty much every storage
vendor has a product that does something like this. Fusion IO sells a lot of caches, I
think OCZ has some caches that sort of come with hard drives.
Yes?
>>: [inaudible]
>>Mike Swift: So there is a block-level filter driver that acts as a cache, called bcache, that sits under any file system or database and will then transparently send blocks to either the flash device or the disk. So it’s a lot like, I think, Turbo Boost -- is that the Microsoft caching product, or ReadyBoost? I forget which one.
>>: [inaudible]
>>Mike Swift: So I think it’s similar to Turbo Boost, which does transparent caching at the block level in Windows Vista and beyond, I think. So furthermore, big service
providers -- Yes?
>>: So this doesn’t take any management from the [inaudible]?
>>Mike Swift: So today most people who are doing this don’t take advantage of the
non-volatility. They actually, well in a way they only write out clean data. They’ll do
write through caching. And then most of these systems will basically delete the
cache when you reboot the system. So they don’t actually use the non-volatility.
And I’ll talk a little bit about how we can exploit that. You could do it, but in a lot of
cases people don’t trust the flash devices as much as they trust the disk, despite the
fact that I know there’s data from Intel that shows at least their SSDs are much more
reliable than disks. Intel had a study where they replaced all their laptop hard drives with SSDs and the failure rate with laptops went down by a factor of five.
So they actually saved a lot of money putting in SSDs. Anyway, so a natural way to
do this, as I said, is to do it as a block device because you can do this basically
underneath the file system and you don’t need to rewrite all your code to take
advantage of this. You don’t need to modify NTFS or EXT2 to do this. Basically the
file system talks to a generic block layer and then in the block layer you put a filter
driver, which acts as a cache manager. And the cache manager has metadata about
all the blocks that are being cached and when a request comes in it can sort of
decide, is this block located on an SSD or is it located on disk and where should I
read it? So on a read it will consult either one. On a write, if it’s write through, it
will write to both, or it can do write-back and sort of write the data to the SSD, in
which case the SSD acts like a really large write buffer, absorbing new data as it gets
written and then you can later de-stage it back to disk.
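As a rough sketch of the dispatch logic just described, here is what a cache manager’s read and write paths might look like; every helper below (ssd_lookup, ssd_io, disk_io, ssd_insert) is a hypothetical placeholder, not a real kernel API.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t lba_t;

/* Hypothetical helpers standing in for the filter driver's internals. */
bool ssd_lookup(lba_t disk_lba, lba_t *ssd_lba);   /* cache-manager mapping table */
void ssd_io(bool is_write, lba_t ssd_lba, void *buf);
void disk_io(bool is_write, lba_t disk_lba, void *buf);
void ssd_insert(lba_t disk_lba, void *buf, bool dirty);

/* Read: consult the mapping; on a miss, go to disk and populate the cache. */
void cache_read(lba_t lba, void *buf)
{
    lba_t ssd_lba;
    if (ssd_lookup(lba, &ssd_lba)) {
        ssd_io(false, ssd_lba, buf);
    } else {
        disk_io(false, lba, buf);
        ssd_insert(lba, buf, false);     /* cached copy is clean */
    }
}

/* Write: write-through updates both copies; write-back lets the SSD absorb
 * the write and de-stages it to disk later. */
void cache_write(lba_t lba, void *buf, bool write_back)
{
    if (write_back) {
        ssd_insert(lba, buf, true);      /* dirty until de-staged */
    } else {
        ssd_insert(lba, buf, false);
        disk_io(true, lba, buf);
    }
}
```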
So there are a lot of advantages to this, one is that it’s transparent to most software
in the system because it’s at the block layer below everything else, and two you’re
using a commodity piece, an SSD. So you don’t need to go build a special-purpose
device for this. So you can buy any SSD and plug it into your system and take
advantage of this. So for example, some of the hybrid drives out there just have a
stock SSD built into the hybrid drive with some additional technology on top of it to
do this caching stuff. So you can also do this sort of inside the device.
So the downside with doing this though is that it, in my opinion, maybe it’s not just
my opinion, is that caching is actually a really different workload than a storage
workload where you’re actually trying to store data persistently.
So one thing that’s different is the address space. When you are having a caching
device, the address that -- every block that you’re caching has its own address on the
disk already, which means that when you write it out to the SSD you need to
remember what was the original address that this block was stored at. So that’s
why the cache manager needs to have this table of every block that it’s caching that
says for this disk address this is the corresponding address on the SSD.
Furthermore, if you want to make the cache nonvolatile, you need to write that mapping table out. And that’s expensive enough that a lot of people don’t bother
writing it out. And that’s a big reason for sort of recreating the cache after a failure.
Second is the issue of free space management. So caches already have to manage
free space, they want to use as much capacity as possible and they think about what
to evict when they get full. So similarly SSDs are doing free space management
because they’re absorbing new writes, they have to garbage collect old data, they
have to figure out what to garbage collect and how to sort of manage the free space.
So again there’s things going on at two layers that could be coordinated. For
example, it might be beneficial if the SSD, instead of garbage collecting data that was
about to be evicted, would just delete the data instead. If you’re going to evict it, why bother paying the cost of moving it around when you’re not going to use it again?
And then finally there’s this problem of cache consistency and durability. We might
like the cache to be durable across reboots which would allow us to use it as a write
buffer also and sort of take advantage of non-volatility. But because of the cost and
complexity of persisting this mapping, this translation of disk addresses to flash
addresses, a lot of flash caching products don’t do that. For example, I don’t think the Microsoft version or the Solaris versions of caches are persistent across system crashes.
So the key point here is that a caching workload looks very different than a storage
workload and as a result we might want something different out of the device. Yes?
>>: Are there write workloads in your experience where having the SSD in front of the disk is actually harming performance? Like does it actually get in the way? Or does it strictly do no harm --
>>Mike Swift: We haven’t seen that yet, but I will say our experience is limited
to a small number of traces from the SNIA website which may or may not be
representative. But I can definitely see if you have something that is just doing a lot
of writes and you’re basically not reading anything, then writing it out to the SSD
really doesn’t buy anything because you’re just going to fill up the SSD and then
write it out to disk and it’s just sort of added expense and latency in the system.
So I think that would be the number one case.
>>: And latency does go up? Can’t you do them in parallel?
>>Mike Swift: Well, the overhead goes up because you’re spending more software
time managing the cache. So you can do the things in parallel. So to address this we
built a system called FlashTier, which basically starts with the normal caching
architecture. The first thing we do is we take this SSD at the bottom and replace it
with what we call a solid-state cache, which again is designed to be a commodity
system that anybody can build with a standard interface, but it’s built to be a cache.
So the first thing we do is we have a unified address space, which means that you
can store data at its disk address directly instead of having to translate disk
addresses into flash addresses. Second it has cache aware free space management,
so it knows about which data might be cold data that you could evict instead of
garbage collecting when you want free space.
And third it has an interface where it has commands that are relevant to caching, for
example, the ability to mark data as clean or dirty, or to forcibly evict data from a
cache for consistency purposes. And finally, and something I don’t want to spend a
lot of time on, is it also has strict consistency guarantees that allow you to reason
about which data in the cache is usable following a system crash. Because you can
know for sure that, for example, anything that you read from the cache is always the
most recent version if you follow the protocol.
On top of this we also modify the cache manager to basically use the system
interface and we’ve implemented both a write-back and a write-through policy.
Yes?
>>: So you’re replacing the flash itself here as far as what you get to re-architect
when you do this?
>>Mike Swift: We are modifying the flash translation layer effectively. That interprets SATA commands or SCSI commands and then figures out what to do.
>>: So only the [inaudible] say, you don’t have within your designing ability, say,
making your blocks a little bigger or anything like that?
>>Mike Swift: We have not done that at this point. We’re sort of trying to just sort
of focus on that translation layer and the commands that come in over the SATA
interface.
>>: Is that something that can be modified on [inaudible]?
>>Mike Swift: It cannot be done. As I’ll talk about there’s a prototyping board
where you can write your own flash translation layer. And we’re in the process of
implementing this system on this prototyping board.
>>: But for practical use --
>>Mike Swift: For practical use our model is we would like OCZ and Intel to build
devices that actually have this interface into it. That would be our productization
goal. It wouldn’t be something that Microsoft could then do purely in software,
unless there were vendors that let you rewrite their firmware and I think that’s
unlikely. Yeah.
Okay. So I now want to say a little more about the design and I’ll focus on the three
biggest issues, which are sort of the address mapping, address translation, managing
free space, and the new interface -- sort of the commands in the interface. So here
[inaudible] the problem with the address space is that we basically have multiple
address spaces going on. We’ve got disk addresses, so the host operating system
has this translation from disk addresses to flash addresses. And then inside the FTL
we have another mapping table that translates flash addresses into the physical
location.
And so we’ve got double translation going on. It really doesn’t buy us anything.
Yes?
>>: [inaudible]
>>Mike Swift: So they have a, they don’t map every single block. They have a small translation table, more like a look-aside list, that says if you’re on this list. It’s very much fixed-size; there’s a limited number of things it can remap. Actually if you look at this technology, though, there’s this technology called shingled writes, that allows you to basically double or triple your density by overlapping data, which means that you can’t overwrite data anymore, because you write a track this wide but you sort of overlap tracks on top of each other like shingles. And so you can read this narrow strip of it but you write something three times wider.
>>: [inaudible]
>>Mike Swift: Right. So you can’t do in-place writes and so disks are likely going to sort of have the same problem of not being able to do random writes anymore without --
What?
>>: Do you think shingles are the --
>>Mike Swift: So in talking to people at [inaudible] they think that shingles are
going to happen because the reason to use a disk is lots and lots of storage. If you
need performance you’ll get an SSD. And so their take is, you know, the disk will
basically get really, really big and dense and then anything you really need random
write performance from will use flash. And sort of further distinguish the two layers
from each other.
But I don’t know on my own, that’s just what I hear. Okay. So the approach that we
take is to say, well, let’s get rid of the translation table in the cache. Let’s do the translation directly in the device. So here you can send a disk address right to
the device and we can translate it from the disk address right to the physical
location in flash. The second thing we do is we change the data structure here.
Instead of kind of a page-table data structure, which is good if you have a really
dense address space, because you have a big page that has lots of translations and
everything is filled in. In a disk this makes sense because the number of addresses you translate is the same as the number of physical locations you have, whereas we have a system where the number of addresses you translate could be ten or 100 times larger.
So we could have like a ten-terabyte disk system that we’re then caching with a
ten-gigabyte cache. And so we have potentially a thousand times more addresses.
So we use a hash map data structure instead, which is optimized much more for a
sparse address space instead. And we have some evidence that sort of caching workloads are likely to be much sparser than a normal storage workload, just because the hot data is going to be sort of more widely distributed.
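One way to picture the difference: the table is keyed by sparse disk addresses but only needs as many entries as the flash can hold, which is exactly what a hash table gives you. A simplified sketch with linear probing (the sizes and hash function are illustrative):

```c
#include <stdint.h>

#define NBUCKETS (1u << 20)          /* roughly the number of cached 4 KB blocks */

struct map_entry {
    uint64_t disk_lba;               /* key: address on the backing disk */
    uint32_t flash_page;             /* value: physical location in flash */
    uint8_t  used;                   /* slot occupied? */
    uint8_t  dirty;                  /* per-block cache metadata */
};

static struct map_entry table[NBUCKETS];

static uint32_t hash_lba(uint64_t lba)
{
    return (uint32_t)((lba * 0x9e3779b97f4a7c15ull) >> 44) & (NBUCKETS - 1);
}

/* Linear probing; returns the matching slot or an empty one. The caller
 * checks ->used to distinguish a hit from a miss. Assumes the table is
 * never completely full. */
static struct map_entry *map_find(uint64_t lba)
{
    uint32_t i = hash_lba(lba);
    while (table[i].used && table[i].disk_lba != lba)
        i = (i + 1) & (NBUCKETS - 1);
    return &table[i];
}
```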
Okay. So the second thing we do is sort of change how we do free space
management. So the normal way that free space management happens in an SSD is
through garbage collection. So you might have multiple blocks here. And remember
the rule is that you can write sort of small pieces of a block that are called pages, but when you erase, you have to erase a much larger block that’s maybe 256
kilobytes or bigger.
So when you do garbage collection, typically the process is like in a log-structured
file system where you find multiple blocks that have some valid data. You then sort
of coalesce this valid data into another block. And then you can erase a block and
you get a free block out of it. And so the problem with this is that we basically have
had to read a whole block, write a whole block, we have to erase two blocks because
we’ve erased both of these. We have to start with one empty block to start this
whole thing off. So it’s a pretty expensive process.
Furthermore, every time you copy data, you’re writing to the disk and this sort of
uses up its write capacity because it has limited endurance. There’s a second
problem, which is perhaps even worse, is that the performance of garbage collection
goes way down when you have a full device because you have a lot -- the chance of
these blocks being sort of filled with stale things is much lower when your device is
almost full. So this means that you have to read, you have to sort of copy a lot more
data to make free space when you’re almost full.
It’s sort of like when you defragment your disk and it’s almost full it’s a lot harder
because you have to move a lot more data around. So there’s data from Intel that
shows that at least for their SSD’s if you use them to their full capacity, then write
performance goes down by 83%, as compared to if you reserve 30% of the capacity
as basically spare.
Yes?
>>: A little clarification. What are the small boxes and the big boxes?
>>Mike Swift: So the small -- sorry about that. So the big boxes are what is called an
erase block. And this is the unit of data you can erase. So you can only do erases in a
large block size of like 256k. You can do writes in much smaller granularity, which
are like about 4k or so, sometimes larger. And so the idea is you can write to each of
these individually, but you can only erase the whole thing. Sorry about that. It’s a
good question.
Does that make more sense now?
Okay. So the other problem is that endurance, because we’re doing more garbage
collection, we’re doing a lot more writes to move data around, our endurance goes
down by 80% also. So this means that if you use a normal SSD for caching and you
actually use the full capacity, it may not last very long because you’re doing all this
garbage collection.
And finally, because we’re doing caches, we actually want to have the device full.
We want to get a good hit rate having as much data as possible within the device. So
the approach that we take is to say, you know what? This is just a cache. We don’t
actually need the data. So it might be beneficial sometimes to actually just delete the
data instead of moving it around.
This is actually an idea that I have to sort of credit Jason Flinn [phonetic] for. We
were talking about flash devices and he said, you know, why bother actually storing
data, let’s just delete data instead and things would be a lot easier. And that’s what
we do.
So to do this we have to make sure the device knows which data is dirty and which
is clean. And once we know which data is dirty and clean we can find a block here
that has only clean data. And when we need some free space we can just delete the
data instead of copying it around. And the nice thing is we have now made an erase
block without any extra reads or writes. We just have to erase the one block and we
delete the data.
And if the data is indeed fairly cold, then it has very little cost of doing this. So it can
be much more efficient than normal garbage collection. So we’ve implemented two
very, very simplistic policies here that really don’t even look at usage patterns at all.
But we basically segment the disk -- the flash drive -- into two regions. One is called the log region, which is where new writes come in and they’re basically just sort of randomly accessed and things are just written in log order. So this also holds the hot data that’s being written, and here we still do garbage collection because this data was recently written, it’s likely to be accessed again soon we find, and so deleting that data really hurts performance because we get a lot of read misses.
This other portion is called the data region and this is colder data that tends to be
stored more sequentially. And here we just evict this cold data when we need space.
And in this model when we evict data we just recycle the blocks to become data
blocks. So here we have a fixed-size log and a fixed-size data region. And an alternative is sort of a variable-sized log where, if we’re having a lot of writes, we can make
the log much larger than normal. And this means we have a larger region to sort of
coalesce data from when we do garbage collection. And then we still will do eviction
from the data region but we recycle the block to make a new log block.
And we can also take a log block where if you write all the data sequentially, we’ll
sort of convert it into a data block. The reason this is important is that we can also do the address translation for these regions at different granularities. So this, the sort of randomly written new data, we translate at 4k. And this is sort of much more sequential data, so we translate it at an erase block granularity of 256k. And so
this saves on the memory space needed for translation by partitioning things this
way.
And it’s a fairly common approach from what I understand in real solid-state drives
too.
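The free-space decision described above fits in a few lines; the helpers are hypothetical stand-ins for the device’s internal bookkeeping.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device-internal helpers. */
bool block_all_clean(uint32_t eblock);      /* no dirty cached pages inside? */
void invalidate_mappings(uint32_t eblock);  /* drop the LBA -> flash entries */
void erase_block(uint32_t eblock);
void copy_valid_pages_out(uint32_t eblock); /* classic GC: relocate live data */

void make_free_block(uint32_t victim)
{
    if (block_all_clean(victim)) {
        /* Caching case: the data also lives on disk, so just forget it.
         * One erase, zero extra reads or writes ("silent eviction"). */
        invalidate_mappings(victim);
        erase_block(victim);
    } else {
        /* Fall back to normal garbage collection for dirty or hot data. */
        copy_valid_pages_out(victim);
        erase_block(victim);
    }
}
```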
Yes?
>>: Any [inaudible] policy for bringing a block first time into the cache? Is it like
when it is first accessed?
>>Mike Swift: So we have two -- I’ll get to that in a couple of slides. Yeah, when I talk
about the cache manager. The cache manager implements the policy of when to put
data into the cache. Okay. So the next thing is the caching interface, which is what
do we need to add to a solid-state drive to make it a better cache?
So what we need is first some way to sort of do cache management to identify
which blocks are clean and which are dirty so we know what we can evict. And we
also need some way to sort of ensure consistency, so for example, if you write data
directly to the disk and bypass the cache, you might want to forcibly evict data from
the cache so that you’re sure that it never stores a stale copy of data.
You also might want to change the state of a block from being dirty to being clean
when you do write it back. So to address this we basically take the existing read and
write block interfaces and sort of tweak them a little bit. First of all we have two
variations on write where you can write data back and say that it’s dirty, meaning
that you shouldn’t erase it on purpose. Or you can write it back and say that it’s
clean, which means there is a copy somewhere else in the system, and if you need
free space you can erase it.
For read, the only thing that we do differently is that we now return an error if you
read an address that isn’t there. So normally if you have an SSD and you read an
address that has never been written to it will just make up some data and give it
back to you. Typically zeros or something like that, but it could return whatever
was written last time at that physical location. There’s no specification for what
happens when you read unwritten data.
Here you know for sure that something isn’t there. So the nice thing about this is
that if you want to access the cache, it’s always safe to just read a disk address off
the cache. And if it’s there you’ll get the data. If it’s not there you’ll get an error.
You’ll never get invalid data.
And then finally, for cache management we have two operations. One is evict, which
invalidates a block. And the second is clean, which basically marks a dirty block as
clean.
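Pulled together, the command set just enumerated might look something like this from the host side; the names and signatures are illustrative only (in the prototype these operations travel over the ordinary SATA command path to the SSC).

```c
#include <stdint.h>

typedef uint64_t lba_t;

/* Illustrative host-side view of the SSC commands described in the talk. */
enum ssc_status { SSC_OK, SSC_ENOTPRESENT };

/* Two flavors of write: dirty data must be kept until cleaned or evicted;
 * clean data may be silently erased when the device needs free space. */
enum ssc_status ssc_write_dirty(lba_t lba, const void *buf, uint32_t nblocks);
enum ssc_status ssc_write_clean(lba_t lba, const void *buf, uint32_t nblocks);

/* Read returns SSC_ENOTPRESENT rather than made-up data on a miss, so the
 * host can always probe the cache safely. */
enum ssc_status ssc_read(lba_t lba, void *buf, uint32_t nblocks);

/* Cache management: drop a block outright, or mark a dirty block clean
 * after the host has written it back to the backing disk. */
enum ssc_status ssc_evict(lba_t lba, uint32_t nblocks);
enum ssc_status ssc_clean(lba_t lba, uint32_t nblocks);
```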
>>: So it’s a little bit of a trade off with the read interface because I’d imagine
[inaudible] and if you miss you just [inaudible]. It just means that --
>>Mike Swift: So in this system we assume that this is sort of a stand-alone drive
and it cannot go to the disk itself. So it’s not a stacked interface. The cache manager
will first go to the SSC and if it’s a miss will then go to the drive. So that piece is all in
the software. And the reason we want that is because it allows you to sort of plug
this into any drive system you want. You can put it in front of a raid array. You can
put it in front of a network file system. It can work with any form of block storage.
>>: It seems like you keep just a small bit of state in memory in [inaudible] to know
whether or not the read exists.
>>Mike Swift: Yes. So the nice thing -- that’s a great point -- is that this is always correct. Which means that you can have imprecise information in memory, sort of like a Bloom filter or some other approximate data structure, that says is it
worth checking the cache? And if it is you can go and you’re guaranteed to get the
latest data or nothing. And if it says it’s not there you can go right to the disk and
avoid the latency of going into the cache.
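To make that concrete, the host-side hint can be as crude as a single-hash bit array: a false positive only costs one extra probe of the SSC, and correctness never depends on the hint because the SSC itself is authoritative. The sizes and hash below are arbitrary.

```c
#include <stdbool.h>
#include <stdint.h>

#define FILTER_BITS (1u << 24)                 /* 2 MB of host memory */
static uint8_t filter[FILTER_BITS / 8];

static uint32_t hint_hash(uint64_t lba)
{
    return (uint32_t)(lba * 0x9e3779b97f4a7c15ull) % FILTER_BITS;
}

/* Called when a block is inserted into the SSC. */
static void hint_set(uint64_t lba)
{
    uint32_t h = hint_hash(lba);
    filter[h / 8] |= (uint8_t)(1u << (h % 8));
}

/* "Maybe" is enough: false positives are safe (the SSC read just misses),
 * and bits are never cleared in this simple sketch. */
static bool hint_maybe_cached(uint64_t lba)
{
    uint32_t h = hint_hash(lba);
    return filter[h / 8] & (1u << (h % 8));
}
```

On a read the host would check hint_maybe_cached() first and go straight to the disk when it returns false.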
>>: Since it’s so cheap to go to the cache, would you want to speculatively go to the disk and then cancel the IO?
>>Mike Swift: You certainly could. We haven’t looked into it. I know from talking to
some people at [inaudible] that they really do worry about the cost of cache misses.
And so they spend a lot of effort in not adding latency in that case. So that would
make sense. Yes?
>>: [inaudible]
>>Mike Swift: Right. This is all implemented inside the SSC. So these are the
commands that are being sent over the seta interface into the SSC instead of the
normal read block write block.
>>: So if you had more memory on the SSD, or other types of memory as opposed to flash, would that change the way you store the data structures? Could you do more if you had more computation?
>>Mike Swift: We certainly wouldn’t go with this mechanism of sort of having the separate data region mapped at a coarser granularity; we would not do that because there’s a lot of overhead in sort of translating between 4k blocks and 256k blocks. But I don’t think we’d necessarily change it. As you’ll see, we do expand a little bit the amount of memory in the device because we now have a hash table instead of a page table, which is a little bit less efficient. And we also have to store metadata about the blocks. So we have to store a bit that says is this block dirty or clean?
If we had more memory I think I’d add space for things like last access time also. But
already we keep all of the metadata about all the blocks in memory, so that like a
read miss is really just a memory operation on the SSC and it doesn’t actually have
to go to flash. So that’s why I wouldn’t change too much.
Yes?
>>: [inaudible]
>>Mike Swift: So we thought about that. And in talking to various people, and
particularly Fusion IO, one of the things they said is they really find it’s important to
have some management software on the device itself. And the reason is that the
flash chips themselves are so crazy and hard to use that things like what is the frequency with which you reread a block, or handling wear management, is really
important. And they find that by having this on the device they can actually improve
their write endurance by a factor of two or three just by doing a better job on the
device itself.
And so for that reason we want to have at least some management code on the
device. And our other goal was -- you know I think there’s a huge benefit to having a
standard interface that you can just plug a device into, instead of having to write a
device drive for operating system. And so we try to come at it with an approach of
let’s not require a lot of OS code, as much as I love drivers and writing them, but I
think that most storage vendors would rather have a device you can just plug in
with no required kernel-level software, particularly for Linux where you have to
rewrite it every three months for each new version of the kernel interface.
So I think that’d be a great design but we sort of chose a different set of constraints. So finally we have an exists interface. So there’s a problem that if you do crash and come back, you need to know which blocks are dirty so you can clean them again. And so we have a command that allows you to probe for the existence
of dirty blocks.
So on top of all this we have a cache manager and we’ve implemented two caching
policies so far. The first is write-through, where the cache manager stores no state
whatsoever. And basically every write is written both to the SSC and to the disk so
we cache all writes. And then on reads, every read goes to the SSC and if we miss we
go to the device and we populate it. So we have sort of the simplest possible caching
strategy here. Yes?
>>: In terms of the dirty blocks, I assumed you [inaudible]?
>>Mike Swift: So we actually -- this is the piece that I said I wasn’t going to talk
about the consistency. So we actually do internally use sort of atomic writes
strategies like the atomic writes from Fusion IO where they use --
>>: [inaudible]
>>Mike Swift: We don’t use a super cap. We basically, we don’t assume we have a
super cap but we actually do synchronous logging of metadata updates when it’s
important. And so that’s part of our performance model. And for writes we do
assume that the device, Fusion IO has this idea where you can basically write a bit
that says this is -- because you’re writing everything to new data you can actually
detect a [inaudible] write because you can see have you ever overwritten this or is it
cleanly erased? And so we assume we can do atomic block writes using that.
That’s a great question.
For write back it’s very similar. Reads are exactly the same, where we try both. On writes we will write it back to the device and we populate a table in the cache manager that says this is a dirty block, and we have sort of a fixed-size table here. So if we have
too many dirty blocks, then we’ll start cleaning blocks and writing them back. And
in our experiments we set this to 20% of the total size of the cache. And then finally
when the system crashes, we use the exists primitive to sort of go and reread to find
out what all the dirty blocks are.
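A compressed sketch of that write-back policy (helper names are hypothetical; the 20% threshold is the value used in the experiments described):

```c
#include <stdint.h>

#define DIRTY_LIMIT(cache_blocks)  ((cache_blocks) / 5)   /* 20% of the cache */

/* Hypothetical cache-manager helpers. */
uint64_t dirty_count(void);                 /* size of the in-host dirty table */
uint64_t pick_dirty_block(void);            /* choose a block to clean */
void     disk_write(uint64_t lba, const void *buf);
void     ssc_read_block(uint64_t lba, void *buf);
void     ssc_mark_clean(uint64_t lba);
void     dirty_table_remove(uint64_t lba);

/* When too many blocks are dirty, de-stage some to disk and mark them
 * clean on the SSC so it may silently evict them later. */
void maybe_clean(uint64_t cache_blocks, void *scratch)
{
    while (dirty_count() > DIRTY_LIMIT(cache_blocks)) {
        uint64_t lba = pick_dirty_block();
        ssc_read_block(lba, scratch);       /* fetch the latest copy from the SSC */
        disk_write(lba, scratch);           /* write it back to the backing disk */
        ssc_mark_clean(lba);
        dirty_table_remove(lba);
    }
}
```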
Okay. So that is sort of the essence of the design. Now, I want to talk about our
evaluation. And our three goals for this are, you know, performance is obviously a
good one, reliability, and then is there some efficiency savings in terms of memory
consumption?
So we implemented this. We took Facebook’s flash cache manager, which they use in production to cache data in front of their MySQL databases. We took a flash timing simulator that was published by Kim et al. We do trace-based simulation on a number of different traces from the Storage Networking Industry Association (SNIA).
here are sort of the parameters for our device.
We more-or-less tried to set the parameters about like an Intel 300 series SSD,
which is sort of a medium grade, latest generation SSD. So here are the traces. We
have two traces at the top that are very, very write-intensive. They’re 80 to 90%
writes. And we have two traces that are very, very read intensive that are sort of 80
to 90% reads, and so this sort of covers different aspects of their performance. Yes?
>>: [inaudible]
>>Mike Swift: The blocks are 4k blocks. So you can see this is a really small
workload. And so what, actually I should mention this, that we size the cache to be
able to hold one-quarter of the total number of unique blocks. And so we assume
we’re going to cache the top 25% of things. And so the kind of things, to be honest,
we could set this any size and get any performance we wanted. At one point the
student made it so big that everything was in the cache and performance looked
really good. And I had to say that doesn’t actually prove anything when everything
is in the cache and you’re not running any of our code.
So we figure 25% is reasonable. We get a miss rate that’s measurable so that we can
see the impact on misses. So here are the performance numbers; in red is the SSD.
We then have the two variations, sort of the two garbage collection or silent eviction
policies, SSC and SSC-V with write through in blue and then write back in green.
So here are the write intensive workloads. And this is where performance really
gets helped because of the silent eviction policy. So what we can see is that
particularly for write back, we get about 168% performance improvement, or 268 -no 168% performance improvement over the baseline system. And this is because
we’re doing a lot less garbage collection. So we’re getting a lot of writes here and
then once it’s being cleaned we can just delete the data instead of moving it around.
>>: [inaudible]
>>Mike Swift: Right. It’s the baseline system that is currently in production use at Facebook.
>>: [inaudible]
>>Mike Swift: Yes. So we’re still using a cache and this is not compared to a disk; this is still with an SSD, sort of the same performance model of SSD as a cache. And you can also see that for a write-heavy workload, the write back definitely out-performs
write through because we’re basically using it as a write buffer. So if you overwrite
something it’s sort of one less write that has to go to disk, which is pretty much what
you’d expect.
For read performance, you know, I’d say performance is really unchanged. And
the read path is basically the same. We’re not going to make flash act any faster.
We’re still translating addresses, so that doesn’t help. So what this really measures
is what is the impact of our really simple eviction policy, where we’re willing to evict
any clean block that we’re storing. So it’s basically a random eviction policy, and we
can see I think we have about a 10% miss rate, that even then we still get a very
good hit rate on the read-heavy workloads, even though it’s so simple. Yes?
>>: On the [inaudible] because flash is better for media, you probably want to have
an [inaudible] policy where you enter the block in the flash cache only after it
proves itself to be hot enough.
>>Mike Swift: Right. I think that would be a good idea because we could definitely
bring it in the page cache in the operating system, and then if it’s referenced a
number of times, then write it back into the flash cache, instead of just putting it
there on the first access. So I think that’s a great idea.
>>: Didn’t you also have a double translation?
>>Mike Swift: We do get rid of the double translation but that’s not a big expense.
In the host memory it’s a simple hash table look-up that takes, like, 20 nanoseconds
or something. And our translation we timed the speed of using a hash table instead
of a page table, and we see that it adds like five nanoseconds onto the time to do a
translation.
>>: If you knew that you had to put this in the firmware, on the flash, on whatever persistently [inaudible]. The performance of things like looking up the page table versus looking up the hash table might make a big difference.
>>Mike Swift: Right. So we are currently implementing this there. And I agree that
a low-end ARM processor -- I think the device we have is a low-end ARM processor and so I do think that there might be a bigger difference. But, you know, even if we were
10 times slower, the cost of that translation is so much less than the time to actually
access flash, that I don’t think that’ll add a noticeable amount of latency. It might
mean though, that we do need a slightly beefier processor to keep up with the
translation.
>>: And do you know, I mean this is some sort of prototype, which you’re using, I
don’t know if the actual production model where [inaudible].
>>Mike Swift: This is a commercial SSD controller. So it’s the Indilinx controller
from, I forget. I think Indilinx is a company that makes controllers that OCZ and
Intel use in their products. And so this is a commercial controller but we can
provide the firmware for it. Do you have a --
>>: I’m just curious. How often do these reads get [inaudible]?
>>Mike Swift: So this is a trace replay. So this is basically taken beneath the buffer
cache. So this is just the thing, the traces are just the things that already went to
disk. So when we have the real prototype working, we’ll have a lot more
information about how this interacts with the buffer cache.
Yes?
>>: [inaudible]
>>Mike Swift: We are using some of those, I believe.
>>: So those are unmodified applications? So they already have large RAM caches.
So [inaudible].
>>Mike Swift: Right. That is totally true. And, you know, I don’t think using a flash
cache would necessarily change the fact that they’re using a large RAM cache,
because there would still be two orders of magnitude better performance from
using a RAM cache instead. But it is true that if we did reduce the amount of caching
in those applications we would see a different trade-off here. Definitely.
>>: [inaudible] applications have to use RAM cache because their only other
alternative is hard disk. But with the flash cache maybe they don’t need that big of a RAM cache.
>>Mike Swift: Yeah. I think that’s a great point, but it’s not something we’ve
investigated yet. But I do think the -- we looked at a work on using flash for virtual
memory and we found that when you’re swapping to flash instead of disk you could
dramatically reduce your memory usage, in some cases by 80% for that exact reason
because the cost of a page fault was so much less that you could basically swap a lot
of your data out instead of keeping it in a DRAM.
And I think definitely the same thing would apply here.
>>: One more question. [inaudible] from flash to hard disk, did you try to take advantage of a sequential [inaudible] to hard disk in terms of choosing which blocks to evict?
>>Mike Swift: So we do not at this point. So basically we just do whatever Facebook’s flash cache software does to choose which blocks to evict. And so my guess is
they already tried to keep it sequential but we didn’t modify that. In our
performance model we basically assume that all disk accesses take half a
millisecond. And so it’s a very simple performance model.
We’re not sort of looking at the locality of the disk right here. But that would
definitely be an extension. Okay. So the second thing we looked at is endurance.
Does silent eviction help with endurance because we’re not doing as much garbage
collection? So here we show, what we measure is the lifetime, in years, of the
device. So this is a device sized to hold 25% of the unique blocks.
We look at the duration of the trace to calculate the frequency of writes and then therefore the frequency of erases, and we figure out at this level of workload how long would the device work, assuming that you could do 10,000 overwrites per bit or so in flash. And so what we see is that there is a reasonably good improvement in
lifetime here because of silent eviction because we’re not doing garbage collection.
And in the mail case I think it goes from a place where really one and a half years for
a device is probably too short from a management perspective to want to use
caching up to 2.8 years where maybe it would be worthwhile because you wouldn’t
be replacing the device.
And I will note that this is a very small workload and so you take it with a grain of
salt. In the case of read caching we see that here the first most important thing is, because this is a really big workload, endurance isn’t a problem. There’s just a lot of bits to spread writes over and there’s not that many writes, so endurance is very high,
200 years or something like that, and the endurance does go down a little bit.
And the reason for that is that, because we’re evicting data that gets [inaudible] referenced, we do have to read in data and do extra writes we wouldn’t have had to do
otherwise. But because it’s so rare relative to the size of the device it doesn’t impact
lifetime that much.
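The lifetime numbers come from straightforward arithmetic; a back-of-the-envelope version, with made-up inputs, looks like this (real traces also add write amplification from garbage collection, which silent eviction reduces):

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative inputs only, not the values from the traces. */
    double capacity_bytes   = 10e9;       /* 10 GB cache */
    double endurance_cycles = 10000.0;    /* assumed overwrites per cell */
    double write_rate_bps   = 5e6;        /* 5 MB/s of writes in the trace */

    /* Total bytes the device can absorb if wear is spread evenly. */
    double write_budget     = capacity_bytes * endurance_cycles;
    double lifetime_seconds = write_budget / write_rate_bps;
    double lifetime_years   = lifetime_seconds / (365.0 * 24 * 3600);

    printf("estimated lifetime: %.1f years\n", lifetime_years);
    return 0;
}
```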
>>: [inaudible] --
>>Mike Swift: Yes. So we do garbage collection sometimes. But if we go back here,
this performance drop here is because of the hit ratio. So we’re losing about 2% of
our performance because we’re not hitting in the cache as much. And I think if we
had an eviction policy that was more sensitive to usage that actually used recency of
access instead of random, we would improve this.
And so the final thing we look at is memory usage. And here for the write-intensive
workloads we have in blue the amount of memory on the device used to hold the
translation table, and then the amount of memory used in the host to hold whatever
data it needs.
And here we only look at write back, because with write through the cache manager
didn’t store anything, so the red bar would be zero. So the ratio of these things is
the same across all workloads. So we can just take one for example, and what you
can see is that with the normal SSD there’s a small amount of data in the device; this is very compact, it’s basically a page table containing every block on the device. But
the host has a lot of memory because it has to store a hash table that keeps track of
every block stored there.
In the SSC case, the device memory goes up because we’ve gone from a page table
structure to a hash table, which is slightly less dense, and so we increase memory
usage in the device by about 10%, but the data usage on the host has gone down by
80% because we’re only keeping track of the dirty blocks and not all the clean
blocks also. And that holds across all the different workloads. Andrew.
>>: So if I’m going to build one of these things how do I decide how much memory
to put in it? If I’m going to buy one do I have to trade off [inaudible]?
>>Mike Swift: So what you would say is when you build the device you know how
many blocks it has physically, and you need a hash table that can hold mapping for
that many blocks, and that’s how you would size the memory on the device. So in
the host, similarly in the host you’d want to reserve enough memory for whatever
your fraction of dirty blocks is. So remember --
>>: [inaudible]
>>Mike Swift: So the reason is we size the cache differently for each workload
according to the size of the workload. So this is sort of a one-gigabyte cache and a
five-gigabyte cache and a 100-gigabyte cache or something like that. So that’s why
the amount of memory you need is different, because the number of blocks you’re
caching is different for each workload.
That’s a good question. Okay. So as I mentioned, this was all done in simulation.
This is the board that we’re currently using. It’s the OpenSSD prototyping board, and it has this Indilinx [inaudible] SSD controller. It has some memory, some flash,
and you can basically write your own FTL to go into here. It comes with three FTLs
and it performs relatively well, so I think it provides useful results. And so my
student has implemented the stuff, and he says that it’ll all work, but he hasn’t
actually tried it yet. So we’ll see what happens.
But it’s nice because I think until this year it was very difficult to actually build an
FTL and do anything. I know Microsoft Research had a big effort to build something on the BEE board, the BEE3 board or something, and this is a lot easier to just use a
commercial controller. And I think it’s probably more representative. It’s limited to
what you can do because everything is sort of commodity parts, but at least for FTL research this works pretty well.
Okay. So in summary, the takeaway of this research is that sort of making flash look like a
disk is great for compatibility, great for getting devices out there, but there’s a lot of
neat things about flash that get hidden when you do this. And one thing we can look
at is, are there better interfaces? So we’ve looked at one interface, which is
really designed for caching, which makes it easier to write a cache and perform
better, in some cases reduces memory consumption.
And, you know, I think there’s other opportunity to look at other interfaces to flash sort of tailored to specific applications, or even an interface to flash that sort of
subsumes this and handles other applications as well. Yes?
>>: [inaudible] do you use the non-volatile property to do fast recovery, like an instant-on cache?
>>Mike Swift: Yes, we do. So we have a write back and a write through. So in all
cases actually the cache is non-volatile and so as soon as you turn it on everything is
accessible in cache. And I have a separate slide, I don’t know if I have it with me,
that shows the recovery time is half a second or something like that, whereas the
Facebook flash cache, which had a non-volatile option, took about 30 seconds to a
minute to recover, because it had to reload all the metadata into memory.
So we do have that capability. Yes?
>>: Earlier you talked about the limitations of flash because you plan to leave 20 to
30% empty.
>>Mike Swift: To get high endurance.
>>: To get high endurance, yes. So have you done experiments where you could, or
do you with this system you could keep the flash 90% full and not take the big
[inaudible] hit, or is that just fundamental?
>>Mike Swift: So I think it depends a lot on the workload. But what we can do, I
think is more interesting, is we no longer have a fixed capacity device. We can
decide for a workload how much of it can we actually use for live data. If we have a
very static read-intensive workload where we’re not doing any writes, we could use
100% of the capacity because we don’t need to reserve anything for incoming
writes.
So there’s one copy of every block. So you can use all of it. If we have a very write-intensive workload, what we want to do is absorb writes as quickly as possible, like
a write buffer, and coalesce things that are overwritten. And then we might use only
30% of the device or 50% of the device. So really what we want to do is size the
device to optimize performance, rather than hit rate or something like that.
>>: Facebook’s cache they do the same thing. They can use 100% [inaudible] right?
>>Mike Swift: Well, the device internally actually reserves 7% for logging purposes
and things like that. Yeah. So but you can definitely statically, I mean Intel
recommends if you’re doing a cache you should statically reserve, create a partition
that is 30% of your device and not use it more-or-less, to get optimal performance.
And this would allow you to dynamically react to a workload.
Okay. So I think I have about five or 10 more minutes or so. So I’d like to go on. I
will try to not go through this fast, but I’ll try to cover the highlights in the time that I
have remaining. So the second piece is looking at what do we do with storage class
memory. It’s a really great technology that can be the dream of system architects of
having persistent memory that’s really fast, perhaps. And the question is how
should we expose it to applications?
So if we look at what we did with disks. With disk we typically have a pretty deep
stack of software that we have to go through before we access the disk. And if you
think about it, there’s actually a really good reason for this. So at the bottom we
have devices that have no protection mechanism. So the disk can’t tell which
processes are allowed to access which blocks.
So we need to go through the operating system, which actually implements
protection through file system [inaudible]. Similarly, we use DMA to access it. DMA
operates on physical addresses. You don’t want user mode code doing DMA. It’s not
protected. Again, we need the kernel to handle the data transfer.
We have a device with incredibly variable latency, and so having a global scheduler
that can reorder things can really optimize performance and get you ten times
better performance if you do things the right way. Similarly, because it’s very slow
having a global cache where we cache commonly used data across processes makes
a big difference.
So for disks this is a great architecture and I think it’s worked really well over a lot
of years. If we look at using storage class memory, we can see that things change a
little bit. So at the bottom, if we map it as memory, we have hardware protection.
So we can use virtual memory protection bits. So we no longer need to go into the
kernel to implement protection.
Because we’re accessing things directly with load and store instructions, we don’t
need a device driver that mediates all access and so we can get rid of the need for
sort of a global device driver. Because we have more-or-less constant latency, we
probably don’t need an IO scheduler to reorder requests, and so we don’t need
another piece of global thing.
And then because the latency could be as good as DRAM, we may not need as much
sort of shared caching across processes because it’s just not that expensive to go
fetch things directly from SCM. And so what this means is that we think that all
these things that said file systems have to go into the kernel to access data may not
be true with storage class memory anymore.
And so what we set out to see was could we build a way that you could expose SCM
directly to applications as memory using memory protection to control access? So
to address this we built a system called Mnemosyne. The student who
did this is Greek, and so we were told to not use his name because it was too hard to
pronounce. But it turns out that the Latin word for Mnemosyne is Moneta. And
Steve Swanson’s group had a -- their PCM prototype disk was called Moneta. So that
was already taken.
Clearly we were consulting the same source of names.
>>: There's another [inaudible] with this name from STI --
>>Mike Swift: Okay. Yeah, right.
>>: [inaudible]
>>Mike Swift: Right, right. Well, Mnemosyne is the personification of memory,
which is a great -- we tried Indian words also but we couldn’t find one we were
comfortable with. Anyway, so the goal here is that first we have a persistent region
in your address space where you can put data that survives crashes. And then we
also have a safety mechanism so that if you’re doing an update in that region of
memory and you crash, you don't get corruption -- kind of like file systems do
journaling and shadow updates to prevent corruption.
So the basic system model we have is the picture I showed before. We’ve got DRAM
and SCM both attached to your system. In your address space we now have regions
that are volatile, regions that are non-volatile. You can put data structures in them.
If there’s a crash and something bad happens the volatile data goes away, but your
non-volatile data survives. So this is the basic mechanism we want to give
programmers.
So one of the key challenges here is: what if you're in the middle of doing an update,
like inserting into a linked list, and you've got your update partially done and
something crashes before you finish fixing up the data structure? When you restart
the system you'll find that your data structure is inconsistent, and you either need
to walk your data structure to figure this out, or you may suffer from incorrect data. So this is
what we would like to help programmers with as part of this.
Yes?
>>: Is it still slower than DRAM?
>>Mike Swift: It is a little bit slower than DRAM, and it's getting slower relative to
DRAM every year -- DRAM is getting faster more quickly than it is, I guess.
>>: [inaudible]
>>Mike Swift: Yes. So actually I talked -- I agree there are definitely some system
integration issues. I've heard that processor pipelines do really poorly with long,
variable-latency events. So if you have a write that takes a microsecond, you know,
your pipeline is not built to have things that are outstanding for a thousand or two
thousand cycles. So integrating this into the system is not -- you know, when we
started this work PCM looked a lot more promising, and the speeds were going to
be half the performance of DRAM instead of a quarter of the performance.
And so it made more sense. But we still -- you know, the read performance I think
still looks pretty promising. It’s the write performance that really would be an issue.
And the way we handle writes, we actually sort of encapsulate the writes separately,
so we can actually do the writes off the processor pipeline. I'll show you how we do
that so they don't affect other instructions.
But I agree that it’s definitely an issue. What?
>>: [inaudible]
>>Mike Swift: Um --
>>: [inaudible]
>>Mike Swift: Is there what?
>>: The reads are comparable?
>>Mike Swift: The reads are projected to be a lot closer to DRAM reads; the writes
are projected to be much slower by comparison. Currently the projections for
writes are around a microsecond and reads are around 200 nanoseconds, I think.
Ed could probably tell you more about it if you have questions.
Okay. So our goal is basically to make this consistent. So the first thing we have is a
programming abstraction, which is basically a persistent region where you can
declare global variables to be pstatic variables, which means they're like static
variables that survive multiple calls into a function -- these variables survive
multiple invocations of the program.
And this is really designed for sort of single instance programs like Internet
Explorer or Word, where you’re only allowed to run one copy of it at a time. And if
you try to start a second copy it will sort of kill itself and redirect you to the first
copy.
So this is a pretty standard model for applications that manage their own persistent
state to begin with. We then also have a dynamic way to create this using pmalloc,
where you can basically do heap allocation. And our goal here is that you can
allocate some data with pmalloc, hook it into some kind of data structure, and then
the pstatic global variables are how you name this data and find it again after a
reboot.
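To make the pstatic/pmalloc model concrete, here is a minimal sketch that emulates a persistent region with a memory-mapped file. It is only an illustration of the idea described above: the file backing, the struct layout, and the names are assumptions, not the real Mnemosyne implementation, which maps SCM and provides pstatic/pmalloc through the compiler and runtime.

```c
/* Minimal sketch: emulating a "persistent region" with a memory-mapped file.
 * Illustration of the model only; not the Mnemosyne API. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE 4096

struct proot {              /* plays the role of a pstatic root */
    long run_count;         /* survives restarts of the program */
};

int main(void) {
    int fd = open("pregion.img", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, REGION_SIZE) < 0)
        return 1;

    /* Map the region; with real SCM the kernel would map the device itself. */
    struct proot *root = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    if (root == MAP_FAILED)
        return 1;

    root->run_count++;                    /* update persistent state           */
    msync(root, sizeof *root, MS_SYNC);   /* file-backed stand-in for a flush  */
    printf("this is run number %ld\n", root->run_count);

    munmap(root, REGION_SIZE);
    close(fd);
    return 0;
}
```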
So that's sort of your naming: this is how you name persistent data and get access to it
again. So the key challenge here is how do you update things without risking
corruption after a failure? And so here is an example of the kind of thing that can
happen. Suppose you have a data structure that has two fields, a value field and a
valid field, and the invariant is that the new value shouldn't be used unless the valid
bit is set.
So suppose you first set the new value. The old valid bit was zero. So we know this,
for example, is invalid data. We then set the valid bit. So now in the cache we see
we have a new value and it’s valid. The challenge we have is that the data only
survives a failure if it actually gets all the way back to SCM. If we crash and the data
is in the cache, it gets lost. So this is a lot like your buffer cache of disk pages: if
they're in memory you don't get to keep them; if you write them to disk, you're
good. The challenge here, though, is that we don't get to control when things get
written back. The cache itself decides when things get written back, and it can write
them back in any order. So the processor cache may evict the valid bit before it
evicts the value. When we have a crash failure now, what we see is that we have
inconsistent memory in SCM and our invariant is broken. Andrew?
>>: This sounds a lot like [inaudible] processor consistency.
>>Mike Swift: It is.
>>: Can you have things like [inaudible] --
>>Mike Swift: Yes, you can. It's the same basic ordering problem that you have to
solve, except instead of writes being visible to other processors, it's writes being
made durable out in memory. Exactly. So the basic thing that we need is a way to
order writes, because ordering writes allows you to commit a change. If you have a
log, you can say the log record is done; if you have a transaction, you can say the
transaction is committed; with a shadow update, you can say here's the new pointer.
So we need a way to sort of order writes. And our goal is to do this as much as
possible without modifying the processor architecture. There are things you could do;
for example, there's some work that I think Ed was part of that looked at whether we
can introduce epochs into the writes in the cache, which say this is the order in which
things must be evicted to preserve the ordering. Our approach instead is to say, well,
fundamentally what we need is a flush operation, which says force something out to
PCM, and then a fence operation, which says wait for everything that's been forced
out to memory to actually complete.
So it's a lot like asynchronous IO, where you're doing an fsync and then you're
waiting for it to complete. And with this we can sort of order any pair of operations
and make sure that they actually become persistent in the right order. So if we do
this, we can make sure that we force the value out before we actually force the
valid bit out.
So if you're not familiar with these instructions, movntq is a non-temporal store.
It's meant for streaming large amounts of data out, and it isn't ordered with respect
to normal stores, so it doesn't obey TSO or sequential consistency. And so it works
better, I believe, for long-latency operations. It's also write-combining, so it's the
best way to push a lot of data out to memory. And then clflush just flushes a cache
line; this is useful if you're updating a data structure in the cache and then you want
to sort of commit it at the end. The non-temporal store is good if you know that
you're writing a log and you want to force your log out, and you're not writing it in
pieces or something like that.
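As a rough sketch of how the flush-and-fence pattern orders the value/valid example from earlier, here is what it might look like with x86 intrinsics. The struct layout and the publish function are invented for illustration, and this is not the Mnemosyne API: clflush plus mfence is one way to get the ordering, and a non-temporal store could replace the plain store for streaming writes.

```c
/* Sketch: ordering two persistent updates with flush + fence, so SCM never
 * sees valid == 1 without the new value. Requires an x86 compiler with
 * SSE2 intrinsics. */
#include <emmintrin.h>      /* _mm_clflush, _mm_mfence */
#include <stdint.h>

struct record {
    uint64_t value;
    char     pad[56];       /* keep 'valid' on its own cache line */
    uint64_t valid;
} __attribute__((aligned(64)));

void publish(struct record *r, uint64_t v) {
    r->value = v;
    _mm_clflush(&r->value); /* push the value's cache line toward SCM     */
    _mm_mfence();           /* fence: wait for the flush to be ordered    */

    r->valid = 1;           /* only now may the record appear valid       */
    _mm_clflush(&r->valid);
    _mm_mfence();
}
```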
Okay. So we have these primitives that allow us to do ordering. And on top of this --
we think this is useful, but it's not what programmers want to use -- the first thing
we built was a set of data structures. Basically we have a persistent heap, used for
pmalloc, that allows you to allocate and free memory and keeps track persistently of
what's been allocated and what's been freed. We also have a log structure here that
handles things like torn writes.
So if you're in the middle of writing to the log and you crash, we can tell, at very high
speed, which log records are complete and which ones are not. And then on top of
this we have a general-purpose transaction mechanism, where you can put arbitrary
code within a transaction, and whatever persistent memory is referenced within that
code will actually be made atomic, consistent, and durable if the transaction
commits.
So I'll show you a little bit about how this works. Here we actually just reuse
existing software transactional memory technology. We use the Intel software
transactional memory compiler, so you can put an atomic keyword into your program;
here we can put our two updates in there. The compiler will generate instrumented
code with calls out to a transactional runtime system to begin and commit the
transaction, and then every store will also be instrumented with the fact that this data
was updated and what value was written out there.
And there's a runtime mechanism that implements a write-ahead redo log. So
everything that gets stored is written to a redo log, and at commit we force the log
out. Then we can lazily force out the data itself, because we have this log around.
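Here is a very simplified sketch of that write-ahead redo-log idea: instrumented stores append to a log, and commit forces the log out before the in-place data needs to be durable. The real runtime under the Intel STM compiler also handles commit records, torn-write detection, and recovery, which are omitted here; all names below are illustrative.

```c
/* Simplified sketch of a write-ahead redo log for persistent memory.
 * Illustrative only: commit records and recovery are omitted; the point is
 * the ordering of log vs. data. */
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

struct log_entry { uint64_t *addr; uint64_t val; };

static struct log_entry redo_log[1024];   /* would itself live in SCM */
static int log_len;

static void flush_range(const void *p, size_t n) {
    for (size_t off = 0; off < n; off += 64)     /* 64-byte cache lines */
        _mm_clflush((const char *)p + off);
    _mm_mfence();
}

/* Instrumented store: apply the update in the cache and record it in the log,
 * rather than relying on the cache to write things back in any useful order. */
void tx_store(uint64_t *addr, uint64_t val) {
    *addr = val;
    redo_log[log_len++] = (struct log_entry){ addr, val };
}

/* Commit: force the log out first -- that is the durability point -- and then
 * the cached data can be flushed (here eagerly, in principle lazily), since
 * the log can replay it after a crash. */
void tx_commit(void) {
    flush_range(redo_log, (size_t)log_len * sizeof redo_log[0]);
    for (int i = 0; i < log_len; i++)
        _mm_clflush(redo_log[i].addr);
    _mm_mfence();
    log_len = 0;                              /* truncate the log */
}
```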
So by doing this we get sort of full ACID transactions for memory. And the nice thing
is also that, because we're using transactional memory -- there's a standard problem
in persistent memory: what if you have a lock in persistent memory? You write out
the fact that something was locked and then you crash. How do you figure out how
to clear all those lock bits afterwards?
So the recoverable virtual memory work from [inaudible] had a special mechanism to fix all
this. But transactions avoid the need to actually have any sort of state about what’s
locked within a data structure itself. So here is sort of a high-level picture of how
you might use this to store a hash table. We have a pstatic variable that we can
use pmalloc to allocate space for. One thing to note is that pmalloc takes the address
you're allocating memory into as an out parameter. And this is important
because there's a race condition where you might allocate memory and then crash
before you assign it to your pstatic variable.
Making it an out parameter means that internally the allocator uses a little
transaction to allocate the memory and atomically assign it to the pointer, so we
can guarantee you won't leak data if you crash in the middle of this. To insert into
the hash table you basically begin a transaction, allocate some data, put in the key
and value, and insert it into the hash table; and if you crash at any point within this,
the transaction system will sort of wipe away everything and make sure that when
you restart, none of these changes show up, because you never committed the
transaction.
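Sketched below is that usage pattern: a transaction around an allocation and a hash table insert, with a pmalloc-style out parameter. The syntax uses GCC's __transaction_atomic (compile with -fgnu-tm) as a stand-in for the Intel STM compiler's atomic keyword, and the toy pmalloc, arena, and structure names are all invented for illustration; nothing here is actually persistent.

```c
/* Sketch of the insert pattern described above. Build with: gcc -fgnu-tm -c
 * __transaction_atomic stands in for the Intel STM "atomic" keyword. */
#include <stddef.h>
#include <stdint.h>

struct entry  { uint64_t key, value; struct entry *next; };
struct ptable { struct entry *buckets[256]; };

static struct ptable table;          /* would be a pstatic root in Mnemosyne */

static char   arena[1 << 16];        /* toy arena standing in for SCM        */
static size_t arena_used;

/* Out-parameter form: allocation and pointer assignment happen inside the
 * transaction, so a crash in between cannot leak persistent memory. */
__attribute__((transaction_safe))
static void pmalloc(size_t size, void **out) {
    *out = &arena[arena_used];
    arena_used += (size + 7) & ~(size_t)7;   /* 8-byte align; no overflow check */
}

void table_insert(uint64_t key, uint64_t value) {
    __transaction_atomic {
        struct entry *e;
        pmalloc(sizeof *e, (void **)&e);
        e->key   = key;
        e->value = value;
        e->next  = table.buckets[key & 255];
        table.buckets[key & 255] = e;
    }   /* if we crash anywhere above, none of these updates survive */
}
```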
So that's kind of the big picture of what we would like to enable. Our
implementation is pretty much what you might expect. So, evaluation: we did not
have any SCM or PCM available to do this, so we have a very simple performance
model. We're really focused on the latency of writes -- does this make it faster to
write data?
And so, because all the writes are going through a transaction mechanism, or
through our primitives, or our data structures, we basically instrument all of the
writes to SCM with a delay, a [inaudible] delay. And we default to a very aggressive
150 nanoseconds of added latency on top of accessing DRAM. And then for
comparison we compare against running ext2 on top of a RAM disk.
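For reference, here is one way that write-delay model might be emulated in software: after each write reaches the DRAM that stands in for SCM, charge a fixed extra latency. The 150 ns constant mirrors the default described above, but the spin-wait mechanism and function names are assumptions for illustration.

```c
/* Sketch of emulating slower SCM writes on DRAM: busy-wait for a fixed extra
 * latency after each write is flushed. Illustrative, not the real harness. */
#include <stdint.h>
#include <time.h>

#define SCM_WRITE_DELAY_NS 150

static inline uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Call after flushing a cache line to the emulated SCM region. */
static inline void scm_write_delay(void) {
    uint64_t start = now_ns();
    while (now_ns() - start < SCM_WRITE_DELAY_NS)
        ;   /* busy-wait: sleeping is far too coarse at this time scale */
}
```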
ext2 is pretty lightweight, so it performs pretty well in this configuration. And also,
here we're largely looking at data within a single file and not metadata operations,
so it's really just moving data -- sort of doing memcopies back and forth. It doesn't
really stress the metadata, which is where a better file system might have an impact.
So our microbenchmark is just looking at a hash table. This is a comparison of a
hash table my student downloaded off the Internet and made persistent using
transactions, against Berkeley DB, which is the thing that everybody in the open
source community uses for comparison. What we compare here is Berkeley DB at
three different granularities of data -- an 8-byte value, a 256-byte value, and a 2 KB
value -- and three different latencies for PCM: 150 nanoseconds, 1 microsecond, and
2 microseconds.
And what you can see overall is that for small data structures or small updates and
for fast memory we do really well. And it’s because we really get very low overhead
access to memory. As we get to longer latencies, all the optimizations that Berkeley
DB does for disks, like sequential writes, actually help, because PCM looks a lot more
like a really fast disk at this point.
Similarly, for small data structures, the transaction mechanism is pretty fast. We’re
copying small amounts of memory. But for larger data structures, sort of copying
things a byte at a time through the software transactional memory is less efficient
than doing big memcopies like Berkeley DB does. And so we don't perform as well in
that large case. So we're really optimized for sort of fast memory and very fine-grain
data structures.
So I will just sort of show the highlights of the application workloads, and the net
result is that for applications that are really storage bound we can be a lot faster.
This is a program that uses msync to force an in-memory data structure out to disk;
we replace the msync calls with transactions and it's a whole lot faster.
This is a lightweight directory server, kind of like Active Directory. And here we
have some performance benefit, but not as much, mostly because the disk -- the PCM
disk -- is so fast that there's not that much to speed up in this case. But it still does
make a difference.
So any questions on Mnemosyne?
Okay. So I want to say a little about where we’re going with this. So Mnemosyne is
really focused on a single application with a single or a couple regions of memory.
And we want to see can we extend this to the whole file system? So Ed did some
great work building a kernel mode file system optimized for storage class memory.
And we want to say, well, if we can take storage class memory and map it into your
address space, can you actually sort of access the file system without calling into the
kernel? Can you just sort of have your user mode programs directly read file system
metadata out of memory and find data on their own?
So we’d like to be able to support sharing, where you have multiple programs
sharing data, perhaps at the same time. We want protection so we can control who
can do this. And we want to rely on memory protection to implement this. And we
also want naming where you can have a normal hierarchical name space to actually
find things.
And the benefits we see are, first, performance, because we can avoid calling into the
kernel, which has some overhead; and second, flexibility, where it could hopefully be
a lot easier to design an application-specific file format, because you could
potentially just be writing code in user mode, where hopefully it's an easier
environment. You can sort of customize the interface or layout for your application.
So we’re working on a prototype and this is what it looks like. In the kernel we have
what we call the SM manager. And all this does really is allocate memory, pages of
memory or regions of memory. It handles protection so that it keeps track of which
processes can access which data. And it handles addressing by mapping this data
into user mode processes' address spaces. And that's all it does.
There's nothing specific to a file system here. We then have a trusted server, which
implements synchronization via sort of a lease manager. This is a lot like [inaudible]
if you’re aware of it. Yes?
>>: [inaudible]
>>Mike Swift: Yes. I’ll get to that in a second. So we have a lease manager that
implements sharing. So this is a distributed lock manager, pretty straightforward.
We have a trusted server here, which handles metadata updates. And this is the
only thing we really centralize because we can’t trust individual applications to
correctly write metadata because they could corrupt the metadata and affect
everybody else.
So we centralize those things. The applications, though, have a shared name space
here, and they access it by basically figuring out which page of memory holds the
superblock, more or less, and then reading and parsing the file system metadata
structures right out of memory and finding whatever data they want.
And so far we have no caching whatsoever here; every time you open a file, we go
and more or less reread memory to figure out where to find it. And the nice thing is
that, basically, if you're doing a read-only workload, the application can run
completely within its address space without any communication at all, except to
acquire some locks initially to prevent anybody from modifying that data.
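To illustrate that user-mode access path, here is a rough sketch of an application parsing file system metadata straight out of a mapped SCM region, starting at the superblock. Every structure, field, and function name here is invented for illustration; the prototype's actual on-SCM layout is certainly different.

```c
/* Sketch: user-mode lookup in a mapped SCM file system region, no syscalls.
 * Assumes the process already holds a read lease on this part of the
 * namespace; layouts and names are invented. */
#include <stdint.h>
#include <string.h>

struct superblock { uint64_t magic; uint64_t root_dir_off; };
struct scm_dirent { char name[56]; uint64_t inode_off; };  /* empty name ends list */

/* 'region' is the base of the SCM mapping handed out by the kernel manager. */
const void *scm_lookup(const uint8_t *region, const char *name) {
    const struct superblock *sb = (const struct superblock *)region;
    const struct scm_dirent *d  =
        (const struct scm_dirent *)(region + sb->root_dir_off);

    for (; d->name[0] != '\0'; d++)        /* walk the root directory in place */
        if (strcmp(d->name, name) == 0)
            return region + d->inode_off;  /* pointer to the file's metadata   */
    return NULL;                           /* not found                        */
}
```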
So it has the potential to be a lot faster and to reduce communication in a
multicore system. So we have some preliminary results: using Filebench and a
file server profile, we're about 80% faster than calling into a kernel mode file system.
We also implemented a customized key-value file system, where the interface,
instead of open, read, write, close, is really just get and put.
So we're bypassing creating any transient data structures. And we also support
fixed-size files -- suppose you're storing image thumbnails or something like that.
And here we're about 400% faster on a web proxy profile. So this is very
preliminary, and of course we're not doing really complicated access controls like
[inaudible]. But it kind of gives us a sense that there is something to be had here.
So in conclusion, I think these new storage devices really have a lot of new
capabilities, and making them look just like block devices conceals a lot of their
interesting features. The goal of my work is really to figure out whether we can find
new ways of accessing those devices, or making them available, where programs can
actually get sort of direct access to their features and benefits.
So with that I will stop and take any remaining questions. And thank you all for
coming.
[applause]
Anything left?
>>: It appears not.
>>Mike Swift: Okay. Great. Well, thank you all.