>> Jeremy Elson: Welcome, everyone, thanks for coming. It's my pleasure to
introduce Anirudh Badam, who is here today from Princeton. He'll be telling us
about how to use SSDs and other sundry devices to bridge the gap between memory
and storage. And just in case you don't find that interesting enough, when you
get a chance to talk to him, he also enjoys rock climbing and reading books
about modern physics. So feel free to question him ruthlessly on those topics
as well. Thanks a lot. Go ahead.
>> Anirudh Badam: Thanks for the introduction, Jeremy. I'm Anirudh Badam.
I'm from Princeton University, and I'm going to be talking about bridging the
gap between memory and storage today.
So the motivation behind the talk is that over the last two or three decades, the capacity and performance gaps between memory and storage have been widening. These gaps are not only in terms of random IOPS, but also in terms of capacity.
So if you look at state-of-the-art memory units today, they can do anywhere between a million and 1.5 million IOPS. And if you look at state-of-the-art disks, they can do between 100 and 300 seeks per second. There are some that do slightly higher as well.
And with respect to capacity, you can probably get somewhere between 200 and 300 gigs of DRAM on a single-server basis today, and that's about the maximum you can get. I mean, there are capacity constraints for really high-density DRAM. Essentially, not only does the logic board need to be capable enough to support high-density DRAM, but high-density DRAM is also expensive and does not scale beyond a certain point at any given time.
Eight terabytes of disk per rack unit is probably conservative; you can get as much as maybe 20 terabytes of magnetic hard drives into a single rack unit today.
So the big question is, how do you bridge these gaps? I mean, for applications that have, you know, an IOPS requirement somewhere between these two but a capacity requirement also between these two, how do you bridge these performance and capacity gaps?
So this talk focuses on helping network appliance caches bridge this gap: network appliance caches like web proxy caches, such as HTTP caches, which cache static objects; WAN accelerators, which have the ability to cache dynamic content based on content fingerprinting; and inline data deduplicators, which essentially try to index all the data that you're trying to deduplicate and do inline data deduplication before you store things on the disk.
And also file system and database caches. This talk is essentially trying to help these systems scale up by using software and hardware techniques. So the first technique that I'm going to talk about today is called HashCache. This is from NSDI 2009. It's about reorganizing your data structures such that you can bridge the capacity gap between DRAM and disk.
The second part is going to be about using alternative memory technologies, which have neither the performance bottlenecks nor the capacity bottlenecks of these two technologies and fall somewhere in between, and about finding the right way of using these technologies to bridge these gaps for network appliances. This system is called SSDAlloc and was published at NSDI 2011.
So before delving into the talk, I'd like to give a brief introduction of how
caching works, because that's the system that we're going to focus on. So
let's say there are three systems that are trying to talk to a server, and these three systems are essentially interested in some content, that red block over there, and these three clients are each fetching the content from the server individually. That sort of increases the network traffic if these clients are co-located, the server they are talking to is the same one, and the content is the same one.
So the way in which you solve this problem is to introduce a box that is nearer to the clients, a smaller one, and this box essentially downloads the content that all three clients want, and the clients individually fetch the content from the box that is nearby. So this decreases the traffic on the wide area link and also decreases the latency for the clients to access the content.
So here's an architectural overview of how people view caches. There's the cache application logic that multiplexes between client connections and server connections, fetching content from the server and passing it on to the client. And the storage is essentially architected in two different parts. One is the cache index and the other is the block cache, which is the data cache that caches as much data as possible.
And the cache index is used essentially to serve membership queries. So if a particular object belongs to the cache, that is, it's currently cached, the system fetches it from the cache and serves the query. And if it's not currently in the cache, it fetches it from the network, stores it locally on the storage and then continues the operation from there on.
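To make that flow concrete, here is a minimal C sketch of the hit/miss path just described; the helper names and types are illustrative assumptions, not part of any particular system from the talk.

    #include <stddef.h>
    #include <stdbool.h>

    typedef struct { char *bytes; size_t len; } object_t;

    /* Assumed helpers provided by the cache implementation. */
    bool     index_contains(const char *url);            /* membership query      */
    object_t storage_read(const char *url);              /* read cached object    */
    object_t origin_fetch(const char *url);              /* fetch from the server */
    void     storage_write(const char *url, object_t o);
    void     index_insert(const char *url);

    object_t cache_lookup(const char *url)
    {
        if (index_contains(url)) {
            /* Hit: served from local storage, no wide-area traffic. */
            return storage_read(url);
        }
        /* Miss: fetch over the network, then cache for later clients. */
        object_t o = origin_fetch(url);
        storage_write(url, o);
        index_insert(url);
        return o;
    }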
Now, this was a problem that was well studied during the 1995-2000 period. So why reopen it, right? The reason this problem is important again is that disk capacity is outpacing memory capacity, as I said at the beginning of the talk. This means there's a problem of the cache index overflowing onto the disk itself and reducing the performance of these systems.
So I'll talk in more detail about what these trends are with respect to disk capacity increasing beyond memory capacity, and point out how this actually leads to problems for cache indexes.
So here's how different disk properties have changed over the last 32 years. The largest disk you could obtain in 1980 was 80 megabytes in size, and today you can get a single three-terabyte hard drive. That's about 100,000 times better, and it has roughly doubled every two years if you look at the last 32 years.
The seek latency, unfortunately, only improved by 2.5 times, and the seek latency literally translates into the number of seeks that you can do. So the IOPS haven't gotten much better, but the capacity has certainly gotten better for disks.
So this rapid increase in capacity means that larger indexes are needed to index all these disks, and the bad seek latency means the index cannot overflow onto the disk, because if the index overflows onto the disk, you're essentially reducing the IOPS, the performance of the system itself, because now you will be using one seek for the cache lookup and also one for serving the actual request.
The obvious question is, why do you need these big caches? I mean, why do you need a three-terabyte cache, right? So if you look at how data has been growing, there's this survey done by Cisco last year among customers who were using [indiscernible], and they found that raw unstructured data, like email, documents and photos, has been increasing by 40 to 60 percent every year. This means that the number of objects that need to be stored, that need to be cached, is also increasing every year.
And data is also being accessed more. There's this new informal law,
Zuckerberg's law, which states that the number of pictures being viewed on
Facebook is doubling every 18 months, which means that people are not just
creating this data, but they're also accessing more of this data more
frequently.
So it's not just that they're putting it on disk and never accessing it. Performance is actually important for these systems, so there's a reason why index operations must be fast. You need to be able to answer membership queries in the system really fast.
And the third trend, in terms of how DRAM price limitations or DRAM capacity limitations are causing indexing problems, is the following. This is data from 2011. If you look at the price of a server as you vary the amount of DRAM in the server, the cheapest server that you can get with 64 gigabytes of DRAM is probably not more than a thousand bucks. But the cheapest server that you can get as a single system image with two terabytes of DRAM actually costs about a quarter million dollars. And the increase is actually super-linear. And not just in terms of initial cost, but also in terms of cost of ownership, because high-density DRAM actually increases [indiscernible] costs super-linearly, and the logic boards necessary for having higher-density DRAM on your system are also more expensive on a single system image basis.
So given all these bottlenecks, the question is, can we improve indexing efficiency? The data itself is increasing, and we also have the disk seek problem.
So the first system, as I said, is going to be a purely DRAM-and-disk system, which tries to address these growing gaps by reducing the size of the indexes themselves, so that with a limited amount of DRAM, just a few kilobytes or megabytes, you can index terabytes of hard drive.
So these are the high-level goals for HashCache. Interface-wise, it looks like a simple key-value cache. Your keys and values can be of varying lengths, and it also needs to have a good cache replacement policy, as any cache does. The better the replacement policy, the better the hit rate, which determines what the performance is going to be.
So those are the requirements from an interface perspective. From the perspective of solving the problem I'm talking about, we need to reduce indexing overhead, as I said. We need to answer the question of, is it actually possible to index terabytes of storage with a few kilobytes or megabytes of DRAM, and is it also possible to trade memory for speed, meaning that the more memory you add to the index, the higher the performance should be? Are such indexes possible? That is the question HashCache asks.
So before delving into what HashCache actually is, I'm going to give you a brief overview of how people actually build LRU caches. A cache needs various functionalities. Existence identification is one: you need to be able to tell if a certain element belongs to the cache. People build hash tables with chaining pointers and a hash value to, you know, figure out if something belongs to the cache. And for implementing a replacement policy, they have LRU pointers across the entire data set that they have. And the location information is, you know, the location on the storage where the actual object resides. And there's other metadata related to the objects themselves.
So I'm going to be talking about two systems for comparison. One is a popular open source cache called Squid, which needs about 66 bytes of memory per object for its index. A good part of that actually goes into a 20-byte hash value. And the 20-byte hash value itself is used for creating a storage location on the file system, so they don't have a separate memory requirement for the storage location itself.
And commercial systems try to improve on this. They store a much smaller hash value in memory, as opposed to storing a large hash value, and resolve collisions on the storage itself. And the memory requirement comes down to about 24 bytes per object.
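To give a sense of where those tens of bytes per object go, here is an illustrative C layout of a conventional in-memory index entry of the kind just described; the exact fields and sizes are assumptions, not Squid's or any commercial product's actual format.

    struct classic_index_entry {
        unsigned char key_hash[20];              /* e.g. a SHA-1 of the URL         */
        struct classic_index_entry *chain_next;  /* hash-table collision chain      */
        struct classic_index_entry *lru_prev;    /* global LRU list pointers        */
        struct classic_index_entry *lru_next;
        unsigned long disk_offset;               /* where the object lives on disk  */
        unsigned int  object_size;
    };  /* roughly 56-64 bytes per cached object on a 64-bit machine */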
So in HashCache, we try to rethink cache indexing itself. The requirement, as I said, is that the index use a very small amount of memory while indexing large storage, and we design memory and storage data structures for network appliance caches for this purpose.
As I said, it's an in-memory index that can be used for network caches like HTTP caches and WAN accelerators.
It's about six to 20 times more DRAM efficient than state of the art indexes
with good replacement policies and it was selected as one of the top ten
emerging technologies of 2009 by MIT Technology Review magazine.
The approach of HashCache is the following. The bulk of the memory usage in these systems actually goes into storing pointers: pointers for implementing hash table collision chaining, and pointers for implementing least-recently-used replacement policies. So the approach inside HashCache is to try to get both of these things, collision control and a replacement policy, without actually using any pointers.
So here's an overview of the HashCache techniques. There are eight policies in the paper, and we're going to be talking about three of them in the talk, but four of these policies are shown here. The Y axis is the indexing efficiency; you could think of it as how efficiently they're using DRAM. The X axis is performance.
As you can see, there are variants of HashCache that can actually match the performance of the open source system while being ten times more memory efficient. And there are also versions of HashCache that are actually five times more memory efficient than state-of-the-art commercial systems but still able to match their performance.
>>: So is the [indiscernible] for index efficiency the reciprocal of the bytes per object stored that you showed previously?
>> Anirudh Badam: It's given in terms of dollars for -- it's given in terms of
number of index entries that you can store per dollar.
>>: So that's equivalent to -- is that the same thing I said, or is that --
>> Anirudh Badam: [indiscernible].
>>: So can I read that factor of ten to mean you're storing 6.6 bytes per index object?
>> Anirudh Badam: The ten times more is essentially that, for a given amount of memory, HashCache can actually store ten times more entries.
>>: Okay.
>> Anirudh Badam: So that means that the index can be ten times larger for a given amount of memory.
>>: Ten times more objects for ten times --
>> Anirudh Badam: Ten times more objects.
>>: Okay.
>>: The two aren't equivalent. Adding more disk space is very different from
adding a large number of objects.
>> Anirudh Badam: It depends on the object size distribution --
>>: [indiscernible] 1K objects is a much harder problem than caching 100 million ten-meg objects.
>> Anirudh Badam: Absolutely. So I'll talk about varying object size distributions. The first part of the talk is going to be about an object size distribution which is essentially HTTP based, and the second part of the talk is going to be about object size distributions that are more conducive to network acceleration, essentially network content fingerprinting, and also storage content fingerprinting, which is useful for deduplicators.
So here's the first HashCache policy, HashCache Basic, which basically shows how the indexing works. Let's say you have a URL, and the data corresponding to this URL is the four blocks of data that I've shown you. What HashCache does is it takes an H-bit hash value of the URL using some hash function, and it uses a part of the storage system as an on-disk hash table. It takes N contiguous blocks on the magnetic hard drive and uses them as a hash table.
Then it takes the hash value modulo the number of bins and obtains a value T, let's say. In the T-th block of the hash table, it stores the first block of the data. And as I said, HashCache has a requirement of supporting objects of varying sizes. The question, naturally, is where do you put the rest of the data? If all the objects were the same size, you could have essentially used your whole disk as a single large hash table.
So what HashCache does is use a large portion of the magnetic hard drive as a circular log and store the remaining part of the data in that circular log. And the first block, in the hash table of N contiguous blocks that we call the disk table, stores a pointer to the log where the remaining data is actually stored. And as you noticed, this requires no bits per entry in DRAM, because the hash function that you have can be represented in a constant number of bits, and there is no per-object DRAM requirement. You just compute the hash value and use the disk as a hash table.
The advantage, as I said, is no index memory requirement, and it's tuned for one seek for most objects. If you design this disk table with a bin size such that, let's say, 70% of the objects are smaller than the bin size of your hash table, that means 70% of your objects can now be accessed in one seek on the disk.
The disadvantages are that it's actually one seek per miss. Misses, for any cache, are not going to improve your performance; they're just a nuisance that increases latency and isn't useful for application performance. So this sort of system, which uses the disk as a hash table, is wasting a seek on every miss, and it spends a seek in all scenarios, not just hits.
And there's also no collision control. The collision control is implicit in
the sense that if you have two URLs having the same hash value, they'd be
replacing each other. And there's also no cache replacement policy. The cache
replacement policy is also implicit in terms of hash collisions.
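As a rough illustration of the Basic policy just described, here is a C sketch of the lookup and store paths, assuming a simple block I/O interface; the constants, helper names and block-header handling are simplified and illustrative, not the actual HashCache code.

    #include <stdint.h>
    #include <stddef.h>

    #define BLOCK_SIZE   4096u
    #define TABLE_BLOCKS (1u << 24)     /* N contiguous blocks used as the disk table */
    #define LOG_BLOCKS   (1u << 26)     /* circular log that follows the disk table   */

    uint64_t url_hash(const char *url);                 /* assumed H-bit hash    */
    void disk_read (uint64_t block_no, void *buf);      /* assumed block I/O     */
    void disk_write(uint64_t block_no, const void *buf);

    /* One seek: read the candidate first block for this URL.  The block's header
     * (not shown) is compared against the URL; a mismatch is simply a miss, which
     * is how collisions implicitly evict each other in the Basic policy. */
    void basic_lookup(const char *url, void *first_block)
    {
        uint64_t bin = url_hash(url) % TABLE_BLOCKS;
        disk_read(bin, first_block);
    }

    /* Store: remaining blocks are appended to the circular log, and the first
     * block goes into the disk table with a pointer to that log position kept
     * in its header (header layout omitted). */
    void basic_store(const char *url, const char *data, size_t nblocks,
                     uint64_t *log_head)
    {
        uint64_t bin = url_hash(url) % TABLE_BLOCKS;
        for (size_t i = 1; i < nblocks; i++) {
            disk_write(TABLE_BLOCKS + *log_head, data + i * BLOCK_SIZE);
            *log_head = (*log_head + 1) % LOG_BLOCKS;   /* circular log wrap */
        }
        disk_write(bin, data);
    }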
So we suggest the following improvements. The first improvement to this technique is collision control. What we do is, instead of using a simple hash table, we use a set-associative hash table on the disk, and the associativity, let's say T, is configurable in HashCache. That's the first optimization of this basic policy.
The second one is that we would like to avoid seeks for misses. For this purpose, we store a small, low-false-positive hash value of each URL in memory. Now, the location inside the hash table itself gives you a lot of bits of the hash value, so you don't need to store the full hash. You only need to store a very small number of bits, because the location inside the set-associative hash table already gives you a fair number of bits of the hash value.
And the third thing we do is add a replacement policy, an LRU replacement policy. What we do is LRU within each set of size T, and for representing the LRU rank of each element in the T-sized set, we only need log T bits, as opposed to the pointers needed for representing global LRU information.
Now, this means that the replacement policy is not global but local to each particular set, and I'll talk about how local LRU replacement policies work well in practice across varying kinds of cache hit rates.
So here's a graphical representation of how this HashCache policy, called SetMem, works in practice. You have the same input, the URL and its data of four blocks. You hash the URL, take it modulo the number of sets that you have in your hash table, and you get the value T. As I said, there are two hash tables: an in-memory hash table that stores the hash values, and an on-disk hash table that stores the first block of the data.
In the T-th set of the in-memory hash table, what you store is a low-false-positive hash, and you also store the rank bits. The rank bits represent the LRU rank of each element within the set. Those are stored in the in-memory hash table.
The first block of data is stored in the on-disk hash table, and the rest of the data goes onto the log, like in the HashCache Basic policy, and the on-disk hash table stores a pointer to the log.
And as I said, the LRU is now within a set as opposed to being a global LRU policy.
In practice, we need 11 bits per entry for implementing this caching policy. The advantages of this are that there are no seeks for most misses. Most misses can be answered using the low-false-positive hash value that you have: if the hash does not exist in memory, you need not go to the disk.
It is optimized for one seek per read, which is a hit, and one seek per write, which is essentially when you take a miss, go fetch the object from the network, and want to cache that object right away, so you write it into the cache.
And it's a good cache replacement policy in just 11 bits. The disadvantage is that writes still need seeks. If you look at any sort of system that tries to optimize operations for magnetic disks, you'd like to coalesce all your writes and write them at the end of a log, essentially trying to maintain the disk as a log-structured system. That way, most of your random seeks are going to be used for reads as opposed to being used for writes.
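To make the SetMem bookkeeping concrete, here is an illustrative C sketch of the in-memory, set-associative part only; the field widths are assumptions packed into 16 bits for simplicity (the talk quotes about 11 bits per entry), and the on-disk table is unchanged from the Basic policy.

    #include <stdint.h>
    #include <stdbool.h>

    #define WAYS 8                      /* set associativity T */

    struct setmem_entry {
        uint16_t tag  : 8;              /* low-false-positive hash bits         */
        uint16_t rank : 3;              /* LRU rank within the set (0 = newest) */
        uint16_t used : 1;
    };

    struct setmem_set { struct setmem_entry way[WAYS]; };

    /* Membership test: most misses are answered here with no disk seek at all;
     * only a probable hit goes on to do the single seek into the disk table. */
    bool setmem_maybe_present(struct setmem_set *sets, uint64_t nsets, uint64_t h)
    {
        struct setmem_set *s = &sets[h % nsets];
        uint8_t tag = (uint8_t)(h >> 56);   /* bits not already implied by the set index */
        for (int i = 0; i < WAYS; i++)
            if (s->way[i].used && s->way[i].tag == tag)
                return true;
        return false;
    }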
So we propose the following further improvements to HashCache. The first thing we do is avoid seeks for writes: write everything in a log and eliminate the disk table. There's no on-disk hash table anymore. The natural question is, where does the pointer to the log go? The log location is actually stored in memory itself. This is not like a regular in-memory pointer; it's a pointer to the storage location, so it's essentially location information.
The second optimization we do is to avoid seeks for related objects. Let's say one of you goes to the main page of Microsoft and fetches the web page, and then a second person goes to the main page of Microsoft and fetches the content. The first person who downloaded the content would be doing a lot of random seeks to fetch all the content. There's no reason why the second person should also incur all those random seeks. If objects on the Microsoft page are related to each other and will be used at the same time, then it makes more sense to store them contiguously on the disk, so we do this optimization as well.
Whenever you fetch content, and the application expresses that this content will be accessed together, you actually store the data contiguously on disk. That's also an optimization that HashCache supports.
So here's the graphical representation of the HashCache Log policy. You take the hash value, you take it modulo the number of sets, and you get the value T. There's only an in-memory hash table; there's no on-disk hash table. The on-disk data structure is a simple log, and the LRU is within the set.
In the T-th set of the in-memory hash table, you store a low-false-positive hash, you store the rank bits necessary for representing the LRU information, and you also store the location bits. The location bits essentially store the pointer to the location in the log where the data is stored.
In our implementation, we needed about 43 bits per entry for this HashCache Log policy.
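For reference, here is one illustrative way to pack that per-entry state into a C bit-field; the talk cites roughly 43 bits per entry, and the exact widths chosen here are assumptions.

    #include <stdint.h>

    struct hclog_entry {
        uint64_t tag     : 8;     /* low-false-positive hash of the URL            */
        uint64_t rank    : 3;     /* LRU rank within the set                       */
        uint64_t log_blk : 32;    /* block offset of the object in the on-disk log */
    };  /* 43 bits of index state; the compiler pads this out in practice */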
Now let's move on to the evaluation. For evaluating HashCache, we used a system called Web Polygraph. It's the de facto testing tool for web proxies. It tests the hit rate, the latency and the transactions per second of a cache of a given size and a given hit rate. And we tested all these policies.
>>: In the previous slide, you're saying you use a hash value, you look in the memory table and then go to disk. Isn't that essentially -- haven't we gone back to having a hash table with --
>> Anirudh Badam: Sure. But instead of having pointers for chaining and for storing LRU information, we don't have any pointer requirements, which means that the memory overhead is quite low, and the original point, that more memory should lead to higher performance, actually holds. So HashCache Basic does not have any memory requirements, but the performance is really low because it uses seeks for misses. HashCache SetMem, on the other hand, stores a very small false-positive hash value in memory and gets you slightly higher performance.
HashCache Log gets you the best performance but uses the most memory. So you get that tradeoff in terms of memory requirement versus performance. But all three are dramatically more memory efficient than state-of-the-art open source and commercial systems, almost 6 to 20 times.
And we compared all variants of HashCache with Squid and Tiger, which are open source and commercial systems respectively. Tiger is the proxy of a commercially deployed CDN that is in use today. Our test box had a two-gigahertz CPU, 4 GB of DRAM and five 90-gigabyte hard drives. The paper actually tests hard drives as large as four terabytes as well.
But here I wanted to show the performance implications of using multiple disks and trying to get as many IOPS as possible inside a single HashCache system.
So this is how the indexing efficiency of HashCache looks. The X axis is essentially the size of the largest disk that can be indexed using a gigabyte of memory for the various caching techniques. Open source and commercial systems are barely able to index a 100-gigabyte hard drive using one gigabyte of DRAM. HashCache Basic can potentially index an infinite amount of disk because it does not have a per-object memory requirement.
And HashCache Log indexes about 1.5 terabytes of hard drive, and HashCache SetMem about six terabytes of hard drive, for a single gigabyte of DRAM.
>>: This is assuming some standard object size distribution?
>> Anirudh Badam: Right. So the object size distribution comes from Web Polygraph, which was used by commercial proxy vendors during the 2000 period, and they were getting their distributions from proxies that were deployed across many [indiscernible] and Cisco systems.
>>: And do you have some sense for, in today's web, how much of an increase in hit rate you get from indexing a larger disk?
>> Anirudh Badam: Absolutely.
>>: [indiscernible].
>> Anirudh Badam: So we originally started this project trying to see if we could provide offline-Wikipedia sorts of systems to countries in Africa and India. And in that setting, I mean, being purely offline sort of removes the scope for involvement of the user. So that's when we thought, okay, the right match would be something which can have a large amount of content offline but also have some amount of participation as well.
So that's when we thought large caches actually make a lot of sense in those sorts of settings, and we actually went and deployed some of those systems in Africa. We populated them with a large number of Wikipedia entries. The system not only serves as a cache but also serves as an offline Wikipedia, an offline encyclopedia sort of a thing.
So in those scenarios, it is useful. And the second scenario in which it is useful is this: that heavy tail is what you see for static content. When the content itself is increasing in size and you want to do content fingerprinting with your caching, when you want to do dynamic caching, you actually see a higher benefit. There was an IMC paper by a group-mate of mine last year, where he mined the logs of a system called CoDeeN that my advisor runs. It's a wide-area CDN deployed on top of PlanetLab. He did some experiments on what happens when you do content-based caching as opposed to static HTTP caching, and he saw that as the disk size increased from one terabyte to two terabytes, the hit rate actually increased from 16 to 27%.
So you don't see that long tail in the same way for dynamic content as for static content. But the application scenario originally was the offline Wikipedia setting for developing regions.
>>: [indiscernible] the web polygraph thing, those object sizes and
distributions were based on research done in 2000?
>> Anirudh Badam: Right.
>>: So the web's changed insanely in ten years.
>> Anirudh Badam: I'm going to come to that with the next system, which actually takes into account the characteristics of the modern web. This was also an IMC paper last year by a group-mate of mine, and it shows how the web has changed from, you know, the web 1.0 days, how object size distributions have changed, and how that actually affects the performance of these sorts of systems.
So in terms of memory usage, here is what HashCache does. With just 80 megabytes of memory, it is about five times more memory efficient than the state-of-the-art commercial system and is still able to perform well, still able to compare in terms of performance. And there's also a version of HashCache which is probably 80 times more memory efficient than an open source system and still able to perform comparably.
So the memory requirements are normalized to that of the highest-performing HashCache variant, which is 1X, and the memory requirements of the other systems are shown relative to HashCache SetMem and HashCache Log.
>>: As I look at this, you're measuring performance in requests per second per disk. But I haven't seen how you handle multiple disks.
>> Anirudh Badam: You simply stripe blocks across multiple disks. Because it essentially is a hash-based index, you can do some sort of trivial load balancing across multiple disks. You can even stripe blocks across multiple disks and try to get performance in parallel as well.
>>: Okay.
>> Anirudh Badam: So we've had some deployments in Africa. We still have a
deployment going on in Uganda, and these are the photographs from those
deployments.
The question is, how scalable is this model? If you have just a purely DRAM-and-disk system, you optimize your data structures so that you can bridge these gaps. But for how many applications can you do this, and is it even possible to do it for all applications? When do you run into the actual limitations for an application, in terms of needing more DRAM for performance?
So a scenario where this can happen is when the object size is much smaller, and even with a very nice technique like HashCache, you do not have enough DRAM to store the whole index. If your object size is, let's say, 128 bytes or 256 bytes, if you're doing aggressive network caching or aggressive deduplication for your storage system, this could happen: very small object sizes.
And for these scenarios, you still need indexes that could be much larger than the memory that you have.
So as I said, these disk properties have gotten better. But if you look at how much time you actually need to exhaust the entire disk using random seeks, it used to take 600 seconds back in 1980: reading 80 megabytes using random seeks of four or eight kilobytes each took 600 seconds. But now it takes about 270 days. That's almost [indiscernible], right?
So what this means is that even though capacity has increased, random seeks have not improved nearly as much. This is 40,000 times worse today, and this scenario does not actually go away if you use faster disks. Even if you have 15,000 RPM disks, you probably only get twice as many random seeks per second. And there's also a flip side to using a faster disk, because the largest-capacity disk that you can get with a higher-speed spindle is also much smaller, because it's a mechanical device, and the motors can only do so much.
So I'd like to reproduce this graph from the beginning of the talk, where there's a super-linear increase in the cost of having more DRAM in a single system image. But there's a nice price point here. There's this technology, flash memory. If you look at high-speed flash memory devices like Fusion-io, you can get a ten-terabyte SSD today for a price point at which you won't get even a single terabyte of DRAM inside a single system image.
So the second part of this talk is going to be about new memory technologies like flash, and here's a small primer on flash memory. You pay about ten dollars a gigabyte, but if you're buying in bulk, you can probably get five to eight dollars as well. And it does about 1 million seeks per second per PCI bus, and if you have multiple buses in the system, like a [indiscernible] machine, the IOPS actually scale with the number of buses that you have.
And it's also very high density. You can fit as much as 20 terabytes of flash in a single rack unit. And this also actually scales with the number of buses that you have: the more buses you have, the more PCI slots you have, the more of this flash you can have in a single rack unit.
And the advantage is that there's no seek latency, only read and write latency. And there's also a flip side, which makes things more interesting, I guess. Writes can only happen after an erase, and these erases are usually of 128-kilobyte blocks. And the number of erases that you can do is also limited: let's say you can probably only do on the order of 10,000 erases per block on today's state-of-the-art NAND flash devices. So reducing the number of writes helps with performance and also with the reliability of the device. If you want the device to last longer, you would want to do as few of these erases as possible.
So now the question is, you know, now that you have this technology, which falls in between these two technologies in terms of performance and capacity, how do you want to use it? Do you want to use it as a fast disk, or do you want to use it as a slow memory? So here's a simple experiment that we did. We took MySQL running TPC-C on a single box.
The first bar over there shows you the performance of TPC-C when you don't have any flash in the system, okay. That is normalized to a performance of 1. Now let's say you add some amount of flash into the system, you use it as a transparent block cache for the magnetic hard drive, and you run the TPC-C benchmark again. No source code is modified. You get about a 55 to 60 percent performance improvement.
Now, you take the same system, and instead of using the NAND flash device as a transparent block cache for the magnetic hard drive, you use it as slow memory: you configure MySQL to use the flash memory device as its buffer pool, and you get much higher performance. This is very [indiscernible]. In either case, no source code is modified. In one case you are using it as disk, and in the other case you are using it as memory.
The reason this happens is that the data structures you build for memory are more flash-aware in general, and the kinds of optimizations that we've done for disks do not apply to flash memory; they add some latency in terms of software. And that leads to this overhead when using flash as a slow disk as opposed to using it as slow memory.
So that's the reason why my proposal is to use flash as slow memory as opposed to using it as a fast disk.
If you look at the state of the art in terms of how one could use a flash memory device as memory, it is to use it as a swap disk or to mmap a file on the device, and there you go: your memory is now backed by the flash memory device. The advantage is that applications which are using malloc or mmap need not be modified. They can suddenly have a large amount of memory; it's just flash-backed.
But the disadvantage is that it does not address SSD problems. It is not flash-aware. It does not do any write optimizations. And it leads to a lot of random writes, because random writes are not a problem with respect to traditional physical memory technology, which is DRAM.
So to address these issues, what people usually do is rewrite the application. They optimize the applications for writes on top of flash memory, and the disadvantage is that you not only need application expertise, you also need SSD expertise. You're modifying the application to be more flash-aware and flash-friendly.
So the premise we had was, can we get the best of both worlds? That is, the application need not be modified -- that's the ideal case -- or at least a case in which the number of modifications to the application is minimized, but you're still optimizing for writes. This is the premise behind SSDAlloc, and we tried to solve this problem.
So we started off with a simple approach, a non-transparent approach. We created a custom object store on the flash memory device, which is a log-structured object store, and the log-structured object store simply gives you custom pointers to create, use, read and write these objects that you are creating on top of the SSD.
And you read and write objects as necessary. As I said, the SSD is managed as a log-structured object store. That not only minimizes the number of erases on the SSD, but also optimizes for writes, because the writes are now done at object granularity as opposed to being done to blocks or pages, like virtual memory systems do.
But the disadvantage of this approach is that it still needs heavy modifications. You need to know where the objects are in your system; you need to appropriately read them from and write them to the SSD. For example, when we tried to convert HashCache to this type of object-store model, where we were storing each set of HashCache as an object on the SSD, we needed [indiscernible] lines of code to be modified within HashCache. That was about five percent of the entire HashCache, just the indexing part, not the HTTP proxy part.
Now, that's a fairly large number of lines of code. I built HashCache myself, so it was easy for me. But for someone who is not familiar with the application, or for someone who is not familiar with the SSD, it might take much longer and it might be much harder.
So that's when we thought of this technique: can we use the virtual memory hardware? Can we use pointers themselves to figure out which objects are being written to and which objects are being read from? Can we use virtual memory addresses as the object handles?
That's when we had this idea.
So you could use page faults to detect reads and writes, which data is being accessed in the system, and on a fault, you materialize the data from the SSD and manufacture a page. And you move idle objects to the SSD's log, right?
But the problem is that, you know, pages are 4 KB in size. No modern virtual memory system gives you access information at a granularity of less than four kilobytes. There were systems which did this for one kilobyte and much smaller sizes; they do not exist anymore. And on all modern systems, the smallest page size you can configure is four kilobytes.
And so then we had this wild idea of aligning objects to page boundaries. Instead of having contiguous virtual memory allocations, why not give each object its own page, and the virtual memory system itself will tell you which page is being modified? And if you have this model, called one object per page, you know precisely which object is being modified. The rationale behind this crazy idea is that virtual memory is cheap, right? You have 40-plus bits of virtual address space, and this is much larger than the physical memory that you have in the system. But physical memory is not cheap.
So you still need to build the system such that even though there is frivolous use of virtual memory, there is no waste of physical memory. That's the technique in SSDAlloc.
>>: Virtual memory is not that cheap either. I'm wondering what one object per page does to your TLB.
>> Anirudh Badam: Absolutely. So the premise we had in SSDAlloc is that it's for improving the performance of network appliance caches. Network appliance caches are usually bottlenecked by network latency and other aspects, as opposed to being CPU bound. So for these sorts of systems, let's say you create all your objects -- let's say we're talking about something like memcache. Inside memcache, you have a hash table, which holds the values.
If you create your values using SSDAlloc and you create your hash table using malloc -- the good thing about SSDAlloc is that it coexists with malloc -- so if you arrange for your CPU-intensive data to come out of malloc and your I/O-intensive and network data to come out of SSDAlloc, you'll not be seeing so many TLB misses. The TLB misses will be equal to, let's say, the IOPS of the memcache box itself, which most modern CPUs can actually handle fairly well.
I'll come to the overhead of SSDAlloc in more detail.
So here's an overview of how SSDAlloc works. Let's say you have a memory manager like malloc; in this case, it's SSDAlloc. Let's say you create 64 objects of size 1 kilobyte each, okay. What a traditional malloc memory manager would do is take 16 four-kilobyte pages, split them into 64 one-kilobyte objects and return those pointers to the application.
What SSDAlloc does is create 64 different four-kilobyte virtual memory pages and hand these 64 pointers to one-kilobyte blobs of data to the application. The remaining three-kilobyte portion of each of these virtual memory pages is not used for future allocations.
Clearly, this frivolous use of virtual memory can lead to wasting of physical memory, so what we propose is a very small page buffer. We restrict the number of virtual memory pages that are in core at any point of time to a very small quantity, let's say five percent of your DRAM holds virtual memory pages that are resident in core at all points of time.
The remaining 95 percent, the remaining large portion of DRAM, is actually used as a compact object cache, where the objects that you create using this memory allocator are stored compactly behind a custom hash table interface.
And because the virtual memory system tells you which page has been modified, you know precisely which objects have been modified, which means you can use the SSD as a log-structured object store.
Now, that is the rationale behind -- this is how the SSDAlloc system works.
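Here is a minimal C sketch of the one-object-per-page idea just described, using an access-protected anonymous mapping per object; this is not the actual SSDAlloc API, and the bookkeeping that ties a page back to its object on the SSD is omitted.

    #include <sys/mman.h>
    #include <stddef.h>

    #define PAGE_SIZE 4096u

    void *ssd_alloc(size_t size)
    {
        if (size == 0 || size > PAGE_SIZE)
            return NULL;                    /* larger objects handled elsewhere */

        /* One virtual page per object, mapped with no access rights, so the
         * first touch faults and the runtime can materialize the object from
         * the RAM object cache or the SSD's log-structured object store. */
        void *page = mmap(NULL, PAGE_SIZE, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED)
            return NULL;

        /* The object lives at the start of the page; the remaining bytes of
         * the page are never handed out to any other allocation. */
        return page;
    }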
So to implement the system, all the virtual memory that is allocated to the application is protected with mprotect, and when the application tries to access any of this data, it takes a fault. In the fault handler, we figure out where the object is: is it currently in the RAM object cache or on the SSD? If it's in the RAM object cache, we manufacture the virtual memory page immediately and return control back to the application from the fault handler.
And if it's not, we fetch it from the SSD, cache it in the RAM object cache, manufacture a page, put it in the page buffer and return control back to the application.
So coming back to Andrew's question, this is the overhead in SSDAlloc. We took a 2.4 gigahertz quad-core CPU with about 8 megabytes of L2 cache. Just for a TLB miss -- so not a miss that goes to DRAM, but just for the hardware to figure out that this is a TLB miss -- it takes about 40 nanoseconds. The address translation for SSDAlloc, which translates virtual memory addresses into SSD locations, takes about 46 nanoseconds. Page materialization, which is taking a new physical memory page and copying the data from the RAM object cache into that physical memory page, takes about 138 nanoseconds. And the combined overhead of the whole process of manufacturing a page on a fault is about 800 nanoseconds.
And the bulk of the overhead is still the NAND flash latency, which is still about 30 to 100 microseconds. So if your application performance is bottlenecked by the random IOPS of the NAND flash device, then you're not so much bottlenecked by the TLB itself, because the NAND flash is probably doing somewhere between 300,000 and 400,000, maybe even a million, IOPS.
And in those cases, the TLB is not going to be the bottleneck. So here's a results overview. We modified four different systems: memcache, a Boost library B+ tree, a packet cache, which is essentially caching packets inside a network and trying to do network [indiscernible], and we also modified HashCache.
The modifications required only 9 to 36 lines of code. The simple ones just replace malloc calls with SSDAlloc calls. But some of the systems are more complex. For example, memcache, instead of mallocing individual objects, runs its own memory allocator: each time, it allocates a large two-megabyte or so portion and builds its own memory allocator on top of it. So to modify such systems, we needed a few more lines of code.
So in terms of performance, SSDAlloc obtained up to 17 times more performance than unmodified Linux swap, and up to 4.5 times more performance than log-structured swap. Clearly, the native Linux swap is not a good comparison point, because it is not built for flash memory devices; it was originally built for disks. And in the disk world, once you started swapping, performance really did not matter. It was more like a barrier before killing an application, a worst-case-performance sort of a thing.
So that's why we modified the Linux swap to be log-structured, where it stores the pages in a log-structured fashion. We also show this as a comparison point, and the reason why SSDAlloc performs 4.5 times better than this modified Linux swap, the log-structured swap, is because a write is actually restricted to object sizes, as opposed to marking data dirty at page granularity, which means the amount of data written is much lower for random writes.
And most of the systems that we modified are hash-based systems, which means the writes are going to be randomized and the reads are also going to be randomized.
And our writes are restricted to object sizes, and we actually cut off the size of the object at 128 bytes. You might be wondering, with the one-object-per-page model, not only the TLB, but your page tables are also going to be larger, right? So if you have, let's say, two-byte objects or five-byte objects, you don't want to create a page table entry for a five-byte object. What we do is cut off the object size at 128 bytes, and if you have an object requirement of less than 128 bytes, we create those objects from those 128-byte pages.
So we built a coalescing allocator on top of these 128-byte pages, and any object smaller than 128 bytes is created from that allocator, which means that the smallest object size that you can have in SSDAlloc is 128 bytes. And if you have random writes, it means that SSDAlloc is going to write many times less data to the SSD, because in a standard virtual memory system you're going to be marking data dirty at page granularity.
So here are some microbenchmarks to show the benefits of SSDAlloc. We took a 32 GB array of 128-byte objects, and we created these objects using SSDAlloc, and using malloc on unmodified Linux swap and also on log-structured Linux swap. And we varied the mix of reads and writes in the system: we started off with all reads, and then we went all the way to all writes.
So if you look at the throughput, the number of requests per second, in terms of thousands of requests per second, these are normalized performance increases. If you look at how SSDAlloc performs over unmodified swap, the improvement is anywhere between about 2.5 and 14.5 times. That is, as the fraction of writes increases, the performance of SSDAlloc gets relatively better. This is explained by SSDAlloc needing to write much less and doing sequential writes, as opposed to the random writes that unmodified swap does.
And if you look at SSDAlloc's benefits over log-structured swap, the benefits are still in terms of having to write much less for random writes. This is because our writes are restricted to object sizes as opposed to four-kilobyte pages. And we verified these results against SSDs from five different vendors in the paper.
I encourage you to read the paper for a detailed set of results.
And here are some benchmarks for memcache. We took 30 gigabytes of SSD and 4 GB of DRAM, and four memcache clients were essentially sending requests to this particular memcache server. We modified the slab allocator of memcache, as I said, to use SSDAlloc, and these are the performance results. The X axis is the average object size inside memcache. As the average object size decreases, the performance advantage of SSDAlloc increases. But if you look at the performance of SSD swap and SSD log-structured swap, their performance does not actually depend on the object sizes, because regardless of what the object size is, they're marking the entire four-kilobyte page dirty, and they would still be reading an entire four-kilobyte page from the SSD even though only one object is needed.
And this is for a random workload like memcache.
So here's some of the current work that I've been doing, trying to optimize SSDAlloc. The first thing we are working on is rethinking virtual memory management itself. In SSDAlloc, obviously, there's this problem that 95 percent of your physical memory is available only after a fault, and if your working set size can actually fit in DRAM, this is clearly a nuisance.
So for this purpose, we still want to maintain this one-object-per-page semantic, but we don't want the overhead of having to take a fault for accessing a huge portion of your DRAM. So what we do is the following.
What we do is we create sparsely mapped virtual memory pages. Take these four virtual memory pages: each of them is actually sparsely mapped. The first one is mapped to the application in its first kilobyte, the second one in the second kilobyte, the third in the third, and the fourth one in the fourth. Now, these virtual memory pages are mapped to the application in non-interleaving portions, which means that just by using some page table magic, they can actually share one physical memory page. That means you're still going to get the advantage of detecting reads and writes at object granularity, but you'll also be able to back all your physical memory with useful objects without actually having any page faults.
>>: At the top, you've got a virtual memory page, but you're not using all of that page, only part of it.
>> Anirudh Badam: Right. And the usage is determined by malloc. So if the application called for a malloc of one kilobyte, I'm going to give it one kilobyte, and the contract between the malloc user and the malloc implementer is that you're not going to touch data beyond the bounds of that allocation.
>>: So they're not overlapping. So when it maps down to a physical page, they all sort of fall into place.
>> Anirudh Badam: Right. So it still gives you pages which are smaller than four kilobytes, which means you can do all the SSD optimizations that you want, but your DRAM is not wasted. Whenever these sparsely mapped virtual memory pages are in core, they're going to be sharing physical memory. That's the advantage.
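To illustrate the sparse-mapping idea being discussed, here is a small user-space C demo in which four virtual pages share one physical page via a memfd, each mapping only ever touched in its own non-overlapping 1 KB slot; the kernel-side mechanism the ongoing work actually uses may well differ.

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>
    #include <stdio.h>

    #define PAGE_SIZE 4096u
    #define SLOT      1024u            /* 1 KB object slot within each page */

    int main(void)
    {
        int fd = memfd_create("shared-frame", 0);
        ftruncate(fd, PAGE_SIZE);      /* a single page of backing store */

        /* Four distinct virtual pages, all backed by the same physical page.
         * Object i is only ever accessed at offset i*SLOT through mapping i,
         * so the objects never interleave even though the frame is shared. */
        char *vp[4];
        for (int i = 0; i < 4; i++)
            vp[i] = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);

        vp[0][0 * SLOT] = 'a';         /* object 0, via virtual page 0 */
        vp[1][1 * SLOT] = 'b';         /* object 1, via virtual page 1 */

        /* Both writes landed in the one shared frame. */
        printf("%c %c\n", vp[3][0 * SLOT], vp[3][1 * SLOT]);   /* prints: a b */
        return 0;
    }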
So that's the first piece of ongoing work. The second ongoing work is trying to do more virtual-memory-centric tiering. What happens in traditional operating systems is that you don't have a reverse pointer from a physical memory page to the virtual memory page. So in traditional operating systems, when they have to select a candidate for paging out, they have some sort of usage information, statistics, on a physical memory basis. They don't have it on a virtual memory basis.
So when they have to page out a page, they need to go to each individual process, scan through the virtual memory regions and see which of these pages has not been recently used.
Now, when they find such a page, the page is paged out. So this is a CPU-intensive process; there's a scan of all these virtual memory pages going on.
So in TEAM, we tried to make this more virtual memory centric. We maintain virtual memory page usage statistics. For all the virtual memory pages across all the processes, we maintain usage statistics, and these statistics are used directly, which reduces the CPU requirement of implementing page replacement policies in our operating systems.
And the second optimization we have done as part of TEAM -- TEAM stands for Transparent Expansion of Application Memory, by the way -- is the following. Virtual memory page tables are already doing some sort of translation: they're translating virtual memory addresses to physical memory addresses. And flash memory FTLs also do some sort of translation: they translate logical block addresses to physical block addresses.
So there are two levels of indirection going on in systems where you would like to use flash memory devices as virtual memory. We tried to reduce this overhead, and we implemented an FTL entirely using virtual memory page tables, as a sparse FTL. That's the second part of the work.
There's a third thing that we're trying to do: introducing some amount of flash memory inside a cluster. It's more like the cluster is implementing a block device -- a block device comes from the cluster -- and we are trying to do some sort of custom tiering within the cluster. What we're trying to do is, let's say you have a block that is cached in DRAM on a particular machine, and the same block is also cached in DRAM on a different machine. Now, if these two systems are not talking to each other, they might make suboptimal decisions about how to implement the page replacement policy. So we're trying to see if tiering can be done on a cluster basis, where these systems are actually communicating with each other with respect to access frequencies and doing some sort of collaborative tiering, so you don't make these decisions independently from machine to machine.
So that's some ongoing work, and here is some future work that I have. There's some amount of work in maintaining page tables for multi-tiered systems that are using large amounts of virtual memory. Because we are taking faults at a high rate, multiple threads need to modify page tables concurrently, so there is a scalability question, and some interesting work can be done on lock-free, data-structure-style management of page tables inside operating systems.
And there's also some work that we can do in terms of how to tier flash memory in a cluster. You can look at two different models: one is the SAN model and the other is the Hadoop model, and somewhere in between as well, where some nodes have more flash memory than other nodes, which means that they're more [indiscernible] for data.
And the third is in terms of the reliability of transparent tiering systems. If you have a transparent system and the data is corrupted in one layer, it will eventually be corrupted in all the layers, because the data still needs to trickle down. So we need to figure out methods by which, you know, these transparent tiering systems are more reliable and more secure, in terms of what kind of safety barriers you can put in when managing the second-level tier, the flash-level tier.
So here's a summary of the talk. HashCache is a DRAM-efficient and seek-optimized cache index with an LRU-like replacement policy. It's about six to ten times more memory efficient than commercial systems, and 20 to 50 times more memory efficient than open source systems.
It enables large disks to be indexed with only netbook-class machines. And SSDAlloc is a complementary technique which is useful for building such hash tables: if the hash table size grows much larger, you can use an SSD without actually having to rewrite your application, and the modifications are restricted to only the memory allocation portions of the code. In most of the systems that we modified, it was only 6 to 30 lines of code. And the systems end up performing three to five times better than existing transparent ways of using flash memory as virtual memory. SSDAlloc also has the benefit of increasing SSD life by up to 32 times.
Now I'll take questions.
>>: So you said that it performs a lot better than existing transparent
techniques. Can you relate it to existing techniques that are SSD aware?
>> Anirudh Badam: Sure. So in terms of this, I have an example. Let's say you
are implementing a [indiscernible]. One way you could implement a
[indiscernible] is to take an existing [indiscernible] library that is built
for DRAM and use the SSD as a swap device. And the second thing you could do
is, instead of doing these random writes in a [indiscernible] filter at a 4 KB
granularity, you can do them at a much smaller granularity using something like
SSDAlloc.
This is clearly beneficial, but SSDAlloc is clearly not aware of all that the
application is trying to do. So if the application developer is aware of a
better data structure which supports this sort of thing, you can implement that
data structure not only using this interface, but also using the native storage
interface.
Now, we propose using virtual memory as the interface for flash memory only
because, if you look at new memory technologies like phase change memory and
other such memory technologies, going forward these technologies are getting
lower in latency. It makes more sense to build this system to use them as
memory, as opposed to using them as storage.
So this is a complementary technique. It enables the use of flash memory on a
wide variety of systems. But data structures are really still important. If
your application developer can build a smart flash-aware data structure, you
should do it.
>>: That makes sense. Can you give me an idea of, I mean, are flash-aware data
structures going to perform many orders of magnitude better than the
transparent things? In other words, have you gotten most of the benefit, or
have you gotten only a fraction of the benefit you can get? I mean, if the core
of the application is one or two data structures that can be rewritten, that
can buy you an additional order of magnitude again. Transparency is nice, but
really, when you're all done, you're going to want to actually use the custom
[indiscernible] structure.
>> Anirudh Badam: So there's still a reason why you -- I mean, a lot of people
will spend a lot of effort in going and building cache-line-aware software.
Even in a purely DRAM world, you modify your data structure so that your caches
are being used appropriately. So in terms of building a data structure
appropriate for the underlying memory technology, those sorts of techniques
will still be important. So coming back to the bloom filter example --

>>: Are there additional orders of magnitude there, or --
>> Anirudh Badam: So to come back to this bloom filter example, you could build
a bloom filter in the regular way using small page sizes, or you could build a
bloom filter such that all your [indiscernible] functions actually reside in
the same block, right. You could do some additional randomization to make sure
that all the [indiscernible] functions needed for the bloom filter are in the
same block.

Now, this really gives you a K times performance improvement. So depending on
what the [indiscernible] of your bloom filter is, you'll get such a benefit.
So I guess, I mean, this makes flash memory useful as memory, because building
in-memory data structures is much easier than building on-disk data structures.
So in terms of development, it's going to reduce your overhead. Performance
will still heavily depend on the kind of data structure you build and how flash
aware it is.
>>: Thank you.
>>: Seems like most of the complexity in SSDAlloc comes from the fact that you
want to support existing unmodified code, to the extent that somebody calls an
allocator, gets back a pointer, and at any point in time in the future they can
just go [indiscernible] with the state that that thing points to. So that means
you need to track updates through the virtual memory system [inaudible]. And so
I have two questions. One, am I correct in thinking that this is strictly worse
than a system in which you have an explicit "access this object, can I get a
copy of it, now save this object" API?
>> Anirudh Badam: Um-hmm.
>>: Have you considered whether an API like that -- like, I'm thinking that it
would not be very hard, either through compiler transformations or manual code
rewriting or a little bit of hacking, to take some existing data structure and
modify it to insert these calls to say, you know, acquire/release, for example;
you could do that when you're acquiring and releasing. Locks, perhaps, so you
could use TM techniques. You could do a little bit of magic that would make a
lot of the overhead of doing it completely at the virtual memory level go away.
>> Anirudh Badam: Definitely. In terms of what semantics the programming
language has, there are multiple optimizations you can do. One is, let's say
you have some sort of transaction semantics. Then you can comfortably get away
with your RAM object cache and the other requirements. Say at the beginning of,
let's say, a function call, you tell the [indiscernible], these are the sorts
of objects I'm going to be using.

The handle could still be virtual memory, but it would take away some of the
overhead that SSDAlloc has. So in trying to be completely transparent, yes, we
definitely had to jump through a lot of hoops. But in trying to be less
transparent, I mean, it's sort of a trade-off: whether you want to use custom
object handles, or you want to be [indiscernible] to existing applications
which, you know, have used virtual memory, and try to make their life easier.
>>: I guess my question is similar to John's. Do you have any sense, in the
spectrum of achievable performance, of where these different tradeoffs sit?
>> Anirudh Badam: Right. So we did a sort of spec benchmark with SSDAlloc. And
we also did a spec benchmark with the chameleon system I was talking about,
where all the physical memory is actually accessible using these virtual memory
pointers.

And in that case, we actually saw a performance increase of about 16 times. And
this is a purely in-memory benchmark that does not use the SSD at all. But with
an SSDAlloc-like technique, where you are taking a fault to actually access,
let's say, 9 percent of your data, that was actually what was causing the
problem.
>>: I guess my question was different, which is: if you were using the SSD,
what's the overhead of doing it in a completely optimal, handwritten data
structure, versus doing it if you're prepared to tweak your object access
routines, versus this? How do those things --

>> Anirudh Badam: So one pathological example that I gave, that I just gave,
was the
bloom filter example. There are these flash-aware bloom filters. What they do
is make all their K hash functions actually fit into the same block. They use
additional [indiscernible] for this.

And if you use a regular bloom filter, which is not flash aware, which was just
built originally for DRAM, the K random writes could actually translate to K
SSD random writes, so there you get [indiscernible] of K times performance
improvement, not only in terms of number of writes but also in number of reads.
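Here is a hedged sketch of that kind of block-local bloom filter (the block size, the mixing hash, and all names are my assumptions, not details given in the talk): the first hash picks one flash-block-sized region, and the remaining K probes stay inside it, so an insert or a lookup touches one flash block instead of K.

    #include <stdint.h>

    #define BLOCK_BYTES 4096u                 /* one flash block        */
    #define BLOCK_BITS  (BLOCK_BYTES * 8u)
    #define NUM_BLOCKS  (1u << 12)            /* 16 MB filter overall   */
    #define K           8                     /* probes per key         */

    static uint8_t filter[NUM_BLOCKS][BLOCK_BYTES];

    /* Simple 64-bit mixer standing in for a real hash function. */
    static uint64_t mix(uint64_t x)
    {
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
        return x ^ (x >> 33);
    }

    /* All K probes for a key land in one block: the first hash picks
     * the block, the rest pick bits inside it. */
    void bloom_insert(uint64_t key)
    {
        uint64_t h   = mix(key);
        uint32_t blk = (uint32_t)(h % NUM_BLOCKS);
        for (int i = 0; i < K; i++) {
            h = mix(h + (uint64_t)i + 1);
            uint32_t bit = (uint32_t)(h % BLOCK_BITS);
            filter[blk][bit / 8] |= (uint8_t)(1u << (bit % 8));
        }
    }

    int bloom_maybe_contains(uint64_t key)
    {
        uint64_t h   = mix(key);
        uint32_t blk = (uint32_t)(h % NUM_BLOCKS);
        for (int i = 0; i < K; i++) {
            h = mix(h + (uint64_t)i + 1);
            uint32_t bit = (uint32_t)(h % BLOCK_BITS);
            if (!(filter[blk][bit / 8] & (1u << (bit % 8))))
                return 0;                     /* definitely not present */
        }
        return 1;                             /* possibly present       */
    }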
So there are certainly such examples, in terms of what the right way is to
build a flash-aware data structure and what the right interface is. People are
fairly familiar with building for memory; I mean, they find it more convenient
to build in-memory data structures as opposed to building storage-like data
structures.

So that's the reason why we propose the [indiscernible] interface moving
forward with new non-volatile memory technologies.
>>: In the SSDAlloc part of the talk, you mentioned different vendors. In our
market, we do see a wide variety of devices with very different performance
characteristics. And your work tried to integrate the SSD into the virtual
memory system, where probably there are not many vendor specifics to consider.
I guess I'm basically wondering how the performance differentiation goes across
vendors' systems.
>> Anirudh Badam: Sure. So there are some flash memory devices which actually
do more random writes than random reads. This is because they're bottlenecked
by [indiscernible], and if you're doing a lot of random writes, they're still
doing a lot of sequential writes onto the SSD. For those sorts of cases, it's
because they have a fully associative FTL. So the performance simply
[indiscernible] down to the kind of FTL that they have. If your FTL is not
optimized for random reads and random writes, the performance benefit would be
very high with SSDAlloc. But if your FTL is actually fully mapped and can do
random writes very well, the performance benefits will boil down to the object
sizes. Let's say your object size is five [indiscernible] and the virtual
memory system is actually mapping at four kilobyte granularity, dirtying four
kilobytes at a time.

Then in the SSDAlloc case, you would still be writing eight times less data. So
even in those cases, you get the benefit. So regardless of what the FTL
looks like, you end up getting benefits from the SSDAlloc model.
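A back-of-the-envelope check of that eight-times figure, assuming 512-byte objects (the talk only implies that objects are much smaller than a page):

    #include <stdio.h>

    int main(void)
    {
        const double page_bytes   = 4096.0;   /* VM dirty granularity     */
        const double object_bytes = 512.0;    /* assumed object size      */
        const double updates      = 1e6;      /* number of object updates */

        double page_gran   = updates * page_bytes;
        double object_gran = updates * object_bytes;

        printf("page-granularity writes:   %.2f GB\n", page_gran / 1e9);
        printf("object-granularity writes: %.2f GB\n", object_gran / 1e9);
        printf("write reduction:           %.1fx\n", page_gran / object_gran);
        return 0;
    }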
>>: Another question. With SSDAlloc, how much additional in-memory data
structure do you need to maintain, basically to account for these fragmented
objects?
>> Anirudh Badam: So there is the page table overhead for SSDAlloc -- the
virtual memory to actual SSD location mappings that you have. For the page
tables, yes, I mean, each page table entry is probably 8 bytes in size for
64-bit systems. And in those cases, the page tables, the data structure of the
[indiscernible] SSDAlloc is [indiscernible], so the page tables can actually
reside on the SSD itself.

So as in traditional virtual memory systems, if you have a four-level page
translation mechanism, then many of the pages required for representing the
page table itself can reside on the SSD itself.

So it depends on the working set size: if your working set size is small enough
that it fits in DRAM, then you wouldn't see much overhead in terms of page
tables -- the overhead of storing these mappings in DRAM -- because the working
set size is small. Hopefully the page tables themselves will [indiscernible].
>>: So, I mean, in the beginning you mentioned DRAM. Right now, we have very
large SSDs. Ten terabyte SSDs.

>> Anirudh Badam: Yes.

>>: So if you're actually basically mapping such a large space, what is the
data structure size? How much is that --

>> Anirudh Badam: So with the current version of the fusion [indiscernible], a
ten terabyte drive would have about a 32 gigabyte FTL requirement, and the
current fusion drives actually don't have a [indiscernible] FTL, which means
that all of this FTL is going to be in DRAM. So I think this summer, or by the
end of this year, they're coming up with an FTL that actually resides on the
SSD: only the mappings that are currently being used by the application or by
the OS are actually stored in DRAM, and the rest of the mappings are stored on
the SSD.
So yes, moving forward, if you want larger flash memory devices, you need to
think about mapping overhead. And mapping overhead can be reduced, as I
mentioned in one of the systems, by not storing page table mappings and
[indiscernible] mappings separately but trying to combine them; that is one
sort of optimization you can do to reduce mapping overhead.

The second optimization you could do is make the mapping data structures
themselves pageable, so that you do demand fetching of the mappings themselves.
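A rough calculation of why the mapping overhead matters at this scale, assuming 4 KB pages and roughly 12 bytes per mapping entry (the entry size is my assumption; it lands near the 32 gigabyte figure mentioned above):

    #include <stdio.h>

    int main(void)
    {
        const double device_bytes = 10e12;    /* 10 TB of flash          */
        const double page_bytes   = 4096.0;   /* mapping granularity     */
        const double entry_bytes  = 12.0;     /* assumed bytes per entry */

        double entries = device_bytes / page_bytes;
        printf("mapping entries needed: %.2e\n", entries);
        printf("DRAM for the mappings:  %.1f GB\n",
               entries * entry_bytes / 1e9);
        return 0;
    }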
>>: How long is it to [indiscernible] SSD?
>> Anirudh Badam: So if you look at the fusion IO drive, it's an 80 gig drive.
It's [indiscernible] for five petabytes of writes, and as the technology goes
farther downward, what ends up happening is that these companies would like to
give you a warranty for three years. But as you go to higher capacity, the
number of erases that you can do on these devices is decreasing.

So right now they're at a point where five petabytes is actually not enough for
three years if you're doing writes 24/7. But fortunately there are not many
applications which are doing writes 24/7. If you look at more realistic
applications -- you know, the logs from the data centers of their customers --
they can still last five years on the shelf.
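A quick sanity check of that endurance point, assuming a device that sustains roughly 500 MB/s of writes around the clock (the rate is my assumption; the five petabyte budget is the number from the talk):

    #include <stdio.h>

    int main(void)
    {
        const double budget_bytes = 5e15;     /* 5 PB of rated writes   */
        const double rate_bps     = 500e6;    /* assumed 500 MB/s, 24/7 */

        double seconds = budget_bytes / rate_bps;
        printf("time to exhaust the write budget: %.0f days\n",
               seconds / 86400.0);            /* roughly four months    */
        return 0;
    }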
But I guess once they move to [indiscernible] devices, they're going to have
this problem of using these MLC devices, which are rated for only 5,000 cycles.
So what they're doing is, instead of giving you purely SLC devices, the flash
vendors are taking MLC devices. If you look at, let's say, 100 gigs of SLC
versus 100 gigs of MLC, you would think the MLC is certainly cheaper. But what
ends up happening is the economies [indiscernible] are different, so MLC is
actually cheaper than SLC per byte -- not just twice as cheap per byte, but
actually much cheaper than that.
So what these companies are doing is, instead of buying SLC, they're buying MLC
and using MLC as SLC. They're not using the two levels inside the MLC devices,
which actually gives them a slightly higher number of write cycles as well.

So when they go to [indiscernible] with a really low number of [indiscernible],
you're getting something like 8,000. So maybe for the next four or so years
they can maintain the same rating -- you can write five petabytes and it will
still last for five years, that sort of a number.
>>: A quick question. I'm wondering about [indiscernible] SSDAlloc in managed
languages, where you might be able to track object use without going down to
the DRAM level, but also without requiring changes to the language.
>> Anirudh Badam: Right. So what we can do is take something like the JVM and
modify it to use something like SSDAlloc, and then you readily have the best of
both worlds: the JVM itself is using the SSD in an object-oriented fashion, and
you also have the benefit of being able to translate pointers independently
without having to change things for the application.

So you can definitely do that. And instead of going this route, what you can do
is modify the JVM itself to be flash aware, because they're all custom
pointers, and you also have an indirection table within the JVM itself which is
mapping the Java pointer to a particular virtual memory pointer, let's say.
Those can be transparently modified as well.

So we tried doing this inside the JVM -- we essentially went inside the JVM and
tried to see what happens -- and it was not so straightforward. The JVM also
has a coalescing memory allocator that creates large, five megabyte portions
and then splits those into objects, so it was much more complex than that.
>> Jeremy Elson: Okay. Thank the speaker.