>> Jeremy Elson: Welcome, everyone, thanks for coming. It's my pleasure to introduce Anirudh Badam, who is here today from Princeton. He'll be telling us about how to use SSDs and other sundry devices to bridge the gap between memory and storage. And just in case you don't find that interesting enough, when you get a chance to talk to him, he also enjoys rock climbing and reading books about modern physics. So feel free to question him ruthlessly on those topics as well. Thanks a lot. Go ahead.

>> Anirudh Badam: Thanks for the introduction, Jeremy. I'm Anirudh Badam. I'm from Princeton University, and I'm going to be talking about bridging the gap between memory and storage today. The motivation behind the talk is that over the last two or three decades, the capacity and performance gaps between memory and storage have been increasing. These gaps are not only in terms of random IOPS but also in terms of capacity. If you look at state of the art memory units today, they can do anywhere between a million and 1.5 million IOPS. And if you look at state of the art disks, they can do between 100 to 300 seeks per second; there are some that do slightly higher as well. With respect to capacity, you can probably get somewhere between 200 to 300 gigs of DRAM on a [indiscernible] basis today, and that's the maximum you can get. There are capacity constraints for really high density DRAM: not only does the logic board need to be capable of supporting high density DRAM, but high density DRAM is also expensive and does not [indiscernible] beyond a certain point at any point in time. Eight terabytes of disk per rack unit is probably conservative; you can get as much as maybe 20 terabytes of magnetic hard drives into a single rack unit today.

So the big question is how do you bridge these gaps. For applications that have an IOPS requirement between these two but a capacity requirement also between these two, how do you bridge these performance and capacity gaps? This talk focuses on helping network appliance caches bridge this gap: network appliance caches like web proxy caches, [indiscernible] caches which cache static objects, WAN accelerators which have the ability of caching dynamic content based on content fingerprinting, and inline data deduplicators which essentially index all the data you're trying to deduplicate and do inline deduplication before you store things on disk. And also file system and database caches. This talk is essentially trying to help these systems scale up by using software and hardware techniques.

So the first technique that I'm going to talk about today is called HashCache. This is from NSDI 2009. It's about reorganizing your data structures such that you can bridge the capacity gap between DRAM and disk. The second part is going to be about using alternative memory technologies, which have neither the performance bottlenecks nor the capacity bottlenecks of these two technologies and fall somewhere in between, and about finding the right way of using these technologies to bridge these gaps for network appliances. That system is called SSDAlloc and was published in NSDI 2011.
So before delving into the talk, I'd like to give a brief introduction of how caching works, because that's the kind of system we're going to focus on. Let's say there are three clients that are trying to talk to a server, and these three clients are interested in some content, the [indiscernible] red block over there, and they each fetch the content from the server individually. That increases the network traffic if these clients are co-located, the server they are talking to is the same one, and the content is the same. The way you solve this problem is to introduce a smaller box that is nearer to the clients; this box downloads the content [indiscernible] that all three clients want, and the clients individually fetch the content from the nearby box. This decreases the [indiscernible] on the wide-area link and also decreases the latency for the clients accessing the content.

So here's an architectural overview of how people view caches. There's cache application logic that multiplexes between client connections and server connections and [indiscernible] from the server to the client. The storage is essentially architected as two different parts: one is the cache index and the other is the block cache, which is the data cache that caches as much as it can. The cache index is used to serve membership queries. If a particular object is currently cached, it fetches it from the cache and serves the query. And if it's not currently in the cache, it fetches it from the network, stores it locally on the storage, and then continues the operation from there.

Now, this was a problem that was well studied during the 1995-2000 period. So why reopen it, right? The reason this problem is important again is that disk capacity is outpacing memory capacity, as I said at the beginning of the talk. This means there's a problem of the cache index overflowing onto the disk itself and reducing the performance of these systems. I'll talk more in detail about what these trends are with respect to disk capacity increasing beyond memory capacity, and point out how this actually leads to problems for cache indexes.

So here's how different disk properties have changed over the last 32 years. The largest disk you could obtain in 1980 was 80 megabytes in size, and today you can get a single [indiscernible] three terabyte hard drive. That's about 100,000 times better, with capacity almost doubling every two years over the last 32 years. The seek latency, unfortunately, only improved by about 2.5 times, and seek latency directly translates into the number of seeks you can do, so the IOPS haven't gotten much better, but the capacity certainly has. This rapid increase in capacity means that larger indexes are needed to index all this disk, and the bad seek latency means the index cannot overflow onto the disk, because if the index overflows onto the disk, you're essentially reducing the IOPS, the performance of the system itself: now you will be using one seek for the cache lookup and another for serving the actual request. The obvious question is, why do you need these big caches? Why do you need a three terabyte cache, right?
So if you look at how data has been growing, there's a survey done by Cisco last year amongst customers who were using [indiscernible], and they found that raw unstructured data, like email, documents and photos, has been increasing by 40 to 60 percent every year. This means that the number of objects that need to be stored, and that need to be cached, is also increasing every year. And data is also being accessed more. There's this new informal law, Zuckerberg's law, which states that the number of pictures being viewed on Facebook is doubling every 18 months, which means that people are not just creating this data, they're also accessing more of it more frequently. It's not just that they're putting it on disk and never accessing it. So performance actually matters for these systems, and that's the reason why index operations must be fast: you need to be able to answer membership queries [indiscernible] really fast.

And the third trend, in terms of DRAM price or capacity limitations causing indexing problems, is the following. This is data from 2011. If you look at the price of a server and vary the amount of DRAM in the server, the cheapest server you can get with 64 gigabytes of DRAM is probably not more than a thousand bucks. But the cheapest single system image with two terabytes of DRAM is actually about a quarter million dollars, and the increase is super linear. And it's not just initial cost but also cost of ownership, because high density DRAM increases [indiscernible] costs super-linearly, and the logic boards necessary for higher density DRAM in a single system image are also more expensive.

So once we have all these bottlenecks, the question is can we improve indexing efficiency, given that data itself is increasing and we also have this disk [indiscernible] problem. The first system, as I said, is going to be a purely DRAM-and-disk story, which tries to address these growing gaps by reducing the size of the indexes themselves, so that with a limited amount of DRAM, just a few kilobytes or megabytes, you can index terabytes of hard drive.

So these are the high level goals for HashCache. Interface-wise, it looks like a simple key-value cache. Your keys and values can be of varying lengths, and it also needs a good cache replacement policy, as any cache does: the better the replacement policy, the better the hit rate, and the better the performance is going to be. Those are the requirements from an interface perspective. From the perspective of solving the problem I'm talking about, we need to reduce indexing overhead, as I said. We need to answer the question: is it actually possible to index terabytes of storage with a few kilobytes or megabytes of DRAM? And is it also possible to trade memory for speed, meaning the more memory you add to the index, the higher the performance should be? Are such indexes possible is the question HashCache asks.

So before delving into what HashCache actually is, I'm going to give you a brief overview of how people build LRU caches. There are various functionalities [indiscernible] a cache. Existence identification is one: you need to be able to tell if a certain element belongs to the cache.
People build hash tables with chaining pointers and a hash value to figure out if something belongs to the cache. For implementing a replacement policy, they have LRU pointers across the entire data set. The location information is the place on the storage where the actual object resides, and there's other metadata related to the objects themselves. I'm going to talk about two systems for comparison. One is a popular open source cache called Squid, which needs about 66 bytes per object for its index. [indiscernible] actually goes into a 20-byte hash value, and the 20-byte hash value itself is used for deriving a storage location on the file system, so they don't have a separate requirement for storing the location itself. Commercial [indiscernible] systems try to improve on this: they store a much smaller hash value in memory, as opposed to storing a large hash value, and resolve collisions on the storage itself, and the memory requirement comes down to about 24 bytes per object.

So in HashCache, we try to rethink cache indexing itself. The requirement, as I said, is that the index needs a very small amount of memory while indexing large storage, and we design memory and storage data structures for network appliance caches for this purpose. It's an in-memory index that can be used for network appliance caches like HTTP caches and WAN accelerators. It's about six to 20 times more DRAM efficient than state of the art indexes with good replacement policies, and it was selected as one of the top ten emerging technologies of 2009 by MIT Technology Review magazine.

The approach of HashCache is the following. The bulk of the memory usage in these systems actually goes into storing pointers: pointers for implementing hash table collision chaining and pointers for implementing least recently used replacement policies. So the approach inside HashCache is to get both of these things, collision control and a replacement policy, without actually using any pointers.

So here's an overview of the HashCache techniques. There are eight policies in the paper, and we're going to talk about three of them in the talk, but four of them are shown here. The Y axis is indexing efficiency; you could think of it as how efficiently they're using [indiscernible]. The X axis is performance. As you can see, there are variants of HashCache that can match the performance of the open source system while being ten times more memory efficient. And there are also versions of HashCache that are five times more memory efficient than state of the art commercial systems while still matching their performance.

>>: So is the [indiscernible] for index efficiency the reciprocal of the bytes per object stored shown previously?

>> Anirudh Badam: It's given in terms of the number of index entries that you can store per dollar.

>>: So is that equivalent to what I said, or is that --

>> Anirudh Badam: [indiscernible].

>>: So can I read that factor of ten to mean you're storing 6.6 bytes per index object?

>> Anirudh Badam: The ten times more essentially means that for a given amount of DRAM, HashCache can store ten times more entries.

>>: Okay.
>> Anirudh Badam: So that means the index can be ten times larger for a given amount of DRAM.

>>: Ten times more objects for ten times --

>> Anirudh Badam: Ten times more objects.

>>: Okay.

>>: The two aren't equivalent. Adding more disk space is very different from adding a large number of objects.

>> Anirudh Badam: It depends on the objects; as a sufficient --

>>: [indiscernible] 1K objects is a much harder problem than caching 100 million ten meg objects.

>> Anirudh Badam: Absolutely. So I'll talk about varying object size distributions. The first part of the talk is going to be about an object size distribution which is essentially HTTP based, and the second part is going to talk about object size distributions that are more conducive to network acceleration, essentially network content fingerprinting, and also storage content fingerprinting, which is useful for deduplicators.

So here's the first policy, HashCache Basic, which basically shows how to [indiscernible]. Let's say you have a URL, and the data corresponding to this URL is these four blocks of data that I've shown you. What HashCache does is take an H-bit hash value of the URL using some hash function, and it uses a part of the storage system as an on-disk hash table: it takes N contiguous blocks on the magnetic hard drive and uses them as a hash table. It takes the hash value modulo the number of bins and obtains a value T, let's say, and in the T-th block of the hash table it stores the first block of the data. As I said, HashCache has a requirement of supporting objects of varying sizes, so the question, naturally, is where do you put the rest of the data? If all the objects were the same size, you could have essentially used your whole disk as a single large hash table. So what HashCache does is use a large portion of the magnetic hard drive as a circular log and store the remaining part of the data in the circular log. The first block, in the N-contiguous-block hash table that we call the disk table, stores a pointer to the log where the remaining data is actually stored.

As you noticed, this requires no bits per entry in DRAM, because the hash function can be represented in a constant number of bits, and there is no per-object DRAM requirement for storing anything. You just compute the hash value and use the disk as a hash table. The advantage, as I said, is no index memory requirement, and it's tuned for one seek for most objects: if you design this disk table with a bin size such that, let's say, 70 percent of the objects are smaller than the bin size, then 70 percent of your objects can be accessed in one seek on the disk. The disadvantages are that it's actually one seek per miss, and misses for any cache are not going to increase your performance; they're just a nuisance that increases latency without helping application performance. This sort of system, which uses the disk as a hash table, would be wasting a seek on every miss, when seeks help application performance only on hits. And there's no collision control: collision control is implicit in the sense that if two URLs have the same hash value, they replace each other. There's also no cache replacement policy; the replacement policy is also implicit in terms of hash collisions.
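To make the Basic policy concrete, here is a minimal lookup sketch. The block size, bin count, hash function and all names are illustrative assumptions for this sketch rather than details of the actual HashCache implementation.

```c
/* Minimal sketch of the HashCache Basic lookup described above: the URL's
 * hash picks one bin of the on-disk table, and one seek (pread) fetches it.
 * Block size, bin count, the hash function and all names are illustrative
 * assumptions, not the actual HashCache code. */
#include <stdint.h>
#include <string.h>
#include <unistd.h>                /* pread */

#define BLOCK_SIZE 4096            /* one bin of the disk table = one block */
#define NUM_BINS   (1ULL << 28)    /* ~1 TB disk table at 4 KB per bin      */

/* Any reasonable URL hash works; FNV-1a is shown as an example. */
static uint64_t hash_url(const char *url) {
    uint64_t h = 1469598103934665603ULL;
    while (*url) { h ^= (unsigned char)*url++; h *= 1099511628211ULL; }
    return h;
}

/* One lookup costs exactly one seek: read bin (hash mod NUM_BINS) and check
 * whether it currently holds this URL.  There is no in-memory index at all;
 * collisions and replacement are implicit, because a new object simply
 * overwrites whatever lived in its bin. */
static int basic_lookup(int disk_table_fd, const char *url,
                        char block[BLOCK_SIZE]) {
    uint64_t bin = hash_url(url) % NUM_BINS;
    if (pread(disk_table_fd, block, BLOCK_SIZE,
              (off_t)bin * BLOCK_SIZE) != BLOCK_SIZE)
        return -1;                           /* I/O error */
    /* The bin stores the URL (or a digest of it) plus the object's first
     * block; larger objects continue in the circular log at an offset
     * recorded inside the bin. */
    return strncmp(block, url, strlen(url)) == 0;   /* 1 = hit, 0 = miss */
}
```

A store follows the same path: hash the URL, overwrite that bin with the object's first block, and append any overflow to the circular log.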
So I suggest the following improvements. The first improvement we made to this technique is collision control. Instead of using a simple hash table, we use a set-associative hash table on the disk, with a set size of, let's say, T, which is configurable in HashCache. That's the first optimization of the basic policy. The second is that we would like to avoid seeks for misses. For this purpose, we store a small, low-false-positive hash value of each URL in memory. The location inside the set-associative hash table itself already gives you a fair number of bits of the hash value, so you don't need to store all the hash bits; you only need to store a very small number of additional bits. And the third thing we do is add a replacement policy, an LRU replacement policy. We do LRU within each set of size T, and for representing the LRU rank of each element in a T-sized set, we only need log T bits per entry, as opposed to the [indiscernible] needed for representing global LRU information. This means the replacement policy is not global but local to each set, and I'll talk about how local LRU replacement policies work well in practice when you have varying kinds of cache [indiscernible].

So here's a graphical representation of how this HashCache policy, called SetMem, works in practice. You have the same input: the URL and its data of four blocks. You hash the URL, take it modulo the number of sets in your hash table, and get a value T. As I said, there are two hash tables: one is an in-memory hash table that stores the hash values, and one is an on-disk hash table that stores the first block of the data. In the T-th set of the in-memory hash table, what you store is a low-false-positive hash, and you also store the rank bits; the rank bits represent the LRU rank of each element within that set. The first block of data is stored in the on-disk hash table, the rest of the data goes onto the log as in the HashCache Basic policy, and the on-disk hash table stores a pointer to the log. As I said, the LRU is now within a set as opposed to being a global LRU policy. In practice, we need about 11 bits per entry for implementing this caching policy.

The advantages are that there are no seeks for most misses: most misses can be answered using the low-false-positive hash value, and if the hash does not exist in memory, you need not go to the disk. It is optimized for one seek per read, which are the hits, and one seek per write, which is when you take a miss, fetch the object from the network, and want to cache it, so you write it into the cache. And it's a good cache replacement policy in just 11 bits. The disadvantage is that writes still need seeks. If you look at any system that tries to optimize operations for magnetic disks, you'd like to coalesce all your writes and write them at the end of a log, essentially maintaining the disk as a log-structured system, so that most of your random seeks are used for reads as opposed to being used for writes. So we propose the following improvements to HashCache.
The first thing we do is avoid seeks for writes: write everything in a log and eliminate the disk table. There's no on-disk hash table. The natural question is where the pointer to the log goes. The log location is actually stored in memory itself. This is not a regular in-memory pointer; it's a pointer to a storage location, so it's essentially location information. The second optimization is to avoid seeks for related objects. Let's say one of you goes to the main page of Microsoft and fetches the web page, and then a second person goes to the main page of Microsoft and fetches the content. The first person who downloaded the content would be doing a lot of random seeks fetching all of it; there's no reason why the second person should also incur [indiscernible] random seeks. If objects on the Microsoft page are related to each other and will be used at the same time, it makes more sense to store them contiguously on disk, so we do this optimization as well. Whenever you fetch content and the application expresses that this content will be accessed together, you actually store the data contiguously on disk. That's also an optimization that HashCache supports.

So here's the graphical representation of the HashCache Log policy. You take the hash value, take it modulo the number of sets, and get a value T. There's only an in-memory hash table; there's no on-disk hash table. The on-disk data structure is a simple log, and the LRU is within the set. In the T-th set of the in-memory hash table, you store a low-false-positive hash, you store the rank bits necessary for representing the LRU information, and you also store the location bits. The location bits essentially store the pointer to the location in the log where the data resides. In our implementation, we needed about 43 bits per entry for the HashCache Log policy.

Now let's move on to evaluation. For evaluating HashCache, we used a system called Web Polygraph. It's the de facto testing tool for web proxies: it tests hit rate, latency and transactions per second for a cache of a given size and a given hit rate. And we tested all these policies.

>>: In the previous slide, you said you use a hash value, you look in the in-memory table, [indiscernible] disk. Haven't we essentially gone back to having a hash table with --

>> Anirudh Badam: Sure. But instead of having pointers for storing LRU information and for chaining, we don't have any pointer requirements, which means the memory overhead is quite low, and the original point, that more memory should lead to higher performance, actually carries over. HashCache Basic does not have any memory requirement, but its performance is lower because it uses seeks for misses. HashCache SetMem, on the other hand, stores a very small false-positive hash value in memory and gets you slightly higher performance. HashCache Log gets you the best performance but uses the most memory. So you get that trade-off between memory requirement and performance, but all three are dramatically more memory efficient than state of the art open source and commercial systems -- almost 6 to 20 times.
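To visualize those per-entry budgets, here is an illustrative layout of the in-memory entries. The exact bit split is an assumption made for this sketch (the talk only gives the roughly 11-bit and 43-bit totals), and a real implementation would pack entries into bit arrays rather than padded C structs.

```c
/* Illustrative layouts for the in-memory index entries described above.
 * The bit split is assumed; only the ~11-bit and ~43-bit totals come from
 * the talk, and real code would pack these into bit arrays. */
#include <stdint.h>

/* HashCache SetMem: the set number already encodes most of the URL hash,
 * so an entry only needs a small low-false-positive tag plus the LRU rank
 * of the entry within its T-way set. */
struct setmem_entry {
    unsigned tag  : 8;   /* low-false-positive hash bits                 */
    unsigned rank : 3;   /* LRU rank inside an 8-way set (log2(8) bits)  */
};                       /* ~11 bits of information per cached object    */

/* HashCache Log: the on-disk hash table is gone, so the entry also has to
 * record where the object lives in the on-disk log. */
struct log_entry {
    unsigned tag  : 8;   /* low-false-positive hash bits                 */
    unsigned rank : 3;   /* LRU rank within the set                      */
    uint32_t log_off;    /* block offset of the object in the log        */
};                       /* ~43 bits of information per cached object    */
```

Contrast this with the roughly 24 to 66 bytes per object of the pointer-based indexes mentioned earlier.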
And we compared all variants of HashCache with Squid and Tiger, which are open source and commercial systems respectively. Tiger is the proxy of a commercially deployed CDN that is in use today. Our test box has a two gigahertz CPU, 4 GB of DRAM and five 90 gigabyte hard drives. The paper actually tests hard drives as large as four terabytes as well, but here I wanted to show the performance implications of using multiple disks and trying to get as many IOPS as possible inside a single HashCache system.

So this is how the indexing efficiency of HashCache looks. The X axis is essentially the size of the largest disk that can be indexed using a gigabyte of memory for various caching techniques. Open source and commercial systems are barely able to index a 100 gigabyte hard drive using one gigabyte of DRAM. HashCache Basic can potentially index infinite amounts of disk because it has no per-object memory requirement. HashCache Log indexes about 1.5 terabytes of hard drive and HashCache SetMem about six terabytes of hard drive for a single gigabyte of DRAM.

>>: This is assuming some standard object size distribution?

>> Anirudh Badam: Right. The object size distribution comes from Web Polygraph, which was used by commercial proxies during the 2000 period, and they were getting their distributions from proxies deployed across many [indiscernible] and Cisco systems.

>>: And do you have some sense for, in today's web, how much increase in hit rate you get from indexing a larger disk?

>> Anirudh Badam: Absolutely.

>>: [indiscernible].

>> Anirudh Badam: So we originally started this project trying to see if we could provide offline-Wikipedia sort of systems to places like Africa and India. In that setting, offline sort of removes the scope for user participation, so that's when we thought, okay, the right match would be something which can hold a large amount of content offline but also allow some amount of participation. That's when we decided large caches actually make a lot of sense in those settings, and we went and deployed some of those systems in Africa. We populated them with a large number of Wikipedia entries, so the box not only serves as a cache but also as an offline Wikipedia, an offline encyclopedia sort of thing. In those scenarios, it is useful.

The second scenario in which it is useful: for static content you see this heavy tail, but when the content itself is increasing in size and you want to do content fingerprinting with caching, when you want to do dynamic caching [indiscernible] storage out of [indiscernible], you actually see a higher benefit. There's an IMC paper from last year by a group-mate of mine, where he mined the logs of a system called CoDeeN that my advisor runs; it's a wide area CDN deployed on top of PlanetLab. He did some experiments on what happens when you do content-based caching as opposed to static HTTP caching, and he saw that as the disk size increased from one terabyte to two terabytes, the hit rate actually increased from 16 to 27 percent. So you don't see that long tail [indiscernible] for dynamic content the way you do for static content. But the application scenario originally was the offline Wikipedia setting in developing regions.

>>: [indiscernible] the Web Polygraph thing, those object sizes and distributions were based on research done in 2000?

>> Anirudh Badam: Right.
>>: So the web's changed insanely in ten years.

>> Anirudh Badam: I'm going to come to that with the next system, which actually takes the characteristics of the modern web into account. There was also an IMC paper last year from a group-mate of mine that shows how the web has changed since the web 1.0 days, how object size distributions have changed, and how that actually affects the performance of these sorts of systems.

So in terms of memory usage, here is what HashCache does. With just 80 megabytes of memory, it is about five times more memory efficient than the state of the art commercial system and is still able to perform comparably. And there's also a version of HashCache which is probably 80 times more memory efficient than an open source system and still performs comparably. The memory requirements are normalized to the highest-performing HashCache configuration, which is 1X, and the memory requirements of HashCache SetMem and HashCache Log are shown relative to that.

>>: As I look at this, you're measuring performance in requests per second per disk. But I haven't seen how you handle multiple disks.

>> Anirudh Badam: We simply stripe blocks across multiple disks. Because it's essentially a hash-based index, you can do some sort of trivial load balancing across multiple disks; you can even stripe objects across multiple disks and try to get performance in parallel as well.

>>: Okay.

>> Anirudh Badam: So we've had some deployments in Africa. We still have a deployment going on in Uganda, and these are photographs from those deployments.

The question is how scalable this model is: if you have a purely DRAM-and-disk system and you optimize your data structures to bridge these gaps, for how many applications can you do this, and is it even possible for all applications? When do you run into the actual limitation that an application simply needs more DRAM for performance? A scenario where this can happen is when the object size is much smaller, and even with a very nice technique like HashCache you do not have enough DRAM to store the whole index. If your object size is, let's say, 128 bytes or 256 bytes, if you're doing aggressive network caching or aggressive deduplication for your storage system, this could happen -- very small object sizes. For these scenarios, you still need indexes that could be much larger than what your DRAM can hold.

So as I said, these disk properties have gotten better in some ways. But look at how much time you need to exhaust an entire disk using random seeks of four or eight kilobytes each: back in 1980, reading the 30 megabytes took 600 seconds, and doing the same for today's largest disk takes about 270 days. That's almost [indiscernible], right? What this means is that even though capacity has increased, random seeks have not kept up: this is 40,000 times worse today, and the problem does not go away if you use faster disks. Even if you have 15,000 RPM disks, you probably only get twice as many random seeks per second.
And there's also a flip side to using a faster disk: the largest capacity disk you can get with a higher speed spindle is much smaller, because it's a mechanical device and the motors can only do so much. So I'd like to reproduce this graph from the beginning of the talk, where there's a super linear increase in the cost of having more DRAM in a single system image, but there's a nice price point here: there's this [indiscernible] technology, flash memory. If you look at high speed flash memory devices like Fusion-io, you can get a ten terabyte SSD today at a price point at which you can't even get a single terabyte of DRAM inside a single system image.

So the second part of this talk is going to be about new memory technologies like flash, and here's a small primer on flash memory. You pay about ten dollars a gigabyte, but if you're buying in bulk you can probably get five to eight dollars as well. You get about 1 million random IOPS per PCIe bus, and if you have multiple buses in the system, like a [indiscernible] machine, the IOPS actually scale with the number of buses. It's also very high density: you can fit as much as 20 terabytes of flash in a single rack unit, and this also scales with the number of buses -- the more PCIe slots you have, the more of this flash you can fit in a single rack unit. The advantage is that there's no seek latency, only read and write latency. There's also a flip side, which makes things more interesting, I guess: writes happen after an erase, and these erases are usually of 128 kilobyte blocks. The number of erases you can do is also limited -- you can probably only do about 10,000 [indiscernible] on today's state of the art NAND flash devices. So reducing the number of writes helps both the performance and the reliability of the device; if you want the device to last longer, you want to do as few of these erases as possible.

So now the question is, given this technology which falls between the other two in terms of performance and capacity, how do you want to use it? Do you want to use it as a fast disk, or do you want to use it as slow memory? Here's a simple experiment that we did. We took MySQL [indiscernible] and ran MySQL in a box. The first bar over there shows you the performance of TPC-C when you don't have any flash in the system; that is normalized to 1. Now let's say you add some amount of flash to the system, use it as a transparent block cache for the magnetic hard drive, and run the TPC-C benchmark on it, with no source code modified. You get about 55 to 60 percent performance improvement. Now take the same system and, instead of using the NAND flash device as a transparent block cache for the magnetic hard drive, use it as slow memory: you configure MySQL to use the flash memory device for its buffer pool, and you get much higher performance. This is very [indiscernible]. In either case, no source code is modified; in one case you are using flash as disk and in the other you are using it as memory.
The reason this happens is that the data structures you build for memory are more flash aware in general, while the kinds of optimizations that have been done for [indiscernible] do not apply to flash memory and add latency in software. That leads to these overheads when using flash as a slow disk as opposed to using it as memory. So that's the reason my proposal is to use flash as slow memory rather than as a fast disk.

If you look at the state of the art in how one could use a flash memory device as memory, it is to use it as a swap device or to mmap a file on the device, and there you go -- your memory is now backed by the flash memory device. The advantage is that applications which are using [indiscernible] mmap need not be modified; they simply have a large amount of memory that happens to be flash backed. But the disadvantage is that it does not address SSD problems. It is not flash aware, it does not do any write optimizations, and it leads to a lot of random writes, because random writes are not a problem for traditional physical memory technology, which is DRAM. To address these issues, what people usually do is [indiscernible] the application: they optimize the application for writes on top of flash memory, and the disadvantage is that you not only need application expertise, you also need SSD expertise, because you're modifying the application to be more flash aware and flash friendly.

So the premise we had was: can we get the best of both worlds? That is, the application need not be modified -- that's the ideal case -- or at least the number of modifications to the application is minimized, but you're still optimizing for writes. This is the premise behind SSDAlloc, and we tried to solve this problem.

We started off with a simple, non-transparent approach. We created a custom object store on the flash memory device, a log-structured object store, which simply gives you custom pointers, handles, to create, read and write the objects you are creating on top of the SSD, and you read and write objects as necessary. Because the SSD is managed as a log-structured object store, it not only minimizes the number of erases on the SSD but also optimizes for writes, because writes are now done to objects as opposed to blocks or pages, like virtual memory systems do. But the disadvantage of this approach is that it still needs heavy modifications: you need to know where the objects are in your system and appropriately read them from and write them to the SSD. For example, when we tried to convert HashCache to this object store model, where we were storing each set of HashCache as an object on the SSD, we needed [indiscernible] lines of code to be modified within HashCache. That was about five percent of the entire HashCache -- just the indexing part, not the HTTP proxy part. That's a fairly large number of lines of code. I built HashCache myself, so it was easy for me, but for someone who is not familiar with the application, and not familiar with the SSD, it might take much longer and be much harder.
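To give a feel for what that non-transparent interface looks like, here is a hypothetical sketch of such an object-store API. The names and signatures are assumptions for illustration only; they are not the actual interface from this work.

```c
/* Hypothetical interface for a non-transparent, log-structured object
 * store of the kind described above.  These declarations are illustrative
 * assumptions, not the real API used in the SSDAlloc work. */
#include <stddef.h>
#include <stdint.h>

typedef uint64_t obj_handle_t;     /* opaque handle, not a raw pointer     */

/* Create an object of `size` bytes; it is appended to the SSD's log. */
obj_handle_t obj_create(size_t size);

/* Copy the object's current contents into a caller-supplied buffer. */
int obj_read(obj_handle_t h, void *buf, size_t size);

/* Rewrite the object: the new version is appended to the log and the old
 * one is garbage-collected later, so device writes stay sequential. */
int obj_write(obj_handle_t h, const void *buf, size_t size);

/* Mark the object dead so its log space can be reclaimed. */
void obj_free(obj_handle_t h);
```

The burden on the application is visible here: every place that used to dereference a pointer has to be rewritten as an obj_read or obj_write against a handle, which is the kind of change that drove the roughly five percent code modification in HashCache mentioned above.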
So that's when we thought of this technique: can we use the virtual memory hardware? Can we use pointers themselves to figure out which objects are being written to and which objects are being read from? Can we use virtual memory addresses as the object handles? That's when we had this idea. You could use page faults to detect reads and writes -- which data is being accessed in the system -- and on a fault, you materialize the data from the SSD and manufacture a page, and you move idle objects to the SSD's log. But the problem is that pages are 4 KB in size. No modern virtual memory system gives you access information at a granularity finer than four kilobytes. There were systems which did this at one kilobyte and smaller; they do not exist anymore, and on all the [indiscernible] systems, the smallest page size you can configure is four kilobytes.

So then we had this wild idea of aligning objects to page boundaries. Instead of having contiguous virtual memory allocations, why not give each object its own page, and the virtual memory system itself will tell you which page is being modified. If you have this model, called one object per page, you know precisely which object is being modified. The rationale behind this crazy idea is that virtual memory is cheap: you have a huge [indiscernible] of virtual memory space, which is much larger than the physical memory in the system. But physical memory is not cheap, so you still need to build the system such that even though there is frivolous use of virtual memory, there is no waste of physical memory. That's the technique in SSDAlloc.

>>: Virtual memory is not that cheap either. I'm wondering what one object per page does to your [indiscernible].

>> Anirudh Badam: Absolutely. So the premise we had in SSDAlloc is that it's for improving the performance of network appliance caches, and network appliance caches are usually bottlenecked by network latency and other aspects as opposed to being CPU bound. For these sorts of systems, let's say we're talking about something like memcache. Inside memcache, you have a hash table which holds values. If you create your values using SSDAlloc and you create your hash table using malloc -- the good thing about SSDAlloc is that it coexists with malloc -- so if you have some way of arranging for your frequently accessed data to come out of malloc and your I/O-intensive and network data to come out of SSDAlloc, you will not see many TLB misses. The TLB misses will be roughly equal to the IOPS of the memcache box itself, which most modern CPUs can handle fairly well. I'll come to the overhead of SSDAlloc in more detail.

So here's an overview of how SSDAlloc works. Let's say you have a memory manager like malloc -- in this case, it's SSDAlloc -- and you create 64 objects of size 1 kilobyte each. What a traditional malloc memory manager would do is take 16 four-kilobyte pages, split them into 64 one-kilobyte objects and return those pointers to the application. What SSDAlloc does is create 64 different four-kilobyte virtual memory pages and return 64 pointers, one per page, for these one-kilobyte blobs of data. The remaining three kilobytes in each of these virtual memory pages are never utilized for future allocations.
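Here is a minimal sketch of that one-object-per-page allocation, assuming a hypothetical ssdalloc() built directly on mmap; the real SSDAlloc also registers each page with its SSD-backed object store and starts pages out protected, which the comments only note.

```c
/* Minimal sketch of the one-object-per-page allocation policy described
 * above.  ssdalloc() is a hypothetical stand-in; the real system would
 * also track the object in its SSD-backed store, which is omitted. */
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096

/* Each allocation gets its own 4 KB virtual page; the object lives at the
 * start of the page and the rest of the page is never handed out again.
 * Virtual address space is cheap, so the waste is only virtual. */
void *ssdalloc(size_t size) {
    if (size > PAGE_SIZE) return NULL;          /* large objects omitted  */
    void *page = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) return NULL;
    /* In the full system this page would start out PROT_NONE so that the
     * first access faults and the object is materialized on demand. */
    return page;                                 /* object == page start  */
}

int main(void) {
    /* 64 one-kilobyte objects -> 64 separate virtual pages, not 16. */
    void *objs[64];
    for (int i = 0; i < 64; i++) {
        objs[i] = ssdalloc(1024);
        if (objs[i]) memset(objs[i], 0, 1024);
    }
    return 0;
}
```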
Clearly, this frivolous use of virtual memory can lead to wasted physical memory, so what we propose is a very small page buffer. We restrict the number of virtual memory pages that are materialized at any point of time to a very small quantity -- let's say five percent of your DRAM holds virtual memory pages that are resident in core. The remaining 95 percent, the remaining large portion of DRAM, is used as a compact object cache, where the objects you create using this memory allocator are stored compactly behind a custom hash table interface. And because the virtual memory system tells you which page has been modified, you know precisely which objects have been modified, which means you can use the SSD as a log-structured object store. That is the rationale behind, and this is how, the SSDAlloc system works.

To implement the system, all the virtual memory that is allocated to the application is protected with mprotect, and when the application tries to access any of this data, it takes a fault. In the fault handler, we figure out where the object is: is it currently in the RAM object cache or on the SSD? If it's in the RAM object cache, we manufacture the virtual memory page immediately and return control back to the application from the fault handler. If it's not, we fetch it from the SSD, cache it in the RAM object cache, manufacture a page, put it in the page buffer and return control back to the application.

So, coming back to Andrew's question, this is the overhead in SSDAlloc. We took a 2.4 gigahertz quad core CPU with about an 8 megabyte L2 cache. Just a TLB miss -- not a miss that goes to DRAM, but just for the hardware to figure out that this is a TLB miss -- takes about 40 nanoseconds. The translation in SSDAlloc, which translates virtual memory addresses into SSD locations, is about 46 nanoseconds. Page materialization, which takes a new physical memory page and copies the data from the RAM object cache into it, takes about 138 nanoseconds. The combined overhead of the whole process of manufacturing a page on a fault is about 800 nanoseconds, and the bulk of the overhead is still NAND flash latency, which is about 30 to 100 microseconds. So if your application performance is bottlenecked by the random IOPS of the NAND flash device, you're not really bottlenecked by the TLB, because the NAND flash is probably doing somewhere between 300,000 and 400,000, even a million, IOPS, and in those cases the TLB is not going to be the bottleneck.

So here's a results overview. We modified four different systems: memcache, a Boost library B+ tree, a packet cache, which essentially caches packets flowing through a network to do network [indiscernible], and we also modified HashCache. The modifications required only 9 to 36 lines of code. Some of them are simple [indiscernible] to SSDAlloc, but some of the systems are more complex. For example, memcache, instead of mallocing individual objects, creates its own memory allocator: each time, it allocates a large two-megabyte or [indiscernible] region and runs its own allocator on top of it. To modify such systems, we needed more [indiscernible] lines of code. In terms of performance, SSDAlloc obtained up to 17 times more performance than unmodified Linux swap, and up to 4.5 times more than a log-structured swap.
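To make the fault-driven path described above concrete, here is a stripped-down sketch using mprotect and a SIGSEGV handler. The object lookup is reduced to a stub, the page buffer and SSD log are omitted, and all names are illustrative rather than taken from SSDAlloc.

```c
/* Stripped-down sketch of the fault path: application pages are kept
 * PROT_NONE; touching one raises SIGSEGV; the handler materializes the
 * page and returns control to the faulting instruction.  The lookup in
 * the RAM object cache / SSD log is reduced to a stub. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096

static void fill_from_object_store(void *page_addr) {
    /* Stub: the real handler would copy the object's bytes from the RAM
     * object cache, or fetch them from the SSD's log, into this page. */
    memset(page_addr, 0xAB, PAGE_SIZE);
}

static void fault_handler(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    void *page = (void *)((uintptr_t)si->si_addr & ~(uintptr_t)(PAGE_SIZE - 1));
    mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);  /* make it usable  */
    fill_from_object_store(page);                       /* materialize it  */
}

int main(void) {
    struct sigaction sa = { .sa_sigaction = fault_handler,
                            .sa_flags = SA_SIGINFO };
    sigaction(SIGSEGV, &sa, NULL);

    char *obj = mmap(NULL, PAGE_SIZE, PROT_NONE,        /* protected page  */
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    printf("first byte: %d\n", obj[0]);                 /* faults, resumes */
    return 0;
}
```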
Clearly, the native Linux swap is not a good comparison point, because it was not built for flash memory devices; it was originally built for disks, and in the disk world, once you started swapping, performance really did not matter -- it was more like a barrier before killing an application, a worst-case-performance sort of thing. That's why we also modified Linux swap to be log structured, storing the pages in a log-structured fashion, and showed it as a comparison point. The reason SSDAlloc performs 4.5 times better than even this modified, log-structured swap is that a write is restricted to object sizes, as opposed to marking data dirty at page granularity, which means you write much less data for random writes. Most of the systems we modified are hash based, which means both the writes and the reads are going to be randomized, and our writes are restricted to object sizes; we actually cut off the minimum object size at 128 bytes.

You might be wondering whether, with the one-object-per-page model, not only the TLB but also your page tables are going to get larger. If you have, let's say, two-byte or five-byte objects, you don't want to create a page table entry for a five-byte object. So what we do is cut off the object size at 128 bytes, and if you have an object requirement of less than 128 bytes, we allocate it out of 128-byte chunks: we build a coalescing allocator on top of these 128-byte chunks, and any object smaller than 128 bytes is created from that allocator, which means the smallest object size you can have in SSDAlloc is 128 bytes. And if you have random writes, it means SSDAlloc is going to write up to 32 times less data to the SSD, because a standard virtual memory system would be marking data dirty at four-kilobyte page granularity.

So here are some microbenchmarks to showcase the benefits of SSDAlloc. We took a 32 GB array of 128-byte objects, created these objects using SSDAlloc, and also using malloc on unmodified Linux swap and on log-structured Linux swap, and we varied the mix of reads and writes in the system, starting with all reads and going to all writes. If you look at the throughput, the number of requests per second in thousands -- these are normalized performance increases -- SSDAlloc performs anywhere between about 2.5 and 14.5 times better than unmodified swap. As the fraction of writes increases, SSDAlloc's advantage grows; this is [indiscernible] because SSDAlloc needs to write much less and does sequential writes, as opposed to the random writes that unmodified swap does. And if you look at SSDAlloc's benefits over log-structured swap, the benefits again come from having to write much less data for random writes, because our writes are restricted to object sizes as opposed to four-kilobyte pages. We verified these results against five different SSD vendors in the paper; I encourage you to read the paper for the detailed set of results.
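The shape of that microbenchmark can be sketched as follows, reusing the hypothetical ssdalloc() from the earlier sketch and scaling the array down so the example is self-contained; the sizes, the ratio sweep and the helper names are assumptions, not the actual benchmark harness.

```c
/* Sketch of the kind of microbenchmark described above: an array of
 * 128-byte objects allocated one-per-page and hit with a configurable
 * read/write mix.  ssdalloc() is the hypothetical allocator sketched
 * earlier; the 32 GB array is scaled down so this runs anywhere. */
#include <stdlib.h>
#include <string.h>

#define NUM_OBJECTS (1 << 16)       /* scaled-down stand-in for 32 GB      */
#define OBJ_SIZE    128             /* smallest SSDAlloc object size       */

extern void *ssdalloc(size_t size); /* from the allocation sketch above    */

static void run_mix(char **objs, int n_ops, int write_pct) {
    char buf[OBJ_SIZE];
    for (int i = 0; i < n_ops; i++) {
        char *obj = objs[rand() % NUM_OBJECTS];
        if (rand() % 100 < write_pct)
            memset(obj, i & 0xff, OBJ_SIZE);     /* dirties one object only */
        else
            memcpy(buf, obj, OBJ_SIZE);          /* read                    */
    }
}

int main(void) {
    static char *objs[NUM_OBJECTS];
    for (int i = 0; i < NUM_OBJECTS; i++)
        objs[i] = ssdalloc(OBJ_SIZE);
    /* Sweep from all reads to all writes, as in the evaluation above. */
    for (int write_pct = 0; write_pct <= 100; write_pct += 25)
        run_mix(objs, 1 << 20, write_pct);
    return 0;
}
```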
And here are some benchmarks for memcache. We took a 30 gigabyte SSD and 4 GB of DRAM, and four memcache clients were sending requests to this memcache server. We modified the slab allocator of memcache, as I said, to use SSDAlloc, and these are the performance results. The X axis is the average object size inside memcache. As the average object size decreases, the performance advantage of SSDAlloc increases. But if you look at the performance of SSD swap and SSD log-structured swap, it does not actually depend on the object size, because regardless of the object size they mark the entire four-kilobyte page dirty, and they would still be reading an entire four-kilobyte page from the SSD even though only one object is required -- and this is for a random workload like memcache.

So here's some of the current work I've been doing, trying to optimize SSDAlloc. The first thing we are working on is rethinking virtual memory management itself. In SSDAlloc there's obviously this problem that 95 percent of your physical memory is available only after a fault, and if your working set can actually fit in DRAM, this is clearly a nuisance. For this purpose, we still want to maintain the one-object-per-page semantics, but we don't want the overhead of having to take a fault for accessing a huge portion of your DRAM. So what we do is the following: we create sparsely mapped virtual memory pages. Take these four virtual memory pages; each of them is sparsely mapped. The first one is mapped to the application in its first kilobyte, the second one in its second kilobyte, and the fourth one in its fourth kilobyte. These virtual memory pages are mapped to the application in non-overlapping portions, which means that just by using some page table magic they can actually share one physical memory page. You still get the advantage of detecting reads and writes at object granularity, but you're also able to back all of your physical memory with useful objects without taking any page faults.

>>: At the top, you've got a virtual memory page, but you're not using all of that page, only part of it.
So when they have to page out a page, they need to go to each individual process, scan through the virtual memory regions and see which of these pages 24 has not been recently used. Now, when they find such a page, the page [indiscernible] so this is a CPU intensive process. There's a scanning of all these virtual memory pages that is going on. So in TEAM, what we did is we tried to make it more virtual memory centric. We maintained virtual memory page usage statistics. So for all the virtual memory pages across all the process, we maintained statistics, clean statistics, and these statistics are actually used and this reduces the CPU requirement of implementing page replacement policies in our operating systems. And the second optimization we have done as part of our TEAM, which is TEAM stands for transparent expansion of application memory, by the way. The second optimization we have done in TEAM is that virtual memory page tables are already translating some sort of [indiscernible]. They're translating virtual memory [indiscernible] to physical memory [indiscernible]. And flash memory FTLs also do some sort of translation. They translate logical block [indiscernible] to physical block [indiscernible]. So there's two levels of indirection that is going on in the systems where you would like to use flash memory devices as a virtual memory -- by a virtual memory. So we tried to introduce this overhead and we implemented an FTL fully using virtually memory page tables and as a sparse FTL. And that's the second part of the work. There's a third thing that we're trying to do. Trying to introduce some sort of -- some amount of flash memory inside a cluster. And it's more like the cluster is implementing a block device. A block device comes from a cluster, and we are trying to do some sort of custom clearing within the cluster. So we're trying to do, what we're trying to do is let's say you have a block that is cached in DRAM on a particular machine. There's also a block that is cached in DRAM -- the same block is actually cached on DRAM of a different machine. Now, if these two systems are not talking to each other, they might make suboptimal decisions of how to implement page replacement policy. So we're trying to see if [indiscernible] can be done on a cluster basis where these systems are actually communicating with each other with respect to axis frequencies and they're doing some sort of collaborative tiering so you don't make these decisions independently from 25 machine to machine. So there's some ongoing work and this is some future work that I have. So there's some amount of work in maintaining page tables for multitiered systems that are using large amounts of virtual memory. Because we are talking in terms of secondary faults, high level faults, these multiple threads need to modify page table scaleability so there is some interesting work that can be done in lock less data structure type management of page tables inside operating systems. And there's also some work that we can do in terms of how to tier flash memory. Can look at two different models, one is the SAN model and the other is the Hadoop model. And somewhere in between as well where there are some nodes [indiscernible] which have more amount of flash memory than other nodes, which means that they're more [indiscernible] for data. And the third is in terms of reliability of transparent tiering system. 
If you have a transparent system and the data is corrupted in one layer, it will eventually be corrupted in all the layers, because the data still needs to trickle down. So we need to figure out methods by which these transparent tiering systems are more reliable and more secure, in terms of what kind of safety barriers you can [indiscernible] when managing the second-level tier, the flash-level tier.

So here's a summary of the talk. HashCache is a DRAM-efficient, seek-optimized cache index with an LRU-like replacement policy. It's about six to ten times more memory efficient than commercial systems and 20 to 50 times more memory efficient than open source systems, and it enables caching on large disks with only netbook-class machines. SSDAlloc is a complementary technique which is useful for building such hash tables when they grow to much larger sizes: you can use an SSD without having to rewrite your application, and the modifications are restricted to only the memory allocation portions of the code. In most of the systems we modified, that was only 6 to 30 lines of code, and the systems end up performing three to five times better than existing transparent ways of using flash memory as virtual memory. SSDAlloc also has the benefit of increasing SSD life by up to 32 times. Now I'll take questions.

>>: So you said that it performs a lot better than existing transparent techniques. Can you relate it to existing techniques that are SSD aware?

>> Anirudh Badam: Sure. I have an example for this. Let's say you are implementing a bloom filter. One way you could implement it is to take an existing bloom filter library that is built for DRAM and run it over a swap device. The second thing you could do is, instead of doing those random writes into the bloom filter at 4 KB granularity, do them at a much smaller granularity using something like SSDAlloc. That is clearly beneficial, but SSDAlloc is clearly not aware of what the application is trying to do. So if the application developer is aware of a better data structure that supports this sort of thing, they can implement that data structure not only using this interface, but also using the native storage interface. We propose using virtual memory as the interface for flash memory because if you look at new memory technologies like phase change memory and other memory technologies going forward, these technologies keep getting lower in latency, and it makes more sense to build these systems to use them as memory as opposed to using them as storage. So this is a complementary technique: it enables the use of flash memory in a wide variety of systems. But data structures are really still important; if your application developer can build a smart flash-aware data structure, you should do it.

>>: That makes sense. Can you give me an idea of whether flash-aware data structures are going to perform many orders of magnitude better than the transparent things? In other words, have you gotten most of the benefit, or only a fraction of the benefit you can get? If the core of the application is one or two data structures that can be rewritten, that can buy you an additional order of magnitude again. Transparency is nice, but really, when you're all done, you're going to want to actually use the custom [indiscernible] structure.
>>: So you said that it performs a lot better than existing transparent techniques. Can you relate it to existing techniques that are SSD aware?

>> Anirudh Badam: Sure. So I have this example. Let's say you are implementing a Bloom filter. One way you could implement a Bloom filter is to take an existing Bloom filter library that is built for DRAM and just use the SSD as a swap device. The second thing you can do is, instead of doing these random writes in the Bloom filter at a 4 KB granularity, do them at a much smaller granularity using something like SSDAlloc. This is clearly beneficial, but SSDAlloc is clearly not aware of what the application is trying to do. So if the application developer is aware of a better data structure which supports this sort of thing, they can implement that data structure not only using this interface but also using the native storage interface. Now, we propose using virtual memory as the interface for flash memory only because, if you look at new memory technologies like phase change memory and other memory technologies going forward, these technologies are getting lower in latency. It makes more sense to build the system to use them as memory, as opposed to using them as storage. So this is a complementary technique; it enables the use of flash memory in a wide variety of systems. But data structures are really still important. If your application developer can build a smart flash-aware data structure, you should do it.

>>: That makes sense. Can you give me an idea of -- I mean, are flash-aware data structures going to perform many orders of magnitude better than the transparent things? In other words, have you gotten most of the benefit, or only a fraction of the benefit you can get? I mean, if the core of the application is one or two data structures that can be rewritten, that can buy you an additional order of magnitude again. Transparency is nice, but really, when you're all done, you're going to want to actually use the custom [indiscernible] structure.

>> Anirudh Badam: So there's still a reason why a lot of people will spend a lot of effort going and building cache-line-aware software. Even in a purely DRAM world, you modify your data structures so that your caches are being used appropriately. So in terms of building a data structure appropriate for the underlying memory technology, those sorts of techniques will still be important. So coming back to the Bloom filter example --

>>: Are there additional orders of magnitude there, or --

>> Anirudh Badam: Coming back to this Bloom filter example, you could build a Bloom filter in the regular way using small page sizes, or you could build a Bloom filter such that all your hash functions actually resolve within the same block. You could do some additional randomization to make sure that all the hash functions the Bloom filter needs are in the same block. Now, this actually gives you a K times performance improvement, so depending on what the K of your Bloom filter is, you'll get such a benefit. So I guess this makes flash memory useful as memory, because building in-memory data structures is much easier than building on-disk data structures. In terms of development, it's going to reduce your overhead. Performance will still heavily depend on the kind of data structure you build and how flash aware it is.

>>: Thank you.
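A minimal C sketch of the "all K probes in one block" idea just described; the block size, hash function, and load_block helper are assumptions for illustration, not a specific published design.

    #include <stdint.h>
    #include <stddef.h>

    #define BLOCK_BITS  (4096 * 8)   /* bits per 4 KB flash block (assumed)  */
    #define NUM_BLOCKS  (1 << 20)    /* filter spans 4 GB of flash (assumed) */
    #define K           4            /* number of probes per key (assumed)   */

    /* Assumed helpers: a seeded 64-bit hash and a routine that reads one
     * 4 KB block of the filter from flash. */
    uint64_t hash64(const void *key, size_t len, uint64_t seed);
    uint8_t *load_block(size_t block_idx);

    /* One hash picks the block; the K probe bits are all derived inside
     * that block, so a lookup costs a single flash read instead of K. */
    int bloom_maybe_contains(const void *key, size_t len)
    {
        size_t   block = hash64(key, len, 0) % NUM_BLOCKS;
        uint8_t *bits  = load_block(block);
        for (uint64_t i = 1; i <= K; i++) {
            uint64_t bit = hash64(key, len, i) % BLOCK_BITS;
            if (!(bits[bit / 8] & (1u << (bit % 8))))
                return 0;            /* definitely absent */
        }
        return 1;                    /* possibly present  */
    }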
>>: Seems like most of the complexity in SSDAlloc comes from the fact that you want to support existing unmodified code, to the extent that somebody calls an allocator, gets back a pointer, and at any point in the future they can just go [indiscernible] with the state that that thing points to. So that means you need to track updates through the virtual memory system [inaudible]. And so two questions I have. One, am I correct in thinking that this is strictly worse than a system in which you have an explicit "access this object, can I get a copy of it, now save this object" interface?

>> Anirudh Badam: Um-hmm.

>>: Have you considered whether an API like that -- I'm thinking that it would not be very hard, either through compiler transformations or manual code rewriting or a little bit of hacking, to take some existing data structure and modify it to insert these calls to say, you know, acquire/release, for example; you could do that when you're acquiring and releasing. Locks, perhaps, so you could use TM techniques. You could do a little bit of magic that would make a lot of the overhead of doing it completely at the virtual memory level go away.

>> Anirudh Badam: Definitely. Depending on what semantics the programming language has, there are multiple optimizations you can do. One is, let's say you have some sort of transaction semantics. Then you can comfortably get away with just your RAM object cache and drop the other requirements. Or if at the beginning of, let's say, a function call you declare these are the sorts of objects I'm going to be using -- the handle could still be a virtual memory pointer, but it would get away from some of the overhead that SSDAlloc has. So in trying to be completely transparent, yes, we definitely had to jump through a lot of hoops. But in trying to be less transparent, it's sort of a trade-off between using custom object handles and being [indiscernible] to existing applications which have used virtual memory, trying to make their life easier.

>>: I guess my question is similar to John's. Do you have any sense of where in the spectrum of achievable performance these different tradeoffs sit?

>> Anirudh Badam: Right. So we did a spec benchmark of sorts with SSDAlloc, and we also did it with the chameleon system I was talking about, where all the physical memory is actually accessible using these virtual memory pointers. In that case, we actually saw a performance increase of about 16 times. And this is a purely in-memory benchmark that does not use the SSD at all. But with an SSDAlloc-like technique, where you are taking a fault to actually access, let's say, 9 percent of your data, that was actually what was causing the problem.

>>: I guess my question was different, which is, if you were using the SSD: what's the overhead of doing it with a completely optimal handwritten data structure, versus doing it if you're prepared to tweak your object access routines, versus this? How do those things --

>> Anirudh Badam: So one pathological example, the one I just gave, is the Bloom filter example. There are these flash-aware Bloom filters; what they do is make all their K hash functions fit into the same block, and they use additional randomization for this. If you use a regular Bloom filter, which is not flash aware and was just built originally for DRAM, the K random writes could actually translate to K SSD random writes, so there you get on the order of a K times performance improvement, not only in the number of writes but also in the number of reads. So these sorts of examples certainly exist. As for what the right thing to do is -- build a flash-aware data structure -- and what the right interface is: people are fairly familiar with the memory style of building things. They find it more convenient to build in-memory data structures as opposed to building storage-like data structures. That's the reason why we propose virtual memory as the interface moving forward with new non-volatile memory technologies.

>>: In the SSDAlloc part of the talk, you mentioned different vendors. In the market, we see a wide variety of devices with very different performance characteristics, and your work tries to integrate the SSD into the virtual memory system, probably without many vendors to consider. I'm basically wondering how the performance differentiation goes across vendors' systems.

>> Anirudh Badam: Sure. So there are some flash memory devices which actually do more random writes than random reads. This is because they're bottlenecked by [indiscernible], and if you're doing a lot of random writes, they're still doing a lot of sequential writes onto the SSD, because they have a fully associative FTL. So the performance simply boils down to the kind of FTL that they have. If your FTL is not optimized for random reads and random writes, the performance benefit would be very high with SSDAlloc. But if your FTL is actually fully mapped and can do random writes very well, the performance benefits boil down to the object sizes. Let's say your object size is 512 bytes and the virtual memory system is actually tracking dirty data at a four-kilobyte granularity. Then in the SSDAlloc case, you would still be writing eight times less data. So even in those cases you get the benefit. Regardless of what the FTL looks like, you end up getting benefits from the SSDAlloc model.
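As a rough sketch of the object-granularity write-back point, assuming a hypothetical per-object header and log_append writer rather than the actual SSDAlloc internals:

    #include <stddef.h>

    struct obj_hdr {          /* assumed per-object metadata            */
        size_t size;          /* object size, e.g. 512 bytes            */
        int    dirty;         /* set by a write-fault handler           */
    };

    /* Assumed stand-in for the log-structured SSD writer. */
    void log_append(const void *buf, size_t len);

    /* Write back just the dirty object rather than the whole 4 KB page it
     * sits on. With 512-byte objects that is 8x less data per flush than
     * page-granularity write-back. */
    void flush_object(struct obj_hdr *h, const void *obj_data)
    {
        if (!h->dirty)
            return;
        log_append(obj_data, h->size);   /* h->size bytes, not 4096 */
        h->dirty = 0;
    }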
>>: Another question. On the SSDAlloc side, how much additional in-memory data structure do you need to maintain, basically to keep track of these fragmented objects?

>> Anirudh Badam: So the overhead is the page tables for SSDAlloc -- the virtual memory to actual SSD location mappings that you have. For the page tables, each page table entry is probably 8 bytes in size for 64-bit systems. And in those cases, the page tables -- the data structures SSDAlloc maintains -- are pageable, so the page tables can actually reside on the SSD itself. In traditional virtual memory systems, if you have a four-level page translation mechanism, then many of the pages required for representing the page table itself can reside on the SSD. So it depends on the working set size: if your working set is small enough that it fits in DRAM, then you wouldn't see much overhead in terms of storing these mappings in DRAM, because the working set is small and hopefully the page tables themselves will [indiscernible].

>>: So, I mean, in the beginning you mentioned DRAM. Right now, we have very large SSDs -- ten terabyte SSDs.

>> Anirudh Badam: Yes.

>>: So if you're actually mapping such a large space in DRAM, what is the data structure size? How much is that --

>> Anirudh Badam: So with the current version of the Fusion-io [indiscernible], a ten terabyte drive would have about 32 gigabytes of FTL requirement, and the current Fusion-io drives actually don't have a pageable FTL, which means that all of this FTL is going to be in DRAM. I think this summer or by the end of this year, they're coming up with an FTL that actually resides on the SSD: only the mappings that are currently being used by the application or by the OS are stored in DRAM, and the rest of the mappings are stored on the SSD. So yes, moving forward, if you want larger flash memory devices, you need to think about mapping overhead. And mapping overhead can be reduced, as I mentioned for TEAM, by not storing page table mappings and FTL mappings separately but combining them; that is one sort of optimization you can do to reduce mapping overhead. The second optimization you could do is make the mapping data structures themselves pageable, so that you do demand fetching of the mappings themselves.
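For a rough sense of where a number like 32 gigabytes comes from, here is a back-of-the-envelope sketch assuming a fully mapped FTL at 4 KB granularity and roughly 12 bytes of metadata per entry; both assumptions are mine, not figures from the talk.

    #include <stdio.h>

    int main(void)
    {
        double capacity  = 10e12;              /* 10 TB device                  */
        double page_size = 4096.0;             /* assumed mapping granularity   */
        double per_entry = 12.0;               /* assumed bytes per map entry   */
        double entries   = capacity / page_size;
        double map_bytes = entries * per_entry;
        /* Prints roughly 2.4e9 entries and ~29 GB of mapping state, in the
         * same ballpark as the figure mentioned above for a 10 TB device. */
        printf("entries: %.2e, mapping table: %.1f GB\n",
               entries, map_bytes / 1e9);
        return 0;
    }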
>>: How long does it take to [indiscernible] an SSD?

>> Anirudh Badam: So if you look at the Fusion-io drive, it's an 80 gig drive, and it's rated for five petabytes of writes. As the lithography goes farther down, what ends up happening is that these companies would like to give you a warranty for three years, but as you go to higher capacity, the number of erases that you can do on these devices is decreasing. So right now they're at a point where five petabytes is actually not enough for three years if you're doing writes 24/7. But there are not many applications, fortunately, which are doing writes 24/7. If you look at more realistic applications -- you know, the logs from the data centers of their customers -- the devices can still last five years. But I guess once they move to [indiscernible] devices, they're going to have this problem of using MLC devices which are rated for only 5,000 cycles. So what the flash vendors are doing, instead of giving you purely SLC devices, is taking MLC devices. If you look at, let's say, 100 gigs of SLC versus 100 gigs of MLC, you would think the MLC is certainly cheaper, but what ends up happening is the economies of scale are different, so MLC is actually cheaper than SLC per byte -- not just twice as cheap per byte, but much cheaper than that. So what these companies are doing is, instead of buying SLC, they're buying MLC and using the MLC as SLC: they're not using the two levels inside the MLC devices. That actually gives them a slightly higher number of write cycles as well. So when they go to [indiscernible] with a really low number of [indiscernible], they're getting something like 8,000. So maybe for the next four or so years they can maintain the same kind of numbers: you can write five petabytes, and it will still last for five years -- that sort of number.

>>: A quick question. I'm wondering about [indiscernible] SSDAlloc in language-managed runtimes, where you might be able to track object use without going down to the DRAM level but also without requiring changes to the language.

>> Anirudh Badam: Right. So what you can do is take something like the JVM and modify it to use something like SSDAlloc, and then you readily have the best of both worlds: the JVM itself is using the SSD in an object-oriented fashion, and you also have the benefit of being able to translate pointers independently without having to change things for the application. So you can definitely do that. And instead of going this route, you can go modify the JVM itself to be flash aware, because they're all custom pointers, and you also have an indirection table within the JVM itself which maps a Java pointer to a particular virtual memory pointer, let's say; those can be transparently modified as well. We tried doing this inside the JVM -- we went inside the JVM to see what happens -- and it was not so straightforward. The JVM also has a coalescing memory allocator that creates large, five-megabyte portions and then splits those into objects, so it was much more complex than that.

>> Jeremy Elson: Okay. Thank the speaker.