>> Jon Howell: Alright, good morning everybody. It's my pleasure to welcome Yiying Zhang. She is a student with the Arpaci-Dusseaus at Wisconsin. She's interviewing for a post-doc position with the Systems Group. She is well decorated with several FAST papers and a USENIX paper. She's going to talk to us today about de-virtualization. >> Yiying Zhang: Thank you for the introduction and thank you for coming to my talk. I'm Yiying Zhang from Wisconsin-Madison. I'll be talking about de-virtualization in storage systems, which is the work I did for my PhD. We know that virtualization is a common technique to transform physical resources into an abstract form that can be more easily accessed. Virtualization has been used in many systems. For example, physical memory is virtualized into virtual memory to provide address spaces for different processes. For hard disks, the internal cylinders and platters are hidden and virtualized behind the logical block interface. For flash-based SSDs, the internal structures and operations are also virtualized behind the same block interface. We can see that virtualization provides simplicity, flexibility, and usually a uniform interface to different clients. To realize virtualization, the technique of indirection is usually used: referencing an object with a different name. Take this example: if we want to virtualize object B as A, we can add an indirection from A to B with a mapping table. Such mapping tables are used widely in different systems, like page tables in virtual memory management, flash translation layers for flash-based SSDs, and remapping tables for hard disks and RAID constructions. So we know that virtualization is good. But my question is: have we taken it too far? That is, redundant levels of virtualization are added in a single system, a problem we call excess virtualization. In this example, object A could have been directly mapped to object C; instead two levels of virtualization are added. Such excess virtualization happens in many systems; a good example is when we run an OS on top of a hypervisor. Another example is when we run a file system on top of RAID or an SSD, where each layer maintains its own logical address mappings. In all of these examples there's information redundancy because of excess virtualization. The real problem is that there's a cost to excess virtualization. First, we need memory at each virtualization layer to store the mapping tables. These are usually stored in main memory or in more costly device memory. Also there's a performance cost to access and maintain the mapping tables. So we can see that excess virtualization is redundant and costs both memory and performance. My question is: is there any way we can make systems work together so that we can more compactly represent the redundant information? The technique that we propose is de-virtualization, which collapses multiple levels of virtualization. In this example, we collapse the two levels of virtualization, remove the mapping from B to C, and directly map from A to C. Now imagine we have a file system on top of a virtualized device. For a block, we first map from the file offset to a logical address using file system data structures. Then we map from the logical address to a physical address using the device mapping.
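To make the stacked-versus-collapsed mapping concrete, here is a minimal C sketch of the two lookups the talk describes and the de-virtualized lookup that replaces them. The function and type names are illustrative only, not anything from the actual systems discussed.

```c
#include <stdint.h>

typedef uint64_t lba_t;  /* logical block address */
typedef uint64_t pba_t;  /* physical block address */

/* Stacked virtualization: two mapping tables, two lookups per block access. */
lba_t fs_map(uint64_t file_offset);           /* file system: file offset -> LBA */
pba_t ftl_map(lba_t lba);                     /* device FTL:  LBA -> PBA         */
pba_t fs_map_physical(uint64_t file_offset);  /* de-virtualized: offset -> PBA   */

pba_t lookup_virtualized(uint64_t file_offset)
{
    lba_t lba = fs_map(file_offset);   /* first mapping table (host memory)    */
    return ftl_map(lba);               /* second mapping table (device memory) */
}

/* De-virtualized: the file system pointer already holds the physical address,
 * so the device-level table and the RAM that backs it are no longer needed. */
pba_t lookup_devirtualized(uint64_t file_offset)
{
    return fs_map_physical(file_offset);
}
```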
What we do is remove the device-level mapping and directly map from file offset to physical addresses. With this basic idea, I began my research with a single hypothesis: that with the right interface we can remove the redundant virtualization. With this hypothesis I designed a new interface called nameless writes. A nameless write is like a normal write, but it sends only data and no name or logical address to the device. The device then allocates a physical address and returns that physical address to the file system. The file system then stores the physical address for future reads. I implemented nameless writes with Ext3 and an emulated SSD first. But I found that nameless writes change many aspects of the software, the hardware, and the interface. So I was curious to see if the idea still holds with real hardware, and I built a hardware prototype on top of an SSD testing board. With that hardware experience I found that there are actually many problems with the nameless writes interface, basically because nameless writes are fundamentally difficult to integrate into the existing I/O interface. With these lessons I'm now building a new tool called the File System De-virtualizer, which doesn't have the complexity of nameless writes and can still use the existing I/O interface. What it does is dynamically change file system pointers to point to physical addresses, so it is better suited for dynamically removing mapping table space. This is ongoing work, so I don't yet have satisfying results to show you. But overall we found that de-virtualization reduces device RAM space by fourteen to fifty-four percent. Also it improves random write performance by twenty times. >> Jon Howell: Percent or times? >> Yiying Zhang: Times, times, sorry, fourteen to fifty-four... >>: On fourteen X, can you give us the scale? Is it essentially from fourteen kilobytes to one kilobyte, or from fourteen gigabytes to one gigabyte? >> Yiying Zhang: I think it's from maybe one GB to, yeah, less than that. It depends on your device size actually. >>: [inaudible] >> Yiying Zhang: I also learned a set of lessons from this experience. First, with the right interface, excess virtualization can be largely removed. But adding a new interface can be hard, and it's even more difficult with real hardware. But with a lightweight tool it can be done with more flexibility. So in the rest of my talk I'll first go over some background on flash-based SSDs. The technique of de-virtualization applies to different types of storage devices, but SSDs are a major use case, so I will talk more about SSDs. Then I will talk about the new interface and the software and hardware prototypes of it, then the new tool that I'm building now, and finally future work and conclusions. I know many of you already know a lot about SSDs and I apologize for the redundancy, but I will try to go quickly. An SSD is a device that provides a block interface. Inside the SSD there's a controller, internal RAM, and a set of flash memories. Within the flash memory there's a set of erase blocks, each erase block has a set of pages, and each page is associated with an out-of-band (OOB) area which stores things like ECC bits. There are three operations for flash memory: read, write, and erase. Reads and writes happen at the granularity of a flash page, which is usually two, four, or eight KB. A property of flash memory is that you cannot overwrite a flash page without erasing it first.
So in this example, if you want to write to the first page without first erasing the first erase block, there will be an error. Erases happen at the granularity of an erase block, which is usually two fifty-six KB to one MB. With these erases, flash blocks wear out, so with different access patterns to different blocks, certain blocks may die sooner than others. To prevent certain blocks from dying too soon, the technique of wear leveling is used to make the erase blocks wear out evenly. To virtualize the SSD, a Flash Translation Layer, or FTL, is usually used to provide the block interface and to hide the internal operations. It usually uses a mapping table to map from logical addresses to physical addresses. Such a mapping table is usually kept in device RAM, which is costly in both money and energy. For example, for a one terabyte SSD you need sixteen GB just to store the mapping table if you map at four KB granularity. So instead, most modern SSDs use a hybrid mapping technique: they keep a small page-mapped area and a large block-mapped area. Now you have one point eight GB for a one terabyte disk. However, there's a performance cost because of garbage collection in hybrid FTLs; this is the main reason for poor random write performance in SSDs. So now imagine you have a file system on top of an SSD. You first map from file offset to logical address and then from logical address to physical address, so there's excess virtualization here. What we do is store physical addresses directly in the file system and remove the mapping in the FTL; in doing so we reduce both the memory space cost and the performance overhead. One major thing that we want to make sure of is that the device still has critical control of its hardware, like garbage collection and wear leveling. So now I will go to the new interface, the nameless writes work. I'll first go through the major design and the emulation results. This is the overall architecture of nameless writes: first you need to port the file system to nameless writes and then have a nameless-writing device. The interface between them is a set of nameless write interfaces. Basically, the device returns the physical address to the file system and the file system stores it for future reads. Now I'll talk in more detail about the actual interface, called nameless writes, and I'll show you an example of how it works. In the example, you can think of the orange block as a data block and the blue block as an inode. The inode points to the data block, and the file system wants to write the orange data block. So it sends only the data to the SSD and no address. The SSD writes it at physical address P and then returns physical address P to the file system. The file system then stores it in its inode. For nameless writes we also need a set of other interfaces to work together. For example, for reads we now need to use the physical address P to read the data block, and we call this a physical read. Also, because a nameless write is an allocation process, we also need de-allocation, so we use a free or trim command for de-allocation. The interface I just described is quite simple and naive, but it has a set of problems. First, if we use nameless writes as the only write interface, there will be a performance cost. Also, devices like flash-based SSDs need to move physical blocks, and we need a way to handle that.
Finally, there's the question of how we can find metadata structures efficiently. Yep? >>: You said that flash has to move the blocks. I thought that the point of this interface is to [indiscernible]. I thought the moving was to deal with garbage collection and wear leveling, but if you're exposing allocation to the layer above, why do you still need to move the blocks? >> Yiying Zhang: So one thing, like I said, is that we still want the device to control its wear leveling and garbage collection. The device internally does garbage collection and wear leveling, but whenever it does, it changes physical addresses, so it needs the file system to know. The reason we want the device to maintain its control is, for one thing, that it can be dangerous if your software controls the hardware directly. The other thing is we don't think vendors will ship devices whose wear they cannot guarantee. >>: Can you explain the source of the overhead you're talking about in the first [indiscernible] point? You said that if you use nameless writes then everything... >> Yiying Zhang: I will go through each of these... >>: [inaudible] >> Yiying Zhang: ...points in turn. To explain the first problem, the overhead of nameless writes, I will use a simple example. Imagine a small file in the root directory: the data block is pointed to by its inode, which is pointed to by the root directory, and the file system wants to overwrite the orange data block. So it sends only the data to the SSD, the SSD allocates physical address P zero and returns P zero to the file system. The file system then stores it in the inode, and since the inode has changed it also needs to be written to the SSD. So again the SSD allocates P one and returns P one to the file system. The file system then stores both P one and the offset of the inode within that block, which together point to the inode. Since the root directory block has changed, it also needs a [indiscernible], and the SSD writes it at physical address P two. You can see there are several problems in this process. The first problem is the overhead of recursive updates. Originally we only wanted to overwrite the orange block; now we have three writes, and we also need to enforce ordering in this chain of recursive updates. Imagine if we have a long directory chain: this can go up to N writes. Another problem is that with only nameless writes the file system becomes more complex. Originally the inode could be pointed to by the inode number; now we need the physical block address of the inode block and the offset of the inode within that block together to point to the inode. Also we need a different technique to locate the [indiscernible] to the block. To solve this, yep? >>: Can I ask another question about this [indiscernible]? Does this mean the SSD has to maintain a [indiscernible]? In other words, in the file system you delete a file and you just delete the metadata... >> Yiying Zhang: Right. >>: And later you can rewrite the same blocks. But now it seems like the SSD would have to know which blocks you've overwritten... >> Yiying Zhang: That's why we have that free or trim command. >>: Okay. >> Yiying Zhang: And not all file systems support trim, so we added trim support to the file system. To solve the first problem of recursive updates, we proposed the technique of a segmented address space. The first address space is a physical address space where we use nameless writes and physical reads.
The second address space is a virtual address space where we use traditional reads and writes. We also need to keep an indirection table for the virtual address space. We map all the data blocks to the physical address space and all the metadata blocks to the virtual address space. Note here that metadata is usually small, around one percent of the file system layout, so the mapping table for the virtual address space can also be small. Now let me explain how the segmented address space works. We map all data blocks, which are the majority of the file system, to the physical address space, and all the metadata to the virtual address space. For the physical address space we send only data to the device. The device allocates physical addresses for them and writes them to their physical locations. After that it returns the physical addresses to the file system and the file system keeps them. For the virtual address space we send both the data and the logical addresses together to the SSD. The SSD allocates physical addresses, keeps mappings for them, and writes them to the physical locations. So now let's revisit the same example. We want to overwrite the orange block, so the file system sends the data to the device, the device allocates it at P zero and returns P zero to the file system. Then the file system updates the inode, and since the inode is in the virtual address space, we write both the inode block and its logical block address, L one, together to the SSD. The SSD allocates the physical address P one and adds a mapping from L one to P one. But now we don't need to change the root directory block, because it still uses the inode number to point to the inode. So we can see that now we only have one level of update propagation: we go from N writes to two writes. With this design it's also simple to implement and debug. The second, yes. >>: So should I compare those two writes not to the single write of the file system talking to the top of the device's virtual interface, but to the two writes of the file system talking... Like, these two writes are two physical writes, and I should contrast them with the stacked virtual address spaces, where the thing that looked like one write from the file system became more than one write down in the FTL. Is that why two writes is not bad compared... >> Yiying Zhang: So these are all writes of blocks... >>: Yes. >> Yiying Zhang: ...below the file system. What we want to do is an overwrite of the orange block. >>: Yeah. >> Yiying Zhang: The inode points to the orange block, and there's a field like change time in the inode. So whenever you modify a file, the inode... >>: Oh... >> Yiying Zhang: A traditional file system needs to write the inode anyway. >>: Okay, so two writes already the [indiscernible]. >> Yiying Zhang: Yes, yes, right, right. So... >>: I didn't... >> Yiying Zhang: It goes back to two writes. >>: But won't the inode store just the block number, and if you virtualized the block number itself wouldn't you have [indiscernible] just one write in the previous case? >> Yiying Zhang: So the inode usually also has a field called change time, so you need to update that file modification time anyway. >>: What about the... >> Yiying Zhang: But... >>: Indirect blocks? Even for a large file the inode... >> Yiying Zhang: Yeah, for indirect blocks, yes.
So if it's not pointed to directly by the inode, if it has indirect blocks, then with nameless writes you need to update the next level of indirect blocks, whereas with a traditional file system you don't need to update those indirect blocks. But with Ext3 we write all this metadata with a journal, so it can gather more metadata blocks together. But yes, there's a performance cost. The second problem comes from the need for flash-based SSDs to migrate physical blocks for tasks like wear leveling. Now I'll show you why this can be a problem for nameless writes. Imagine the SSD wants to do wear leveling and moves a physical block from P one to P two. After it has been moved, its old address can be erased and then could be written with new data. Yes? >>: [inaudible] doing this move because you want to do compaction of all of... >> Yiying Zhang: The SSD needs to do wear leveling. Basically you have different write patterns to different erase blocks and you have different wear. So the SSD actually... >>: You could do better literally without actually moving these blocks [indiscernible]. Just copy the data over, it is... >> Yiying Zhang: That's what it's doing. >>: But don't just write into the same one location. [indiscernible] when you create you could potentially create an [indiscernible] block, write the data to the same one location and you'd... >> Yiying Zhang: But the... >>: And you create a lot of [indiscernible] where other blocks can be stored. >> Yiying Zhang: The same location would have the same write pattern. So if... >>: [inaudible] >> Yiying Zhang: That block has... >>: Yeah, so keep the physical address of the live blocks of [indiscernible] and whatever garbage blocks are there, use those for filling in [indiscernible]. >> Yiying Zhang: So there are two things. One is garbage collection, the other is wear leveling. For wear leveling, imagine you have a data block that is written very frequently, so it holds hot data. Wherever it sits, it still has the hot data and that block will get more and more erases. >>: So this [indiscernible]... >>: I'm confused about this, because I thought the nameless writes interface means you can't make a hot block, because the file system can't say I want to write there. It says I would like to write... >> Yiying Zhang: Right. >>: And then the SSD says, well, you can write over here. So I don't understand why the SSD can't do wear leveling at the... >>: Because it's just [indiscernible]. >>: Because the read uses a physical address, so you need to update the file system. >>: The solution to the previous problem, Tom, was to reintroduce a logical translation layer. >>: I see... >>: For the metadata blocks. >>: And that reintroduces heat. >>: [inaudible] >>: So this is just for metadata then? You said you've got two partitions here, right, the physical one and the virtual one? >> Yiying Zhang: Right. >>: This is just for the virtual one, is that right? >> Yiying Zhang: It's whatever the device wants to do with wear leveling. It doesn't know what is metadata and what is data. Whatever it does, whenever it sees that a block is written very frequently, it will consider it a hot block and swap it with some cold block. That's the basic idea. >>: No but... >>: [inaudible] question, which is: if a hot block is in the data partition, not the metadata partition, then the solution is just to stop writing there.
Then when another write comes down it picks another block, right? >>: Right. >> Yiying Zhang: Whenever a write comes down, the SSD will always pick another location. That's what we hope the SSD does, but we don't make assumptions about whatever the SSD is doing internally for wear leveling. The main point is that whenever it does move blocks, it needs to, at least for the metadata part. >>: Is this [indiscernible] happening in the background? So asynchronously you have data moving around as part of the wear leveling, and that's why it has to asynchronously notify the file system. Because the file system has knowledge of where the data is, and that has to get asynchronously updated. Is that the idea? >> Yiying Zhang: Yeah. So basically after it moves, I'll get to this now, after it moves, the old address will be written with new data. But the file system will still think this block is at P one, so it will read P one and the wrong data will be returned. >>: Can you give us an example of where the device would do this? You're changing the interface but you're not changing the SSDs. >> Yiying Zhang: I'm also changing the FTL, but the FTL still needs to do wear leveling, and we don't change the technique for doing wear leveling from traditional SSDs. What it does is, when it identifies a hot block, it does this wear leveling so that cold data can be written to that block. So it needs to move data around. >>: It's a background maintenance process. It's happening all the time. >> Yiying Zhang: Yeah, it's a background... >>: [inaudible] here that you're sort of inventing a new SSD that has a new interface, or that you're trying to use a new sort of... >> Yiying Zhang: So it's... >>: [indiscernible] existing SSD? >> Yiying Zhang: The interface needs to be changed and the SSD's FTL also needs to be changed. But we don't particularly change, at least we haven't changed, the wear leveling algorithms. So it can still use whatever it did before. >>: So the wear leveling is triggered from within the FTL? >> Yiying Zhang: Yes, yes. >>: And you did modify the FTL in order to modify the interface, but you did not modify that aspect of the FTL. >> Yiying Zhang: Right, yes. >>: So the suggestion Jon had is something you could have done but did not do. >> Yiying Zhang: Right, right, yeah. But, sorry, I can do this offline maybe. So whenever the device moves blocks, it needs to inform the file system of the physical address change. The solution we propose for this is a new interface called migration callbacks. Now let's revisit the same example. The SSD moves a block from P one to P two, and after the move it adds a temporary remapping entry from P one to P two. Now if the file system wants to read it, it will read P one, which will be remapped to P two, and it gets the correct data. The SSD also sends this information back to the file system, and a background process in the file system will process these callback entries and then update the metadata to the new physical address. After the update, the file system sends an acknowledgement to the SSD, and then the SSD can remove the remapping entry. Yes? >>: Does the file system have any [indiscernible] mapping where the physical blocks [indiscernible]? >> Yiying Zhang: No. >>: Then how does it know where physical block one is? >> Yiying Zhang: I'll get to it. This is the third problem. The third problem is, like you said, that we need to locate the metadata that needs to be changed.
This is needed for callbacks and the recovery process. Naively we could scan through the whole device to find the right metadata, but of course that's too costly. So we use a simple solution that we call associated metadata. It operates like a back pointer: it keeps information like the inode number, inode generation number, and offset, and with this information you can find the metadata. Whenever the file system writes a data block it also sends the associated metadata to the device. The device writes it in an adjacent location next to the data page, which is the OOB area, and whenever it does callbacks or recovery it uses the associated metadata to locate the right metadata block. Did that answer your question? >>: Looks like... >>: So for every block you have some small amount of extra storage. >> Yiying Zhang: So the OOB area is already there; it's used to store things like valid bits and it's within the flash memory. It's usually a one twenty-eight byte area next to each four KB flash page. >>: And is this something that was previously completely private to the SSD and you've now exposed to the file system? >> Yiying Zhang: Yes, and I'll show you how that can be a problem. The SSD will write the OOB, but the file system needs to assume that such OOB data can be returned. Also, for reliability reasons, which I'll go over now, we make some further assumptions. Some of you may have already noticed that there can be reliability issues within this process and the nameless write process. What can happen is that if a crash happens during a migration callback, before the metadata block is updated, you have inconsistent metadata. There are also other reliability issues that I will go into in more detail. Yes? >>: Well, it sounds like you just declared that every block that was previously four K is now four K plus a few bytes. You've got this little bit of extra data on the side with every block... >> Yiying Zhang: Yes. That's... >>: That's a funny interface to hand to a file system. >> Yiying Zhang: But that's the solution we have, and we'll see that the OOB area is already there and it's usually one twenty-eight bytes. I think we use around thirty-two bytes, something like that. >>: And the flash vendors put that area next to the block and they're not already using it... >> Yiying Zhang: Per flash page. >>: ...most of the time? >> Yiying Zhang: They're using it, but we are hoping that they do not fill it. That's a problem and that's a big assumption that we make. >>: So, ha, ha, if I'm a flash vendor it sounds like you've suddenly magicked up an extra few percent of storage in your flash drive... >>: Ha, ha. >>: That we could use for our own purposes... >>: Well... >>: If they're putting that storage there they're probably using it for something. >>: ECCs. >>: Yeah. >> Yiying Zhang: It's used for ECC and valid bits, but it has more space than that. >>: Okay, the [indiscernible] argument is that if you're willing to change this interface then maybe you're willing to change the shape of the rectangle ever so slightly, remove a couple rows and add a couple columns to make it adequate... >>: Well, so hypothetically couldn't the file system, instead of using the flash's private [indiscernible], just take the device, lop off some extra space that it's not going to use to store file blocks, and use that to store this kind of data? >>: Oh, I thought that part of the goal was to be able to do it in one contiguous write.
>> Yiying Zhang: So basically when you read, for things like callbacks, if you store the associated metadata adjacent to the flash page, in the out-of-band area, then you can read them together. You only pay one read; otherwise you need to read two flash pages. But that's a big assumption that we are making here, and that's a cost your device will pay. Yes? >>: Is the OOB currently used in service to the FTL? Which you then wouldn't need, so you could repurpose [inaudible]. >> Yiying Zhang: It's using it, but at least as far as I know they don't use all of the one twenty-eight bytes... >>: But I'm just saying... >> Yiying Zhang: ...in the FTL. >>: Because this part of the address space, as I understand it, is not being remapped, so maybe you could reuse the bytes that were being used for the logical-to-physical mapping for this purpose. Is that an option? >> Yiying Zhang: Yes, but you are saying that we can store the back pointers for the metadata, the virtual address space, within that space, the OOB area. Is that what you're suggesting? >>: I think I'm not being clear. I'll talk to you about it offline. >> Yiying Zhang: Okay. So for these reliability issues our basic technique is again to rely on the associated metadata, the back pointers, and we can use these back pointers to reconstruct metadata during recovery. As a side note, we also keep a timestamp, which increases the OOB requirement, to find the latest version. So we have two requirements here: first, there needs to be OOB space for us to write, and second, we need to atomically write the OOB area and the flash page for reliability. Also, this is something I won't go into detail on, but if you're interested, we have a different garbage collection method which reduces migration callbacks. But we need to use NVRAM during that garbage collection, and I can talk more about it one on one. Our reliability results show that we can recover from all system crash cases. The recovery process takes zero point four to six seconds for a four GB SSD. I implemented nameless writes with Ext3, and we used the ordered journal mode because it's a widely used journal mode. It also fits nameless writes well because it always writes metadata after data. We support the segmented address space, the different interfaces, and migration callbacks. The total lines of code changed is around forty-three hundred, across the existing Ext3, JBD, and generic I/O layers, and you can see that the changes are more invasive than I originally thought. >>: So did... >> Yiying Zhang: Yes. >>: What was on the other side? Did you modify an SSD or do you have a simulator that... >> Yiying Zhang: I'll go into that in a couple of slides, I think. >>: Okay. >> Yiying Zhang: So to port a device to nameless writes, it first needs to support all the interfaces. We allow the device to choose any kind of allocation; the specific device that I built uses log-structured allocation because it gives you sequential-like write performance. It also needs to maintain a small mapping table for the virtual address space and the remapping table. One thing we still let the SSD control is its garbage collection and wear leveling techniques. I have some optimizations for these and I can talk more offline.
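Pulling the pieces from the last few slides together, here is a rough C sketch of the command set a nameless-writing device would expose: nameless writes that return a physical address, physical reads, ordinary virtual reads and writes for the small metadata segment, a free/trim command, and migration callbacks with acknowledgements. The names, types, and exact signatures are my own illustration under those assumptions, not the actual prototype's interface.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t lba_t;
typedef uint64_t pba_t;

/* Data segment: allocation is done by the device. The write carries no
 * address; the device picks a physical page and returns its address,
 * which the file system then stores in its metadata (e.g. the inode).
 * The associated metadata (a back pointer) is kept in the page's OOB area. */
int nameless_write(const void *data, size_t len,
                   const void *assoc_metadata, size_t meta_len,
                   pba_t *out_pba);

/* Reads in the data segment use the physical address directly. */
int physical_read(pba_t pba, void *buf, size_t len);

/* Metadata segment: small virtual address space with the traditional
 * interface, backed by a small mapping table inside the device. */
int virtual_write(lba_t lba, const void *data, size_t len);
int virtual_read(lba_t lba, void *buf, size_t len);

/* De-allocation: tell the device which physical blocks are free again. */
int free_blocks(const pba_t *pbas, size_t count);   /* trim/free */

/* Migration callbacks: wear leveling or GC moved old_pba to new_pba.
 * The device keeps a temporary remapping entry until the file system
 * has updated its pointers and acknowledges the move. */
struct migration_cb { pba_t old_pba; pba_t new_pba; };
int get_migration_callbacks(struct migration_cb *cbs, size_t max, size_t *got);
int ack_migration(pba_t old_pba);   /* device may now drop the remap entry */
```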
So what I built first is a pseudo block device driver in Linux to emulate the nameless write device. It sits directly below the file system and talks to the file system with the nameless write interfaces. It keeps all the metadata and data in RAM so it can emulate device performance. I built three types of FTLs. The first is a page-level FTL with log-structured allocation; it gives you ideal performance but unrealistic mapping table space because it maps each four KB page. The second FTL is a hybrid FTL, with which I tried to model real SSDs, and finally I have the nameless write SSD. This is the mapping table space result that we have. We used Impressions to generate typical file system images. Here I show one hundred GB and one terabyte file system images and the mapping table size in MB. We can see that the page-level FTL uses a lot of mapping table space, the hybrid FTL has a moderate mapping table size, and nameless writes has a very small mapping table size. In fact we find that nameless writes uses a fourteen to fifty-four times smaller mapping table than the hybrid mapping, which models real SSDs, and a one twenty to four sixty-five times smaller table than the page FTL, which is the performance upper bound. >>: So the way I should think about that savings is that a terabyte of flash would have incurred the cost of buying another two gigs of RAM... >> Yiying Zhang: Right. >>: And this makes it almost free. So how do those costs relate? Do we just save ten percent or five percent of the cost of the device? >> Yiying Zhang: So the fixed cost is one thing, and the other cost is the energy cost, which is more of a concern for... >>: Do you have any idea what fraction of the device's energy goes into the RAM? >> Yiying Zhang: I don't have an exact number but I think a lot of it goes to the RAM. That's especially true for cell phones and mobile devices; the RAM consumes more of the board's energy. Yes? >>: So have you [indiscernible] model with the mapping tables' redundancy [indiscernible]... >> Yiying Zhang: The mapping tables themselves? >>: The mapping tables themselves... >> Yiying Zhang: Oh, yeah, yeah. >>: In the sense that... >> Yiying Zhang: Right, right, I know, yeah. >>: You probably wouldn't need the entire mapping table... >> Yiying Zhang: Yeah, yeah, but then we need to assume a small working set size. >>: Right. >> Yiying Zhang: And you still need storage for the whole mapping table, though that's maybe a lower cost. So now I will show you the performance results. The first is sequential and random write performance in thousands of IOPS. Here we can see that for sequential writes all three FTLs perform similarly. For random writes the hybrid FTL performs much worse than the page and nameless write FTLs. >>: Right, this is running against your simulated block devices in all three [inaudible]? >> Yiying Zhang: Yeah, I built all three FTLs. And... >>: In the simulated device, are you slowing down RAM accesses to mimic the device? >> Yiying Zhang: Yeah, that's a [indiscernible], so. >>: Just making sure that it's [inaudible]. >>: Sorry, I'm too slow; I have a question on the previous set of results. For the hybrid scheme where you're splitting into the data region and the metadata region, isn't the savings dependent upon your workload? >> Yiying Zhang: This hybrid is not a... >>: Oh. >> Yiying Zhang: ...nameless write hybrid. >>: Sorry. >> Yiying Zhang: The nameless write... >>: Sorry... >> Yiying Zhang: ...is what I proposed. >>: For the nameless write, then.
Isn't it dependent on the workload? >> Yiying Zhang: Yes. >>: If I have a lot of files... >> Yiying Zhang: Right, right. >>: And a lot of metadata... >> Yiying Zhang: That's why we use this tool called Impressions; it generates typical file system images. >>: So I guess what I'm wondering is, if I was going to build an SSD... >> Yiying Zhang: Yes. >>: How big should I make the mapping table? [indiscernible]... >> Yiying Zhang: You can have a fixed... >>: The worst possible, worst case... >> Yiying Zhang: You can have a fixed mapping RAM space... >>: Yeah. >> Yiying Zhang: And whenever it goes beyond that, it would just page out to normal flash pages. >>: [inaudible] >> Yiying Zhang: But typically it's around one percent metadata. >>: And SSDs already have that property today when they're built. >> Yiying Zhang: What property? They have RAM with a fixed size; for both page and hybrid FTLs they can calculate a fixed-size mapping table space. >>: Yeah... >> Yiying Zhang: And for nameless writes... >>: For an SSD you've made the performance properties of the device way more complicated, because today if I buy a device it has a certain random access time, certain read performance, certain write performance. But in the future, say we adopt this, the performance of my device depends upon [inaudible] what I'm storing on it. If I'm storing lots of big files or lots of other files then I'm using the mapping table... >> Yiying Zhang: Right. >>: ...paging in and out. Does that make sense? >>: I mean, to put... >> Yiying Zhang: That's... >>: ...a sharp point on the concern here: you're talking about a cliff, right, when suddenly my flash device starts swapping. >>: My flash device starts swapping because I put too many little files and not enough big files on my file system. >> Yiying Zhang: Maybe another way to answer is that it also hints to flash designers to have smarter techniques to reduce their metadata space. >>: It's not a bad thing necessarily, but it's another complexity in performance analysis. >> Yiying Zhang: Right. So here, although we found that nameless writes has twenty times the random write throughput of the hybrid FTL, the key point is that it's close to the page FTL, which is the performance upper bound. Now I'll also show you some macro-benchmark results. These are Varmail, FileServer, and WebServer from the Filebench suite. Here I show throughput in MB per second. You can see that for Varmail and FileServer the hybrid FTL does much worse than the other two FTLs because Varmail and FileServer have more random writes. WebServer has less than four percent random writes, so everything performs similarly. To summarize the nameless writes emulation part: we use a new interface for de-virtualization and it can reduce both mapping table space costs and performance overhead. Now I'll go to the next piece, the hardware prototype. As researchers we often evaluate our ideas with emulation or simulation at a more conceptual level. This is fine, and it's actually the most common case for flash-related research. But imagine if you or someone else wants to actually build the thing; it can be more complicated. To transform ideas into reality there are many differences. First, real hardware is more complex; then, the real interface can be different; and even the software stack down to the hardware can be different. Nameless writes actually change all three of these parts.
So it would be good to see if the ideas still hold with real hardware. Personally I was also curious and excited to build a real hardware prototype because I hadn't done this before. So I had this SSD testing board. It has an Indilinx Barefoot ARM SSD controller, an internal sixty-four MB DRAM, and a set of flash chips. The interface to the host is a standard SATA interface, and it also has a couple of serial ports for debugging. So now what we need to do is transform the emulation into reality, and we can see that reality is more complex than emulation. Below the file system there are the block layer, SCSI, ATA, and AHCI drivers, and finally the SATA interface to the SATA SSD device. In the different layers there can also be command queues and I/O schedulers. So what we need to do is map the nameless write interface onto the whole OS stack, and we also need to turn the SSD emulator into a real device. My first design was to build a nameless write FTL on the hardware device and work everything else towards it. Then I found a set of problems with this design and moved to the second design, which places most of the nameless write functionality in the OS and has raw firmware on the device. I'll go through the problems that I had with the first design and then come back to the second design to show you why it works well. But before going further, I'm sorry for again giving you some background, this time on SATA. Basically SATA has a stack of layers; the top layer is a software layer which can be controlled by the host and the device. The lower layers are more related to the hardware and do the actual physical communication, but all we need to know is that the lower layers are in the host and device ports and cannot be changed. The top layer talks with the ATA command standard; the standard cannot be changed, but the top layer can be changed. So I'll talk more about the ATA command standard. For ATA commands, there's first a set of non-data commands. The host sends to the device the command and then, optionally, an LBA, a size, and features. The device sends to the host a status and then, optionally, an LBA, a size, and, if there's an error, the error bit. For the actual data commands, PIO and DMA, the host sends to the device a command, an LBA, a size, and the data to be written. The device sends back a status, the data to be read, and, if there's an error, the error bit and the first logical block address that has an error. So now I'll go to the problems that we need to solve to build a nameless write FTL on the hardware. The first problem is that we need to add new command types. Another problem is that we need to return physical addresses back to the file system. We also need to integrate the migration callbacks into the interface, and finally there are some consistency problems. I'll go over each of these in turn, but quickly for the easier ones. The first one is much easier: with the new hardware interface we just add a flag in the OS, and for the device we use the highest two bits of the size field to represent a new command type. The second problem, that for nameless writes we need to return physical addresses, is more severe, because there's no ATA... Yes? >>: So [indiscernible] nameless writes is not something that has been specified in standardized [indiscernible] hardware before.
>> Yiying Zhang: So basically the ATA command standard doesn't support returning any address on the return path. Our first attempt was to repurpose the error return. If you remember, on an error the device sends back the first logical block address that has an error; instead, we could just send back a physical address. This seems viable, but whenever the device sets the error bit, the device freezes. Our second attempt was to repurpose a special [indiscernible] ATA command called read native max address, which is the only command that expects an address in return; it expects the maximum address of the device. Instead, we can just send a physical address back, and then with this physical address send the nameless write. There are also problems with this second attempt. First, we now use two commands for each nameless write. Second, there's a danger of silent data corruption. Yes? >>: Why does the device freeze just because it returns an error? >> Yiying Zhang: I'm not sure, but every time it sends an error interrupt the device just freezes, at least with that board; maybe it's not true for all boards. >>: Okay, so basically the first command [indiscernible] is effectively allocating the region and getting the physical address, and the next one sends the write [indiscernible]. >> Yiying Zhang: Right. >>: So if I have lots of these happening at the same time, how do I associate... >> Yiying Zhang: That's why there's a... >>: You're going to get to that, okay. >> Yiying Zhang: That's why there's the silent data... >>: That's okay. >> Yiying Zhang: ...corruption problem... >>: That's only if you employ one write at a time. >> Yiying Zhang: So imagine you allocate two sectors in the same flash page, but you have different processes and there are schedulers at different places. So when the nameless writes finally happen, you can write the same page twice. Again, that's why you can have silent data corruption. The next problem is even more difficult: we need to send an address back from the device to the host for migration callbacks. One possible solution is to piggyback on other returns, but then it has the same problem as the last one. Another option is that the OS can poll for these migration callbacks, but that is more costly in performance. Finally, there are device-related assumptions that don't hold, like the OOB area that we required. On the actual hardware, the FTL doesn't have access to the OOB area, at least for that board. It has an OOB area, but it's handled automatically by the hardware doing ECC and other things, so we don't have a [indiscernible] to it, and there are no atomic writes. Yes? >>: [inaudible] >> Yiying Zhang: So this is the firmware that we are changing. But there's also, like, a more... >>: [inaudible] why didn't you stick with the standard [indiscernible]? >> Yiying Zhang: That's my next design. >>: [inaudible] [laughter] >>: USB or something might cause you less headache. >>: Yeah, PCI... >>: [indiscernible] >>: Protocol, yeah. >>: Yeah. >>: [indiscernible] thirty-two if we're going to... [laughter] >>: Do it at all. >> Yiying Zhang: So one thing that I won't cover in the slides is that you could probably use PCI with more [indiscernible] PC-like costs. But I have another design that stuck with the ATA interface. We suspect that these problems are more specific to this hardware board and may not be [indiscernible] with your hardware.
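As a concrete illustration of the two-command workaround just discussed, here is a rough C sketch of allocating via the repurposed read native max address command and then writing to the returned physical address. The helper names are hypothetical wrappers around ordinary ATA commands, not the real driver code; the comment marks the window where the silent-corruption race described above can occur.

```c
#include <stdint.h>

typedef uint64_t pba_t;

/* Assumed wrappers around ordinary ATA commands. */
pba_t ata_read_native_max_address(void);  /* repurposed: returns a freshly
                                             allocated physical address
                                             instead of the device's
                                             maximum address */
int   ata_write(pba_t pba, const void *data, uint32_t sectors);

int nameless_write_over_ata(const void *data, uint32_t sectors, pba_t *out)
{
    /* Step 1: ask the device to allocate; it answers with a physical
     * address in the only ATA reply field that can carry one. */
    pba_t pba = ata_read_native_max_address();

    /* RACE WINDOW: if another process also allocates here and the I/O
     * scheduler reorders the two writes, two sectors of the same flash
     * page can be written without an erase -> silent data corruption. */

    /* Step 2: issue a normal write to the address we were just given. */
    int err = ata_write(pba, data, sectors);
    if (err == 0)
        *out = pba;
    return err;
}
```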
So we can see that the first design didn't work, and that's why I moved to the second design. The second design places most of the nameless write functionality in the OS, keeping the old interface between the file system and a block-layer nameless write FTL, so everything can be done in software. It sends only raw commands to the device: flash page reads and writes, and flash block erases. These raw commands can be easily built with ATA, and it's also easier to implement and debug. But there can also be danger in this design if we are not careful. At first, I placed the nameless write FTL on top of the I/O scheduler. What can happen then is that we can again write the same page twice without erasing it, because the I/O scheduler is below the FTL, and there will be silent data corruption. So next I moved the I/O scheduler above the FTL, and because the FTL does allocation after the I/O scheduler, I can make sure that there's at most one write per page. I also have a special I/O scheduler designed for nameless writes; basically it can merge any kinds of writes, because the nameless write FTL will allocate the physical addresses anyway. This is the performance result. We show sequential and random writes and sequential and random reads; this is throughput in KIOPS. You can see that in some cases nameless writes outperforms the page-level FTL; this is because of the special I/O scheduler that I added. But overall we found that nameless writes performs similarly to the page-level FTL. For the mapping table size we have a similar conclusion: it's much smaller than the page FTL's. With this hardware design I learned several lessons. First, adding a new command is quite easy, but adding information in the return path from the device is much more difficult, and there are certain assumptions that may not hold with real hardware. Most of this is because fitting the nameless write interfaces into the ATA protocol is fundamentally hard. That's why I moved to the second design, the split FTL. By moving the FTL functionality mostly into the OS stack, we can make things much easier. But there's a danger when the device loses control and lets the software control its internal hardware, like the I/O scheduler example, and because we are placing the FTL in the OS it will consume more host CPU. To conclude the nameless writes part: with nameless writes a large part of excess virtualization can be removed. It reduces mapping table space, energy cost, and performance overhead. But it's hard to integrate nameless writes into the ATA interface, and it requires fundamental changes to the firmware, the file system, and the OS stack. So now I'll talk about the last piece, which I'm working on now. The first motivation for why we need yet another technique to perform de-virtualization is to reduce the complexity that nameless writes has. Ideally we want no file system changes, no device changes, and something that still works with the current I/O interface. The second motivation is a slightly different use case. Originally, when we designed nameless writes, we had hardware in mind: the hardware maintains the mapping table, the indirection layer. But what many systems have now is a software layer that maintains the indirection tables.
Whenever the storage device needs more mapping table space, it contacts the OS or the hypervisor to allocate more memory for it. In this situation we want a more dynamic way to perform de-virtualization and reduce the mapping table size, and we want to do it whenever needed without the complexity of nameless writes. That's why I built this tool called the File System De-virtualizer, or FSDV. Normally everything works as it traditionally does: there's a device virtualization layer, which can be in software or in the storage device, the file system talks with it through the normal block interface, and it keeps a mapping table. Periodically, or whenever needed, FSDV is invoked. It talks with the storage device with normal I/O and simple commands to the virtualization layer. After FSDV runs, mapping table entries can be removed. What FSDV does is change file system pointers from logical pointers to pointers to physical addresses. Doing so requires only small file system changes and device changes, and it can still use the normal I/O interface. I'll show you how it works with this example. First imagine an offline process, which is the simplest case: FSDV first unmounts the file system, then processes individual files, and finally mounts the file system again. In this example there's an inode, some indirect blocks, and finally data blocks. FSDV looks at the pointers and changes them from logical to physical, from the bottom up. It first looks at the bottom layer and asks the virtualization layer for the physical address of L one. The virtualization layer returns P one to FSDV, and FSDV changes the pointer to P one, and similarly from L two to P two. Then the indirect block is changed to point to P one and P two. After the change, FSDV writes this block to the device, and the virtualization layer can remove the mappings; similarly for P three and the other parts of the tree. Finally the top-level inode is processed and the final mapping can be removed. We also keep a log for consistency. This process... yes? >>: What kind of data structure do you use for the [indiscernible] mapping? If you do something like paged mappings, then removing only a few of those won't actually give much savings, because you still need to store the entire page containing the map. What is the current [indiscernible]... >> Yiying Zhang: This mapping table size? >>: [indiscernible] >> Yiying Zhang: It's just the mapping table for the device virtualization. It's a virtualized device, so either the device or some virtualization layer for the device will keep a mapping from logical addresses to physical addresses. >>: So [inaudible]. >> Yiying Zhang: So the process I just outlined is an offline process, but it has more performance overhead. What I did is make it work online, and that's the part that I'm working on now. The first optimization is that it only processes files that have changed since the last run: from the last run of FSDV it keeps a record of which files have been changed and only processes those files in the next run. The other online piece is to allow foreground I/Os to proceed while FSDV is running. FSDV only processes files that are closed and not in the page cache, and whenever it's processing a file it blocks all I/Os to that file.
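To make the FSDV pass concrete, here is a minimal C sketch of the bottom-up pointer rewrite: for each logical pointer, ask the virtualization layer for the physical address, rewrite the pointer using the offset-encoded physical address space described later in the Q&A, write the block back with normal I/O, and then let the layer drop the mappings. The structures and helper names are illustrative assumptions, not the actual tool, and the consistency log mentioned in the talk is omitted for brevity.

```c
#include <stdint.h>

typedef uint64_t blkaddr_t;

/* Assumed split of the address space: addresses below PHYS_OFFSET are
 * logical, addresses at or above it are offset-encoded physical ones. */
#define PHYS_OFFSET (100ULL << 30)   /* e.g. a 100 GB device */

struct fs_block {
    blkaddr_t ptr[512];   /* pointers held by an inode or indirect block */
    int       nptr;
};

/* Assumed helpers provided by the virtualization and block layers. */
blkaddr_t virt_layer_query(blkaddr_t lba);      /* LBA -> PBA            */
void      virt_layer_drop(blkaddr_t lba);       /* remove mapping entry  */
void      write_block_back(struct fs_block *b); /* normal write I/O      */

void fsdv_process_block(struct fs_block *b)
{
    blkaddr_t done[512];
    int ndone = 0;

    for (int i = 0; i < b->nptr; i++) {
        blkaddr_t addr = b->ptr[i];
        if (addr >= PHYS_OFFSET)            /* already de-virtualized */
            continue;
        blkaddr_t pba = virt_layer_query(addr);
        b->ptr[i] = PHYS_OFFSET + pba;      /* encode as a physical pointer */
        done[ndone++] = addr;
    }
    write_block_back(b);                    /* persist rewritten pointers first */
    for (int i = 0; i < ndone; i++)
        virt_layer_drop(done[i]);           /* then the mappings can go */
}
```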
To summarize the FSDV work: we propose a simple tool for de-virtualization that changes file system pointers to point to physical addresses. It requires only small changes to the file system and the device, and it can still use the normal I/O interface. Now I'll present some of my other work and future work, and finally conclude. The part I talked about just now is about de-virtualizing a single flash-based SSD, but we can also have flash-based arrays. One problem that I looked at is correlated failure. Imagine we have mirrored SSDs: the mirroring pair will receive the same write patterns, so they will die at the same time. What I did is add what we call dummy writes, so that one device fails sooner than the other. Another piece of work I did while interning with Microsoft Research is duplicate-aware arrays, which use inherent duplication for availability and durability. Flash can also be used as a cache: I helped with a project on solid-state caches, and there's also the problem of storage-level cache warm-up. Basically, when you have a large SSD cache, which can be hundreds of GB, you need to warm it up, and on-demand warm-up doesn't work anymore. So I did some analysis of data and built some tools to make cache warm-up much faster. For future work I've been thinking about mainly two directions. The first is continuing this identification and removal of excess redundancy. What nameless writes and FSDV do is remove the redundancy in pointers or mappings. There can also be other kinds of pointers, like in de-duplication, or other kinds of devices that have a virtualized indirection layer, like shingled disks, if you're familiar with those. There's also content redundancy, which has been looked at before. But one thing that's new, at least that I haven't seen and that I think is interesting, is algorithmic redundancy. Think about having a lot of layers: if you have, say, I/O schedulers at different layers, they can be redundant or, even worse, contradictory. All of this is because we have more and more complex software, software stacks, and hardware, so there will always be some redundancy. Another direction that I've been thinking about is new abstractions and interfaces for things like storage class memory and software-defined storage. What I think is that the interface should be more customizable, and exposing the right amount of information from the lower layer can be beneficial. I can talk more about these things offline. Finally, to conclude, let me use the famous quote, "All problems in computer science can be solved by adding another level of indirection," which is usually attributed to Butler Lampson, who actually attributes it to David Wheeler, and David Wheeler usually adds, "but that usually will create another problem." That problem is excess virtualization, or indirection. To solve this problem I first proposed the interface called nameless writes and built an emulation and a hardware prototype for it. Now I'm working on the file system de-virtualization tool, which is a lightweight tool that doesn't have that complexity and fits more situations. It improves both space and performance. To finally conclude, the first thing I want to say is that in my PhD experience I found that reality is important. So it can be, ha, ha, looks like everybody knows, ha, ha.
It can be quite different from your research, and even if you don't want to actually build the real thing, it's always good to know the reality so that you can maybe tell others how you might want to build it. Another aspect is theory, which is what I've been thinking about more recently: is there any information theory behind this redundancy of X, where X can be anything? I welcome more discussion along this line. Thank you, questions? [applause] >>: I have a question about the online FSDV. Can you go back to that slide, the one after the offline one, the one where it's actually a running system with files changing? While you're getting there: while it's running, the file system has some files whose mapping pointers point to physical addresses and some that are still [indiscernible], but ones that are actively... >> Yiying Zhang: Yeah. >>: And how can you do this without changing the file system? >> Yiying Zhang: So there's only one small change that we need. You point out that a block address could be either logical or physical if you're not careful. So basically we offset the whole physical address space beyond the logical one. If you have a one hundred GB device, then zero to one hundred GB represents the logical address space and one hundred to two hundred GB is the physical address space. >>: Okay, so they both might map to the same actual cell on the... >> Yiying Zhang: Yes, yes, but whenever the device sees an address beyond one hundred GB it knows it's [indiscernible]. >>: But you've got to go and fiddle with the in-memory mapping information of the file system while it's mounted. >> Yiying Zhang: Yes. >>: File systems tend to take kindly to that... [laughter] >>: But that stuff is really fascinating. >>: And you assume that you see in the page cache all instances of blocks that might be cached in any way by the file system? So if the file system [indiscernible] its own cache, or the file system kept a pointer to a logical block hanging around somewhere in a variable but it wasn't in the page cache, then all bets are off. >> Yiying Zhang: Yeah, this part I'm still working on, but for now I only look at the page cache and at files that are closed; at least the function paths that I've seen all [indiscernible] the page cache, uh-huh. >>: So I just want to go back to deleting files quickly. It still seems like to delete a file the file system would have to go back and explicitly mark every block in the file as free. >> Yiying Zhang: You are talking about the bitmap? >>: About? >> Yiying Zhang: Is it the bitmap that the file system is using? >>: Well, in a current file system if you have a big file, you delete it and the file system just does some metadata updates, maybe it frees the inode... >> Yiying Zhang: That's the bitmap. >>: But then when you write a new file on top of the old space you can just issue writes to those old blocks, the old blocks get overwritten, and that's fine; that's how it gets deleted. >> Yiying Zhang: Right. >>: But with the nameless writes, where the SSD is picking where to write, it has to know that the block is allowed to be overwritten. So it seems like when you delete a file the interface would have to change, and when a big file is deleted millions of block deletion requests go down to the SSD to mark all those blocks as free.
>> Yiying Zhang: So, at least the way I implemented this, it collects the set of blocks that have been deleted and then sends it to the device. The device invalidates the actual physical addresses only when it receives this, what we call the trim or free command. >>: But trim is not like truncate; what's the difference between trim and delete? >>: The file system [indiscernible] trimmed down when... >>: During a checkpoint it goes and marks the other bitmap. If it was done [indiscernible]... >>: So trim is just a silly name? >>: Yeah. >>: Okay. >>: It's clearly a device-level attribute, so. >>: Okay. >>: Is this [inaudible]? >>: It's at the block level. >>: Yeah. >>: Okay. >> Yiying Zhang: But the bitmaps don't need to change. Actually we don't use bitmaps anymore, other than for the virtual address space. The file system only needs to know how many blocks have been allocated so that it doesn't exceed the total amount. So there's only this trim command, which doesn't send the actual data; it only tells the device which blocks it can free. >>: Okay, another quick question right here. It seems like you're saying a major benefit is that we're using less RAM, and you showed some RAM comparison numbers. There were a couple of things I wasn't sure you were accounting for. One is that you're taking away some data structures but you're adding some new ones. So did that comparison account for the fact that there was extra metadata being kept around in your scheme that the original scheme didn't need? >> Yiying Zhang: So there are two places you keep things. One is the internal DRAM, where the mapping table has to sit all the time, at least for the hot working set. The other is the OOB area, which we use for storing associated metadata. It doesn't need to be accessed all the time; it only needs to be accessed whenever there's a callback or a recovery, so it doesn't need to be loaded into RAM... >>: [inaudible]... >> Yiying Zhang: And it's cheaper with flash than with... >>: I think... >> Yiying Zhang: ...the internal DRAM. >>: Also the SSD can't get rid of the entire mapping table, because you still need the mapping table for the portion that's used for the inodes and such. >> Yiying Zhang: Right. >>: So what ratio of data to metadata was that comparison assuming? >> Yiying Zhang: The results I showed use this tool, Impressions. It generates different types of file system layouts, but usually metadata is less than ten percent, so one to ten percent. That decides how much space you need for the nameless write mapping table. >>: Okay. >> Yiying Zhang: Uh-huh. >>: Have you heard about this new type of hard drive which is a combination of SSD and [indiscernible], and there's a hint command with that hard drive? Have you heard about the hint command? >> Yiying Zhang: Hint? >>: Hint is a new command set. >> Yiying Zhang: I have heard of the hybrid drives but I haven't heard of the hint command. >>: Hint is basically like trim... >> Yiying Zhang: Oh. >>: It's another hint from the OS to the hard drive saying a certain area is hotter... >> Yiying Zhang: Okay. >>: ...than others. So trim basically says I'm not going to use this area anymore... >> Yiying Zhang: Yeah. >>: ...and hint basically says this area is going to be used more often. >> Yiying Zhang: Okay. >>: So for that kind of hybrid hard drive, what do you think the file system should do to adapt to use it [inaudible]?
>> Yiying Zhang: What do you mean? You are saying the file system should adapt to... >>: What I basically mean is: is there anything else the file system should do to use those hybrid drives more efficiently? >> Yiying Zhang: So one thing: a hybrid drive can probably detect the temperature of the data itself, but with more file system support, one thing I think could maybe improve performance is that the SSD part has some internal operations like wear leveling and garbage collection; it could tell the file system whenever it's doing those, so the file system can schedule things differently. Did that answer... >>: [inaudible] >> Yiying Zhang: Okay. >>: Right, I see. >> Yiying Zhang: Thank you. [applause]