>> Ed Nightingale: I think we'll get started. Jay will probably get started. So it's my great pleasure to introduce Haryadi Gunawi. He has done great work in a wide variety of forums, in SOSP and OSDI and FAST and ISCA, giving us a deep understanding of the way storage systems propagate and handle failures in both hardware and software. So today he's going to be telling us about how we can create more reliable storage systems. So thank you.
>> Haryadi Gunawi: All right. Thanks, Ed. And thank you all for coming. So today I'm going to talk about my research contributions towards building more reliable storage systems. So again, just like Ed says, my research is about file systems, which are an important piece of the operating system. And if you look at years of file system research, basically we will find many innovations in many different aspects. For example, people have tried to improve performance, by improving allocation policies, improving write performance, and many other aspects. Functionality has also received a great deal of attention: people have tried to make file systems scalable so that we can store millions of files, and people have tried to incorporate a search interface within the file system so that we can easily search our files. And finally, reliability has also been a focus, simply because the hardware fails and no one wants to lose their data. What I want to focus on in this talk is the reliability aspects of file systems. Okay?

So since my research is about reliability, here's the roadmap of my reliability research. My research was driven by this unfortunate reality that storage systems still fail today, and the failures that arise are quite diverse, as we will see soon. And the file system has to manage all these kinds of failures. So the first question that I ask is: how do we know that today's file systems are doing a good job in dealing with such failures? The first part of my research is to measure the reliability of today's file systems in dealing with storage failures. It turns out that in today's file systems, failure management is very complex, which leads to many problems that I will show you later. Okay? So given this other unfortunate reality, the second part of my research is to build more reliable file systems that can deal with storage failures in a better way. Okay? And in order to do that, I will adhere to the principle that complexity is the enemy of reliability, so my hope is to go beyond today's file systems so that we can build tomorrow's file systems that are powerful and reliable but with simplicity in their design. Okay?

So given such a roadmap, here are my contributions in a more concrete way. Since this is file system research, there are two important components that I have looked at. The first one is the file system checker, or in short FSCK; and the second one is the file system itself. When we talk about FSCK, this is a very important utility that should repair any damaged file system and bring your file system to a consistent and usable state. But when I measured the reliability of today's file system checkers, I found that they are not really reliable: some repairs are missing, and some repairs are actually corrupting, so the resulting file system becomes unusable because they bring more damage to your file system. Okay? And I believe the reason why these problems exist is the complex implementation of today's checkers.
All checkers that I know of are still written in low-level C code, which is hard to reason about. Okay. So to improve this situation, I will introduce you to the idea of SQCK, which is a robust file system checker that uses a declarative query language such as SQL, so that we can write hundreds of checks and repairs in a much more clear and compact manner.

Now, when we look at the file system itself, again it turns out today's file systems are not really reliable, because some storage failures are being ignored and the existing policies are inflexible; they are very hard to change. So if you want to add new policies, it's just hard to do in today's file systems. And I believe these problems exist because the reliability code that deals with storage failures in today's file systems is very scattered, and there are no good abstractions for dealing with storage failures. So in order to improve this situation, I will introduce you to the idea of I/O shepherding, which is a novel reliability layer where we can deploy centralized, flexible, and powerful policies. Okay. And again, in providing these solutions, my principle is simplicity.

All right. So here's the outline of my talk. In the next section, I'm going to give a little background on my research: why are we still doing reliability research although file systems have been out there for more than two decades? Okay? And I'll show you how storage subsystems fail today, and then I'll move on to my two major contributions, which are improving the construction of file system checkers with SQCK and simplifying reliability management with I/O shepherding. For the rest of the time I'm just going to briefly describe my research in other areas spanning software engineering, databases, distributed systems, and networking, and then I'll [inaudible] work and conclusion.

All right. So why are we still doing reliability research? Because I believe there are new failure models that haven't really been looked at in detail. If you look at the old fault models, there are two fault models that have been pretty much solved today. The first one is to anticipate whole-system crashes. People realized that the whole system can crash, and that's not good for a file system, because a file system performs multiple updates, and if a crash happens in the middle, you can have inconsistency on disk. Okay? So basically file system researchers introduced the ideas of logging and journaling so that we can perform atomic updates to the disk. The second fault model is to anticipate whole-disk failures. Again, people realized that the whole disk can be lost, and if that's the case, then our data can be lost permanently. So Patterson, Gibson, et al. basically suggested the idea: let's just have more disks, and if we have more disks we can do simple mirroring like this, or, to reduce the space overhead, we can do a parity scheme. But the whole point is that if a disk is lost, we can still reconstruct the data. Okay? But we will see that today's modern storage failures are much more diverse, beyond whole-system crashes and whole-disk failures; so the model of failure has changed a little bit, and I will show you why. The reason is that what's underneath the file system is not just a simple disk, but a stack of complex components. We have the device driver, we have the controller, we have the firmware, [inaudible] components and media.
And the point is that all these components are not always reliable. People have seen how these components can fail in practice. So basically my point is that from the file system perspective there's a broad range of failures that can arise. Not only do we have whole-disk failures, but we also have partial failures where some [inaudible] on the disk become inaccessible. Sometimes we have intermittent faults, sometimes we have permanent faults, and sometimes we have these kinds of crazy failure modes where a write is suddenly misdirected, torn, or lost. Okay? So over time, because of all these problems, our data can be lost or corrupted. Now, if you wonder how disk failures happen in practice -- yes?
>>: I couldn't follow why are these new failures [inaudible] I mean [inaudible].
>> Haryadi Gunawi: Right. This -- right.
>>: You were seeing these failures [inaudible] exist before?
>> Haryadi Gunawi: Yeah. They existed before, but the community considered them anecdotes. It's like they don't believe that disk failure happens unless --
>>: Not that the failures are new, it's that the [inaudible].
>> Haryadi Gunawi: Right. Yes, yes, yes.
>>: You just said that the failures are new.
>> Haryadi Gunawi: Right. I guess new in the sense that we need to look at these kinds of failures much more than we have before. So I guess new in that sense. Right. Okay.
>>: [inaudible] fairly low rate failures but as you get bigger and bigger storage systems the chance of their occurring gets larger.
>>: So as I understand it, much of what you said applies to disks, but nowadays there are solid state [inaudible] flash starting to replace disks in many sites, including wireless. Can you say a few words about that?
>> Haryadi Gunawi: Very good point. Well, first, I don't think many people have looked into the reliability aspects of flash devices. That's the first thing. And if you look again at the storage stack, when we look at a flash drive, it's just the lowest-level component that has been changed. But in the middle you will still have the device driver for the drives, right? And as flash drives become more popular, usually the device [inaudible] will be much more complex in dealing with everything that is within the device. So when we talk about file systems, they need to deal with all the things happening underneath, not just the media.
>>: [inaudible] going to be possibly applicable to flash [inaudible].
>> Haryadi Gunawi: Yes, I believe so. Thank you. All right. So as you can see, these failures happen in practice, and I want to focus on the last two, which come from a large-scale study done by my colleagues and people from the storage industry, in this case Network Appliance. They studied 1.5 million drives that they sold to their customers, and within this population, four percent of the drives actually exhibited latent sector errors, where some part of the drive becomes inaccessible. And among this population, they found 400,000 blocks had been corrupted, so these blocks can still be read by the file system, but the content has been corrupted. Okay? All right. And these are not the cheap storage subsystems that we use on a daily basis; these are million-dollar storage systems that they sell to their customers, but they still see these kinds of failures. Okay? Okay.
So the whole point is that storage subsystems fail in diverse ways, and what we look at is how today's file system checkers repair any inconsistency or any damage in your file system, and we will also look at the file system component itself, which is part of the operating system, and how it deals with storage failures. And again, although my research has been within these core operating systems, I believe the principles and techniques can be extended to other research areas, from personal devices to large-scale cloud computing as well. Yes?
>>: Why don't you [inaudible] a need for an offline checker [inaudible] updates [inaudible].
>> Haryadi Gunawi: All right. Because over time your data can be corrupted not just because of a crash, right? Sometimes your media just wears out, or sometimes you have a head crash that scratches a portion of your disk, so you will lose some part of your disk. And in that case, your file system becomes inconsistent. Right? For example, a middle directory might be lost, so you will lose access to all the lower directories. So you need a file system checker to repair everything.

All right. Let's move on to my first contribution, which is SQCK. What I'm going to show you is the problems that I found in today's file system checkers, and of course the solution after that is SQCK: how we can simplify the construction of file system checkers, and then the evaluation of SQCK. Before showing you the problems we found, let me just briefly describe the file system data structures. In this case I take the ext2 file system. An ext2 file system always starts with the super block, which has information about the layout of the file system. The ext2 file system is divided into groups, and each group is described by a group descriptor block. The group descriptor block has information about the location of the inode table, and the inode table is basically an array of inodes. An inode can represent a file, or it can also represent a directory. And an inode basically has pointers to data blocks: if the inode represents a file, then the data blocks are your user data, but if the inode is a directory, the data blocks contain directory entries. Okay? And if the directory is large or the file is large, the inode points to an indirect block, which eventually points to data blocks. Okay?

Now, when we talk about storage failures, there can be sector errors where some blocks become inaccessible. So if you do not replicate your inodes, then you might lose some of your directories or files. If you do not replicate the indirect block, for example, you might lose all the pointers to the data blocks. Okay? So these are the kinds of failures file systems face. Other corruptions can also occur. For example, the inode has a pointer to the location of its indirect block, and that pointer can be corrupted so that, for example, it suddenly points to the super block, okay? And the file system checker needs to catch these kinds of corruptions and fix them. Okay?

All right. So let's see the reliability of today's file system checkers. In this case, I analyzed the ext2 file system checker, e2fsck. The task of e2fsck is that of a typical FSCK: basically it cross-checks all the internal metadata, finds any inconsistency, and repairs that inconsistency. Okay?
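To make the on-disk relationships concrete, here is a minimal C sketch of the structures just described. The field names and widths are simplified and hypothetical; the real definitions live in the Linux kernel's ext2 headers.

    /* A minimal sketch of the ext2 on-disk layout described above.
       Fields are simplified and hypothetical, not the kernel's real
       ext2 structure definitions. */
    #include <stdint.h>

    #define NDIR_BLOCKS 12

    struct super_block_sketch {
        uint32_t blocks_count;       /* total blocks in the file system */
        uint32_t blocks_per_group;   /* layout information: group size  */
        uint32_t first_data_block;
    };

    struct group_desc_sketch {
        uint32_t block_bitmap;       /* location of this group's block bitmap */
        uint32_t inode_bitmap;
        uint32_t inode_table;        /* location of the array of inodes */
    };

    struct inode_sketch {
        uint16_t mode;                    /* file or directory            */
        uint32_t block[NDIR_BLOCKS];      /* direct pointers to data blocks */
        uint32_t indirect;                /* pointer to an indirect block,
                                             whose entries point to further
                                             data blocks; if corrupted, it
                                             may point anywhere, e.g., at
                                             the super block */
    };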
So for example, e2fsck needs to check that an indirect pointer does not point to the super block, because an indirect pointer should point to an indirect block. It should also check that a subdirectory is only reachable from one directory, and overall e2fsck needs to do a total of 150 cross-checks. Okay? In order to understand the reliability of today's e2fsck, I injected a single corruption at a time, and we want to see how e2fsck fixes the corruption, how it repairs the corruption, okay? I want to show you two examples of this fault injection. In the first example I corrupt an indirect pointer so that it points to the super block, and we'll see how e2fsck repairs this. In the second example I corrupt a directory entry so that it points to another directory's subdirectory.

So let's see the first example. Here's one problem that I found in today's e2fsck. This is a problem of inconsistent repair. What I mean by this is that the resulting file system actually becomes more inconsistent and more unusable, okay; that's why I call it inconsistent. And the reason is out-of-order repair, as we will see soon. What I'm going to show you first is what an ideal FSCK should do. Let's say we have an inode, the inode has an indirect pointer, and the indirect pointer is corrupted so that it points to the super block. What you want to do is check the validity of the pointer, right? And since this is a corrupt pointer, we want to clear that pointer so that it points to nothing. But if the pointer is correct, what we want to do next is check the content of the block it points to. For an indirect block, the check we want to do is to scan its entries; each entry is basically the location of an actual data block. For each entry we want to find locations that fall outside the file system range. So if, say, the third entry looks corrupt, we fix that entry to zero. Okay. So that's what an ideal FSCK should do.

Let's see what e2fsck does. What e2fsck does illustrates the problem of out-of-order repair, because it assumes that the super block is an indirect block. It assumes this indirect pointer is correct. So in that case, it will try to check the content of the indirect block, which is actually the super block, and since the super block doesn't look like a healthy indirect block, it accidentally clears some of the fields in the super block. Only later does it check the validity of the pointer and, oops, it turns out that the indirect pointer is an invalid pointer, and it clears it. But as you can see, the super block has already been corrupted. It's too late. Okay. So that is the first problem, which we call out-of-order repair.
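To make the ordering issue concrete, here is a minimal C sketch of the two orderings; valid_block and scan_indirect_entries are hypothetical helpers, not e2fsck's real API.

    #include <stdint.h>

    struct inode_sketch { uint32_t indirect; /* pointer to indirect block */ };

    extern int  valid_block(uint32_t blk);            /* hypothetical helper */
    extern void scan_indirect_entries(uint32_t blk);  /* checks/fixes entries */

    /* Ideal order: validate the pointer BEFORE trusting its target. */
    void check_indirect_ideal(struct inode_sketch *ino)
    {
        if (!valid_block(ino->indirect)) {
            ino->indirect = 0;                /* corrupt pointer: clear it */
            return;
        }
        scan_indirect_entries(ino->indirect); /* only now inspect contents */
    }

    /* Out-of-order repair, the observed e2fsck behavior: the target's
       contents are "repaired" first, so a corrupt pointer aimed at the
       super block lets the scan clobber super block fields before the
       pointer itself is found invalid and cleared. */
    void check_indirect_out_of_order(struct inode_sketch *ino)
    {
        scan_indirect_entries(ino->indirect); /* may damage the super block */
        if (!valid_block(ino->indirect))
            ino->indirect = 0;                /* too late: damage is done */
    }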
Let me show you the second -- yes?
>>: What is the goal in fact here? Is it creating a usable [inaudible] or [inaudible].
>> Haryadi Gunawi: Yes, two goals. Usable. And the second thing is to repair the file system to match the original file system to the greatest extent possible. And I'll show you in the second example here that the resulting file system is usable, but it doesn't match the original file system to the greatest extent possible. Okay?
>>: So format would also meet that test.
>> Haryadi Gunawi: I'm sorry?
>>: Format -- if I reformat the volume, I end up with a consistent file --
>> Haryadi Gunawi: Right. But [inaudible] right.
>>: It doesn't really do very well --
>>: It doesn't match [inaudible].
>> Haryadi Gunawi: Yeah. Thanks for bringing that up. So here is another example, of incorrect repair. I will show you what an ideal FSCK should do. Let's say we have a directory A1, A1 has a subdirectory A2, and a directory B1 has a subdirectory B2. And let's say we have a corruption such that the directory entry in A1 suddenly points to B2. Fortunately, in these file systems each directory maintains a backward pointer to its actual parent, so everyone can know the true parent-child relationship, which is B1 and B2. And we can identify the corrupt pointer, and A2 will be put in lost and found. Okay? So that's what an ideal FSCK should do.
>>: [inaudible] point, too?
>> Haryadi Gunawi: Yeah. But you cannot -- the whole idea is, what kinds of pointers do you trust? Since A2 has just one pointer pointing to A1, and A1 doesn't claim that A2 is its child, in today's e2fsck you cannot claim that that's the true parent-child relationship. Okay?
>>: [inaudible] sort of lost and found.
>> Haryadi Gunawi: Lost and found is just a directory underneath the root directory. If there is any directory or any file that is not reachable from the root directory when FSCK is run, those files and directories will be put inside the lost and found directory. All right. So what happens in e2fsck is brutal, because it selects A1 as the actual parent, simply because A1 coincidentally has a lower inode number than B1. So although B2 claims that B1 is its actual parent, e2fsck doesn't care: it forces B2 to accept A1 as its parent, and B1 is just a sad parent because it just lost its kid. Okay. So this is pretty much a kidnapping problem that happens in e2fsck. But the whole point is that e2fsck does not use all of the available information to perform a correct repair. Yes?
>>: Do you have any sense for how common these sorts of failures are [inaudible].
>> Haryadi Gunawi: Good. Well, we don't have statistics on failures at this [inaudible] at this high level. But as an offline file system checker, you need to handle the worst case of failure. Right? And this is an obvious case where you should not perform a repair like this, because you have enough information to repair things correctly.

All right. So as a summary of the problems: you have seen the first two, which are inconsistent repair and consistent-but-not-correct repair, where the resulting file system doesn't match the original file system to the greatest extent possible; there are also other problems that I can talk about offline. The whole point is that when we have these problems, what we want to say is: let's just fix these problems, let's just eliminate them. But doing that in the current framework would not be so easy. In fact, one FSCK [inaudible] said the same thing to me: you might introduce more problems. And the reason for that is the complex implementation of today's file system checkers. All checkers that I know of are still written in low-level C code, which is hard to reason about. As a result, the implementations are large and complex. For example, e2fsck needs to do about 150 checks in 16,000 lines of code, while the ext2 file system itself is less than 10,000 lines of code. Another FSCK needs to do 240 checks in 22,000 lines of code. And if you look at the code, it's basically hundreds of cluttered check statements. Okay? And there are several bad implications. It's difficult to combine the available information on disk to perform correct repairs.
It's difficult to ensure correct ordering of repairs. And it's hard to find missing checks or incorrect checks in the current framework because of the cluttered code. Okay?
>>: Is there a reason that FSCK is harder C code to write than -- because 16,000 lines of code is not substantial. NTFS is 350,000, which [inaudible] sometimes works. Right? So is there some particular reason to believe that FSCK is harder than other --
>> Haryadi Gunawi: Right. So what I'll show you is that the basic task of FSCK is to do hundreds of cross-checks. That's what it needs to do essentially, right? But in terms of the implementation, what I'll show you is that we do not need to implement things in C code, which is what happens in today's file system checkers, where all the data traversal -- the loading of the data -- and the cross-checking are cluttered together. Basically, I'll provide a better framework for doing this.
>>: [inaudible] something fundamental about checking the file system that makes it more difficult to reason about than building the file system itself?
>> Haryadi Gunawi: Yes, well, in a sense I believe yes. Because it needs to find any inconsistency, right? And for each inconsistency, sometimes you want to do a certain kind of repair. That's why, when people ask me why I don't write a modeling language that tells what the truth about the file system is, the problem is that even if I use a modeling language, it's usually hard to express the repair. What FSCK does is find an inconsistency, and for that specific inconsistency you want to do a specific repair. So that's the hard part. And it has to find any possible damage in your file system. Yes?
>>: So [inaudible] file system code is [inaudible] and all these are either caused by --
>> Haryadi Gunawi: Oh, no. For file system checkers, you need to anticipate that the file system code can be buggy, which basically leads, in the end, to inconsistency on the disk. Right.
>>: So in a way [inaudible] you have two specifications, right? One is the file system code itself that implements a specification, and then the checker [inaudible] implemented [inaudible] the same file system. So do you check whether these two match or --
>> Haryadi Gunawi: Well, I wouldn't [inaudible] but it would be an interesting step to do. Uh-huh. All right. So bottom line, what happens -- yes?
>>: [inaudible] talked about formulating this problem as an optimization problem because [inaudible] recover some pointers that from a node to another node I'll say [inaudible] address in the [inaudible] given all the [inaudible]. And you can look at the whole thing globally and try to find something optimal globally.
>> Haryadi Gunawi: I think that's a good point. I haven't explored that. Well, I believe that with the framework I will show later it's probably easier to do, because all the checks are basically sequenced in a better way than in cluttered code. Yes. Okay. So bottom line, what happens out there is that this FSCK code is untouchable: it is very crucial recovery code, but it's so hard to fix. If you fix this and you introduce more problems, then you are in bad shape. Okay? So obviously we need new solutions. I'll talk about the SQCK architecture and how we can write simple checkers. The whole point is that we need to build a better file system checker framework. And again, the point here is that the original task of FSCK is already complex, because it needs to cross-check many things.
And we do not want to combine it with a messy design, which will lead to complexity and unreliability. Okay. So again, my principle is that complexity is the enemy of reliability: I want to simplify the framework without losing any power or sacrificing any performance. The whole idea of SQCK is a robust file system checker that uses a declarative query language such as SQL, so we can write hundreds of checks and repairs in a very clear and compact manner. The whole point is, if you look at the nature of a check, it is: please find an inconsistency in my file system. And if you look at the nature of a query, it's pretty much the same thing: please find something in the database. Yes?
>>: Well, are you assuming you can't make any changes to the file system to add additional check -- additional information as it's being generated to help you later in FSCK?
>> Haryadi Gunawi: Well, okay -- no, you can do that. Because when you do -- I'll show you how you can use SQCK. Basically, first we load file system data into database tables. So during that particular -- I will tell you more in the next slide. Okay?
>>: Okay.
>> Haryadi Gunawi: All right. Thanks. All right. So there are lots of benefits. Again, the high-level intent of the checks can be clearly specified, as I will show you soon. Basically, you just write fewer lines of code. And it's easy to cross-check and repair using all of the information, because that's what the SQL query language was built to do from day one. Okay? All right. So let's see how we can use SQCK. We take a file system image and load the file system metadata into database tables. If you want to add information while you are reading the image, to help the checks, you can do that in this particular phase. Okay. Then, the whole point is that since all the information we want to cross-check and repair is stored in database tables, we just write all these checks and repairs in a declarative manner with a query language, okay? And if there are any modifications, we flush those modifications to the file system image so that the resulting file system is consistent. Yes?
>>: It's going to be [inaudible].
>> Haryadi Gunawi: Yes. This is an in-memory database. Well, so far what we have -- I haven't really looked into the limitations. If we need backing storage, how are we going to do that? But even in today's file systems, with e2fsck, for example, if the metadata doesn't fit in memory, e2fsck [inaudible] that it cannot run. So I think that's another design issue when we're designing file system checkers. Good point.
>>: [inaudible].
>> Haryadi Gunawi: Sorry?
>>: You can't have a serious process [inaudible] property that you can't run FSCK on it if it's [inaudible] its metadata didn't if it [inaudible].
>> Haryadi Gunawi: Yeah. So that's why in today's e2fsck, what they do is sometimes build a summary, and then if they find an inconsistency in the summary, they will read the metadata again. So again, that's another design issue. For me, since we just run a database, as long as the database can use backing storage, I believe it's just another instantiation of backing storage in this case. Okay? But --
>>: [inaudible] that you could safely write to while you have an inconsistent file --
>> Haryadi Gunawi: Yeah, yeah. So that's a vulnerability, because when we run this, the storage itself might be broken at that particular time.
>>: That's something that you cared about.
>> Haryadi Gunawi: I'm sorry?
>>: You might find out that your database just [inaudible] you care about, but you didn't know because the file system was inconsistent when you [inaudible].
>>: He did say [inaudible]. [brief talking over].
>> Haryadi Gunawi: Okay. All right. So in the next couple of slides I'm going to show you how we can write simple checks with SQCK. Yes?
>>: [inaudible] the loader is pretty simple, right? It's pretty straightforward.
>> Haryadi Gunawi: Yes.
>>: One of your enemies is complexity, and it knows what's going on in the file system, and you hope that step is very simple.
>> Haryadi Gunawi: Yes.
>>: Right?
>> Haryadi Gunawi: Uh-huh.
>>: So is that true?
>> Haryadi Gunawi: Yes, very true. That's very true. Yeah. If you know the file system structures, you just load the file system structures. And when I compare SQCK with the original FSCK, I only compare the checking part. I do not compare the loading -- the scanner part -- because it's very easy; there's not much complex logic there.

All right. So here's how we can write simple checks. This is one check that e2fsck needs to do: it needs to find a block bitmap that is not located within its block group. This is very simple range checking. Here's what you get with e2fsck: the core logic of the check is hidden in implementation details. But here's what you get with SQCK, and it's very simple: the query just returns block bitmaps that do not reside between the start block and the end block of their group. Okay?

I'm going to show you a slightly more complex example. This is again the idea that we are trying to find false parents: directory entries that point to a subdirectory that already belongs to another parent. In this case, we need to cross-check all directory entries, and as we noted, this is wrongly implemented in e2fsck, which leads to the kidnapping problem. Here's what you get in e2fsck. Well, no one will understand this code unless you wrote it. And anyway, this is the wrong implementation. Okay? So let's just throw this away, and I can introduce you to a new query which fixes the check. We do three simple selections. First, we scan all child pointers. In this case, we omit entries number one and two, because entries number one and two are the dot and dot-dot entries; those two entries should be checked in another query. After this first selection we have P saying that C is its child. Then we do a second selection where we scan all parent pointers, so we scan entry number two, which is the dot-dot entry. After the second selection we have C saying that P is its parent, so we can establish the true parent-child relationship here. And we do a third selection where we find F, which is not equal to P, but F also claims that C is its child, and we just return information about the false parents.
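To give a flavor of what such queries might look like, here is a hedged sketch of the two checks just described, written as SQL strings that the C checker framework would hand to the database. The table and column names (GroupDescTable, DirEntryTable, and so on) are hypothetical; the real SQCK schema may differ.

    /* A sketch of the two checks described above, with a hypothetical
     * schema; the real SQCK tables and columns may differ.
     *
     *   GroupDescTable(group_id, block_bitmap, start_block, end_block)
     *   DirEntryTable(dir_ino, entry_ino, entry_num)   -- entry 1 is '.',
     *                                                     entry 2 is '..'
     */

    /* Check 1: block bitmaps that fall outside their block group. */
    const char *bitmap_range_check =
        "SELECT g.group_id, g.block_bitmap "
        "FROM   GroupDescTable g "
        "WHERE  g.block_bitmap NOT BETWEEN g.start_block AND g.end_block";

    /* Check 2: false parents. Selection one: P claims C as a child.
     * Selection two: C's '..' entry names P as its true parent.
     * Selection three: a different directory F also claims C. */
    const char *false_parent_check =
        "SELECT f.dir_ino AS false_parent, p.entry_ino AS child "
        "FROM   DirEntryTable p, DirEntryTable c, DirEntryTable f "
        "WHERE  p.entry_num > 2 "             /* skip '.' and '..'      */
        "  AND  c.dir_ino   = p.entry_ino "   /* look at C's own entries */
        "  AND  c.entry_num = 2 "             /* the '..' entry          */
        "  AND  c.entry_ino = p.dir_ino "     /* C says: P is my parent  */
        "  AND  f.entry_num > 2 "
        "  AND  f.entry_ino = p.entry_ino "   /* F also claims C         */
        "  AND  f.dir_ino  <> p.dir_ino";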
Yes?
>>: [inaudible] sort of code complexity, it seems like there's still a lot of stuff on this slide, right, as compared to the other slide. And so you're just expressing it in this SQL way, which you sort of find in your experience is simpler to revisit? Because it still seems like you have a lot of conditions and --
>> Haryadi Gunawi: Right. But if you look, a check is basically expressed in this one simple query. If you look at the original e2fsck, if you want to find a particular check you find a lot of C code, and you need to know where each data structure was read before, where it's stored, and everything like that. So the whole point is that the data traversal and the logic of the check are cluttered together. But here you don't care about the data traversal, because the database community has done a great job defining SQL, right? Here you just express the logic of the check and it does the traversal all by itself.
>>: I'm just wondering how much [inaudible] you get back over the C code. I mean, if you just sat down and --
>> Haryadi Gunawi: Right. In terms of the logic of the code, I'll show you later that we can write many fewer lines of code with SQL statements. Okay?
>>: And it seems like, I don't know, maybe with lambda calculus you could prove things, so that you don't have one test here that's almost undoing some other test, in a way that I don't think you could do when writing it by hand in C. I mean, as you go to, you said 340 checks for one file system, how do I know that test 37 isn't undoing what test 97 just fixed?
>> Haryadi Gunawi: Right. So there's another question: how do you check the checker? This is kind of [inaudible].
>>: Right.
>> Haryadi Gunawi: So what I've done is to simplify the framework so that hopefully we can have a better world by verifying the checker rather than verifying the C code. Okay. All right. So that's SQCK, and I have evaluated SQCK on four axes.
>>: But if all you did so far is collect, find problems [inaudible].
>> Haryadi Gunawi: I'll show you. I have fixed the existing problems, and I even introduced new repairs.
>>: I thought you were getting into evaluation already, and I didn't understand how you --
>> Haryadi Gunawi: Yes, flexibility and reliability. Okay. Thanks. Okay. So, four axes of evaluation: simplicity, reliability, flexibility, and performance. In terms of simplicity, I have written e2fsck in the form of 150 queries, for a total of just 1,000 lines of SQL statements. We need C code to combine all the SQL statements, but that C code is very simple: you just combine all the SQL statements. You can compare that with the 16,000 lines of the original implementation in C. For reliability, I don't claim that I have tested everything, but so far we have injected hundreds of corruption scenarios, and SQCK has passed all of them. The point is that, again, if you find a check or repair is missing, you just add a query. If you find a check or repair is buggy, you simply fix the query. Okay?
>>: [inaudible] lines of code were written in the [inaudible] the C code? Or did you -- because you [inaudible].
>> Haryadi Gunawi: MySQL. I just used MySQL.
>>: So I don't know too much about this area, but you said that you injected several different fault or failure examples into the file system. [inaudible] but I don't find that surprising since it's you who wrote the tests.
>> Haryadi Gunawi: Yeah. That's why I mentioned that I don't claim that I have 100 percent coverage. What I have done is inject these hundreds of corruption scenarios, and if I found that my checker didn't fix one, I added a query. But I agree with you: there is another issue of how we can get 100 percent coverage, to test any corruption in my file system. But my point, in these two bullet points, is that if you find you have missing repairs or missing checks, you just add a query. You do not need to add C code as in e2fsck.
>>: Are there any problems your implementation discovered that the original C code would not detect?
>> Haryadi Gunawi: I will show you later. There are some checks that are missing. Okay.
>>: So I had a related question to that. So you said one of the goals was to also detect bugs in the operating system code with [inaudible] twice to [inaudible]. So why not inject [inaudible] in the file system and then see if you're able to fix them?
>> Haryadi Gunawi: Right. That relates to the testing part. I haven't really done that, but it is doable. The whole point is that bugs in the file system code can be seen as corruptions underneath the storage. In the end, from the FSCK perspective, the file system is inconsistent. So you need to find any inconsistency in the file system and repair it.
>>: That would be a further test of your --
>> Haryadi Gunawi: Right, right, right. So that falls under this testing role that I haven't fully explored. Yes?
>>: Can you give us some intuition for how many tries it took you to get things right? The real question is the expressiveness of the language: if you hadn't had this test infrastructure to keep running until you got it right, how many times could you have done it?
>> Haryadi Gunawi: Oh, [inaudible] so two months before that, I borrowed a MySQL book from the library, so that shows you how -- that's what I cannot tell the program committee. But that's the whole point: how simply we can find things with these queries.
>>: The intuition is the number of iterations you went through to check [inaudible].
>> Haryadi Gunawi: Oh, yeah, yeah, yeah.
>>: Did you get it the first time, did it take two --
>> Haryadi Gunawi: Yeah, it takes two, three iterations. But you just focus on this one localized query, right? You do not need to handle other things; it's not cluttered, basically.
>>: You're 90 percent sure you got it right the first time?
>> Haryadi Gunawi: It depends on the cross-check. I don't have the slide, but there are cross-checks that must be done across multiple instances of multiple different structures, and those particular cross-checks are hard. But there are some cross-checks that you do across some fields within one structure, and those are simple: you can do one iteration and get it right. So it depends on the type of cross-check that you do.
>>: If some corruption scenario is not being handled, is it [inaudible] to figure out which [inaudible] to fix, or -- like there are 150 queries, so is it --
>> Haryadi Gunawi: Right. So the next thing you can do -- this goes back to the whole idea of how you check the checker, right? It would be nice to have a formal model of what the checker should do and compare the model to the sequence of queries that I run.
>>: [inaudible] my mind it looks like -- I mean, you have 150 queries and you see some corruption is not being fixed. How do you go back from that observation to which [inaudible].
>> Haryadi Gunawi: Right. So the checker is basically defined in phases, okay? In the first phase, for example, you check the super block and you check the group descriptor block -- pretty much just the very minimal structures. So if your corruption deals with those few structures, you handle it in the first phases.
But if your corruption relates to, say, a link [inaudible] or a different, more complex check, usually you do that in a different phase. Right? So you must get a sense of what the phases are within today's file system checkers, and if you find a corruption that relates to a certain phase, you look at the queries for that particular phase. Does that answer?
>>: What's the most complicated query in the 150? Like, how many blocks or inodes does a query have to check? Because in the example you showed, like a [inaudible] right, A1, B1, B2, then you only need the [inaudible]. What's the most complicated need [inaudible] blocks or inodes [inaudible].
>> Haryadi Gunawi: Yeah, well, that one is actually -- if you look at the C code it's complex, but it turns out the queries are not that complex. Sometimes you do complex checks, like link counts. But the point is that the complex checks are cross-checks that you must do across multiple structures and across many multiple instances of --
>>: That's kind of what MySQL is for anyway, right? I mean [inaudible].
>>: Since he's only halfway through his talk, can we wait a little bit and let him --
>> Haryadi Gunawi: Okay. In terms of flexibility, the point is that I have improved e2fsck by [inaudible] the wrong repairs and adding new repairs, so these are the types of repairs that you cannot find in today's [inaudible]. But the take-away point is that you can add these new repairs in a few lines of code. Okay? And performance, just quickly here. We compare SQCK with e2fsck on four different file systems of different sizes; each file system is half full. The Y axis shows the runtime normalized [inaudible] to e2fsck. So here's e2fsck time in seconds, and here's SQCK time relative to e2fsck. The whole point is that SQCK stays within 1.5 times the original e2fsck performance, and there are many optimizations that we can still introduce.
>>: [inaudible].
>> Haryadi Gunawi: Right. So for this one, we have about 300,000 files, a mix of small files and large files. All right. So what we have is this picture of the problems in today's file system checkers and how we can have a better world with SQCK.

So I'm going to move on to the file system part. Again, I'm first going to show you the problems in today's file systems; I'm just a co-author for this particular paper, which was led by [inaudible], who is a researcher at Microsoft Silicon Valley. Then I'm going to solve these problems with I/O shepherding, and then we will evaluate I/O shepherding. Okay? So let's see the problems that we found together with [inaudible] and other colleagues here. Here, basically, we [inaudible] reliability, and the way we do that is to inject read and write faults on different block types. And when we inject these faults, we run different file system operations. Okay. The result that I'm going to show you is that in today's file systems we have this problem of inconsistent, incomplete, scattered, and inflexible policies. And I'll show you the progression of how we come to this conclusion. Okay. So basically we want to measure the reliability of the read recovery policies, so we run a bunch of workloads: open, read, write, and so on. And when we run these workloads, we inject read failures on different block types, okay? Basically all the gray boxes mean not applicable [inaudible], and we found four recovery strategies in ext3's read recovery policies. So let's just take one simple box here.
So for this particular [inaudible], what we did is run a [inaudible] workload, like cd, and during this workload we injected read failures on the indirect block. The policy that we observe is that the failure is detected and propagated to the application, so we mark this with a horizontal bar that means propagate, okay? But the take-away point here -- you can just focus on this red area -- is that for the same read failure on the indirect block, the ext3 file system takes different recovery responses. Sometimes it propagates, sometimes it retries the operation, sometimes it ignores the failure, and sometimes it stops the file system. Okay? So that's how we can conclude that this is inconsistent.

And these are the write recovery policies. Just quickly, you can focus on all these circles. They mean that all write failures happening when you run these workloads, on these block types, are being ignored by the file system. So if this happens in your case, your data or your file will suddenly be lost. Okay? This is what I call incomplete policies.

Okay. So how should we deal with storage failures? Well, there are lots of techniques out there, and each has performance and reliability trade-offs. The point I want to make is that sometimes you want to deploy a particular set of policies. Let's say you want to add mirroring and checksums to ext3, okay? In principle, we should be able to just do that, but what I'm going to show you is that doing that in the current framework is not so easy. And the reason is the scattered reliability code in the ext3 file system. Okay? The X axis shows the source files in ext3, and the Y axis shows the line numbers where storage failures are managed in the file system. As you can see, there are basically hundreds of places. So if I want to introduce new policies in this file system, I must do it everywhere. That's why it's inflexible.
>>: What do you mean by reliability code?
>> Haryadi Gunawi: Well, basically the file system needs to handle read failures and write failures, right? And the file system does that in many different locations; what the file system tries to do is handle each fault at each I/O location. That's why everything is scattered. Does that answer your question? It's basically handling read and write faults of different block types. Okay. So we have this problem, and it's not only happening in ext3; when I talk informally to people in the storage industry, they also mention that these million-dollar proprietary file systems have the same problems. But of course we cannot really say much about that. So it means that we need a new reliability framework for dealing with storage failures. Okay?

So I'm going to show you the idea of I/O shepherding: the goals and the architecture. The whole point of the shepherd is this: it is a new layer underneath the file system, pretty much integrated with the file system, where it locally takes care of the reliability of each request. The file system sends reads and writes, and they are intercepted by the shepherd. And we'll see later how we can add reliability within the shepherd. There are three important goals. The first one: as you have seen, the original policies are very much scattered throughout the file system code.
What we want to do is localize everything within the shepherd, okay? The whole idea is that with localized policies, we can have more correct, less buggy, and much simpler reliability management. And you could probably do verification within that layer, too. The second goal is that, since everything is localized, it's very easy to achieve flexibility. Okay? We can deploy different kinds of policies for different environments or requirements. For example, you might want to add mirroring to protect an archival volume, or you might want to add checksums to protect scientific data against corruption. But the whole point is different policies for different requirements and/or environments. Okay? And the third goal is that we can combine one or more basic policies to form more powerful policies. Okay. So to achieve this [inaudible] yes?
>>: Can you [inaudible] what layer the shepherd is at? Does it think about blocks, does it think --
>> Haryadi Gunawi: Blocks. Blocks, yes. But it's pretty much integrated with the file system, because sometimes what you want to do within the shepherd requires understanding the block types of the file system. So you need to understand the ext3 block types. If you want to deploy shepherding for Windows NTFS, for example, you need to know the NTFS block types within the shepherd. Yes?
>>: So how does the low-level subsystem know whether it's archival, or are you assuming --
>> Haryadi Gunawi: I assume file system administrators. Yes. Basically, I assume that it is the file system administrators who will write these kinds of policies. So far, when we have a file system, it's usually the file system developers who write the policy, right, and file system administrators need to understand the file system code in order to change the policy. But the whole point is that within the shepherding layer we allow file system administrators to compose different policies, because everything will be simple in this layer.

All right. So to achieve these goals, I'm going to describe a little bit of the architecture: the four important components in the I/O shepherd, which are the policy table, policy code, primitives, and policy metadata. Okay? Basically, we want to build a new reliability framework in file systems, so we ask the first question: how do we specify reliability policies? Because we do not want to introduce another ad hoc solution. So I start from these two facts. The first fact is that file systems usually have different block types, and different block types have different levels of importance. For example, the super block might be more important than an inode block, or an inode block is more important than a data block, okay? And the second fact is that different volumes usually require different reliability levels. If you have an archival volume, you usually want more protection than for a temporary volume. So what we need here is the ability to specify fine-grained policies, and we can achieve that with the policy table. The first benefit of the policy table is that we can deploy different policies for different block types. So, for example, if the shepherd sees a write to the super block, we say we want to execute this mirroring code, with three-way mirroring of the super block; but for an inode or indirect [inaudible] block, we say that we want to add checksums and parity. For data we do not want to add redundancy, so we just do a simple retry. Okay?
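As a rough illustration, here is a minimal C sketch of what such a policy table might look like; the types and handler names are hypothetical, not the real I/O shepherd interface.

    /* A sketch of a per-volume policy table mapping block types to
       policy code; all names here are hypothetical. */
    enum block_type { BT_SUPER, BT_INODE, BT_INDIRECT, BT_DATA, BT_MAX };
    struct ioreq;                            /* a block I/O request */

    typedef void (*policy_fn)(struct ioreq *req);

    /* Hypothetical policy-code entry points. */
    extern void mirror3_policy(struct ioreq *);   /* three-way mirroring */
    extern void checksum_parity_policy(struct ioreq *);
    extern void retry_policy(struct ioreq *);

    /* The shepherd consults the table on every write and runs the
       policy code registered for that block type, as described above. */
    struct policy_table {
        policy_fn on_write[BT_MAX];
    };

    struct policy_table archival_table = {
        .on_write = {
            [BT_SUPER]    = mirror3_policy,
            [BT_INODE]    = checksum_parity_policy,
            [BT_INDIRECT] = checksum_parity_policy,
            [BT_DATA]     = retry_policy,     /* no added redundancy */
        },
    };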
The second benefit is that we allow file system administrators to install different policy tables for different volumes. Again, for an archival volume you can have a policy table that employs more protection than the policy table used for a temporary volume. Okay? The second question in building this architecture is how we can write simple but powerful policies. In short, we provide lots of primitives that are reusable across different policies. The whole idea is that we try to hide the complexity of handling these failures behind the primitives. And since that's the case, the job of the policy writer, like the file system administrator, is just to compose these primitives into fully formed policy code, okay? So the policy code in general should be very simple.
>>: Do they [inaudible] --
>> Haryadi Gunawi: C code, basically. But it's just simplified. Okay? All right. So let's put it all together. Normally, if you have a request, you just send it to the storage subsystem. With the shepherd, we can modify the request or add more I/Os to add reliability. For example, if we see a write to data block D, we can specify a policy table that says that for data blocks we want to mirror the data. So every time the shepherd sees a write to a data block, it will call this policy code, and the policy code should be simple: it just calls a bunch of primitives. For example, it will look up the mirror map that is provided by the shepherd. Let's say this is a new data block and there is no replica yet. So the policy code will call another primitive, back into the file system, asking for a new block for the replica, and then it will update the mirror map, indicating that D is mirrored to R. And we can write the code that sends these two blocks to the disk.
>>: This is mapped out [inaudible] within the file systems?
>> Haryadi Gunawi: The mirror map?
>>: The map allocate function? Like, are you calling each --
>> Haryadi Gunawi: No, this one is calling back to the file system, because we must always appear to --
>>: So the shepherd is linked into the file system apparently.
>> Haryadi Gunawi: Yes, yes, right. [inaudible]. And the mirror map will be updated by a background primitive. So the whole point is that we can interpose on I/Os and add reliability in a very simple way. Okay?
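Here is a hedged sketch of the data-block mirroring policy just walked through. The primitive names (lookup_mirror_map, fs_allocate_block, update_mirror_map, write_blocks) are hypothetical stand-ins for the shepherd's real primitives.

    #include <stdint.h>

    typedef uint32_t block_t;
    #define NO_REPLICA ((block_t)0)

    struct ioreq { block_t block; /* block being written, plus buffer, etc. */ };

    /* Hypothetical shepherd primitives. */
    extern block_t lookup_mirror_map(block_t d);
    extern block_t fs_allocate_block(void);          /* calls back into the FS */
    extern void    update_mirror_map(block_t d, block_t r);
    extern int     write_blocks(struct ioreq *req, block_t a, block_t b);

    /* Policy code run on every write to a data block D. */
    int mirror_data_policy(struct ioreq *req)
    {
        block_t replica = lookup_mirror_map(req->block);

        if (replica == NO_REPLICA) {
            /* New data block, no replica yet: ask the file system for
               a new block R, then record that D is mirrored to R. The
               map itself is flushed later by a background primitive. */
            replica = fs_allocate_block();
            update_mirror_map(req->block, replica);
        }

        /* Send both copies to disk. */
        return write_blocks(req, req->block, replica);
    }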
In terms of implementation, the shepherd infrastructure is around 3,500 lines of code, and much of this code is reusable for other file systems. Some of this code is very much integrated with the file system, like the block types, allocation, and so on. In terms of integration into the operating system, I needed to modify ext3, but so far the modification is just 900 lines of code. I do want to emphasize the most challenging integration, which is integrating I/O shepherding with the consistency management in ext3, in this case the journaling layer. And my apologies, I cannot show the problem here, because it takes 30 to 45 minutes to establish the problem and the solution. But the whole point is that the shepherd adds new data and metadata to add reliability, and those additional data and metadata need to be consistent in the presence of crashes. When I wanted to integrate that with the journaling, I found that there is a major flaw in the journaling layer: it cannot react to checkpoint failures. I can talk about that offline. But the whole point is that due to this major flaw, you see that all write failures are being ignored in ext3, IBM JFS, and XFS too. That's because of this major flaw.
>>: So [inaudible] present even without your I/O shepherding [inaudible]?
>> Haryadi Gunawi: The flaw -- if I want to formally define the flaw -- is that when we look at individual block failures, the journaling cannot handle them. The journaling layer only handles consistency in the presence of crashes. Basically, the whole point of journaling is that you put your blocks in the journal area first and then you checkpoint them to their final locations. When you checkpoint to the final locations, the current journaling layer says that you cannot change a transaction that has been committed. So if you want to do remapping or mirroring or something like that, where you change metadata that has been committed in a transaction, it's too late.
>>: So if I understand your answer, this flaw shows up any time you try to extend the file system. Is that right?
>> Haryadi Gunawi: No. Even, let's say -- let's not take I/O shepherding; let's say you just want to add a remapping policy to the ext3 file system. You cannot do that.
>>: Okay.
>> Haryadi Gunawi: You cannot do that and achieve consistency in the presence of crashes, because of this major flaw.
>>: So [inaudible] pretty much analogous to logging in databases. So do you have any idea what the commercial databases do when they see write failures when they're trying to do installs?
>> Haryadi Gunawi: In databases they have this idea of compensating transactions. The whole idea with databases is that they log the old data also, along with the new data. But in terms of dealing with individual block failures, I haven't really looked into how database management systems handle that.
>>: [inaudible] compensating transactions for dealing with [inaudible] and things that are higher level rather than immediate failure.
>> Haryadi Gunawi: Right. Yeah.
>>: You must have faced this problem because it's exactly the same problem.
>>: [inaudible] already decide the transaction --
>>: If you committed the transaction, right, you've got the dirty data in the log and the checkpointer is running along trying to install stuff from the log.
>> Haryadi Gunawi: Right.
>>: Into the volume and -- I mean, maybe it just blows the volume. But that's harsh.
>>: It pops up a message: buy a better disk.
>>: Well, I mean, yeah, we have the [inaudible] and you could have the delayed write-back fail, right, which is not very helpful.
>> Haryadi Gunawi: I think the idea is that when individual block failures come into the picture and we look at how today's systems try to handle them, it's not really complete, in the sense --
>>: [inaudible].
>> Haryadi Gunawi: Yeah, I don't know.
>>: Usually they're ahead of us, so --
>> Haryadi Gunawi: Yeah. So one of the [inaudible] contributions here is this whole idea of chained transactions, and I can talk about that more offline.
>>: So I have a question [inaudible].
>> Haryadi Gunawi: I'm sorry?
>>: Not [inaudible].
>> Haryadi Gunawi: Pretty much it's simple changes. Like, for example, the shepherd needs to call back to the file system to get a new block --
>>: [inaudible] how many files for these 900 lines [inaudible] was it like five lines in every file, or was it mostly all in a couple of places?
>> Haryadi Gunawi: No, mostly localized. Like, for example, the allocation policies.
The reason why I needed to introduce more is that the original file system code is so messy that I could not reuse it; I take part of the code and write new code. Yup. And part of this is the consistency management.
>>: [inaudible] mentally add up the sizes of all the files in ext3 that you had, I came up with about 10,000. Is that how big it is?
>> Haryadi Gunawi: The ext3 file system? 20,000.
>>: That's what I thought.
>> Haryadi Gunawi: Yeah.
>>: Okay. So you're still adding [inaudible].
>> Haryadi Gunawi: Yeah. Well, part of this is the new chained-transaction solution, right, and that's just one of the things I must do to preserve consistency.
>>: Does the 20,000 include JBD [inaudible]?
>> Haryadi Gunawi: Yes. The full [inaudible]
>>: [inaudible] transactions that were [inaudible] because it seems like --
>> Haryadi Gunawi: Yeah, yeah, yeah, you can. All right. So let's [inaudible] with I/O shepherding briefly here. I'll show you how we can achieve flexibility, [inaudible] and simplicity. So the whole point: again, these are the original [inaudible] policies, which can be summarized in this picture. Let's say we want to throw away all these policies and add a simple retry policy. We just install a policy table that looks like this: if the shepherd sees a read to any block type, it calls this policy code. It reads the block, and if that fails, it retries. And if the retries keep failing, it just returns the failure. Okay? So I ran all the workloads again and injected the faults again, and as you can see, we get a different kind of recovery model. Okay. And I can show you how we can do fine-grained policies. Say we want to change the ext3 [inaudible] policies to custom policies, like a combination of different policies depending on the block type. The take-away point here is the overall behavior; you can just focus on the colors. If it is a write failure on data blocks, it will do what I specify in the policy table, which is retry and propagate, and the same for journal and metadata blocks. Okay? And writing the policies is overall simple: so far I've written eight policies, and the most complex one can be written in 80 lines of code. Okay.
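As an illustration of how small such policy code can be, here is a hedged sketch of the simple retry policy just described; read_block and the retry count are hypothetical, not the shepherd's real primitive names.

    #define MAX_RETRIES 3      /* illustrative; a real policy might tune this */

    struct ioreq;
    extern int read_block(struct ioreq *req);   /* hypothetical primitive */

    /* Policy code run on a read to any block type: retry a failed read
       a few times, and if it keeps failing, return the failure. */
    int retry_read_policy(struct ioreq *req)
    {
        int err = 0;

        for (int tries = 0; tries < MAX_RETRIES; tries++) {
            err = read_block(req);
            if (err == 0)
                return 0;      /* success */
        }
        return err;            /* retries exhausted: propagate the failure */
    }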
>>: So [inaudible] how is it that [inaudible] you have a set of aspects [inaudible]. Looks like you are taking the [inaudible] recovery aspects of [inaudible]. >> Haryadi Gunawi: Good question. Honestly, I haven't really explored that. What I have read is one paper in which some researchers mention that AOP doesn't work well for failure handling, and I've talked to one professor at Northwestern who says that when we talk about fault handling and aspect-oriented programming, there are too many cross-cutting concerns. But to be honest, I don't know all the details. People who have looked into AOP and failure handling say that it does not really [inaudible], okay? But I'd love to talk about that later if you want. Okay? All right. So I will not give you too many technical details; I just have a couple of pictures that I want to show, so you can just sit back and relax. The first one is static analysis of error propagation. The problem definition is simple. You want to do one operation, so you call a bunch of functions, and these functions call a bunch of other functions as well. If there are low-level failures, an error code such as EIO -- which stands for an I/O error -- will be returned, and this error code must be propagated as long as the error has not been handled, or at least there should be a printk statement, for example. At the very least, you should not have silent failures. But as you can see, these calls basically ignore all the error codes returned by the callee functions, okay? This is what we call bad calls, because failures are being ignored: even though there are failures, they still look like success to the caller. Okay? Which is bad. So, to understand the magnitude of this problem, I wrote a static analysis. Here's the result for the ext3 file system: all these nodes represent functions, all the edges represent function calls, error codes propagate upward, and only the functions and function calls touched by the error codes appear in this [inaudible]. In reality there are many more functions and function calls, okay? So again, you can focus on the [inaudible], because those are the places where failures are being ignored. So here, ext3 has pretty much 35 places where failures are ignored; IBM JFS, 61; the NFS client, 54; and behold, XFS, 105 places, okay? And this is the most complex file system. Okay. When I did the [inaudible] analysis, what I found is that this is not a corner case, but there is somewhat of a pattern. For example, write failures are ignored far more often than read failures. Okay? I believe this is another hint of the reliability design issues in today's file systems: maybe it's hard to recover from failure, maybe it's hard to roll back in the middle of an operation. And you can look at the comments that I will show you later, which kind of hint at all these problems. So there are many questions still to be answered in this particular research. >>: [inaudible] to have more [inaudible]. >> Haryadi Gunawi: Because there are silent failures. There are failures, but the failures are not handled, and not even printed.
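A minimal illustration of the bad-call pattern that the static analysis flags; sync_block is a hypothetical callee standing in for any kernel function that can return an error such as -EIO:

    #include <errno.h>

    struct buffer;
    int sync_block(struct buffer *b);    /* may fail and return -EIO */

    /* Bad call: the return value is dropped, so a write failure becomes
     * a silent failure -- the caller sees apparent success. */
    void flush_unchecked(struct buffer *b)
    {
        sync_block(b);                   /* error code silently ignored */
    }

    /* Good call: the error is propagated up the call chain (it could
     * also be handled here, or at minimum logged, e.g. with printk). */
    int flush_checked(struct buffer *b)
    {
        int err = sync_block(b);
        if (err)
            return err;                  /* propagate, e.g. -EIO */
        return 0;
    }

The analysis walks the call graph looking for callers like flush_unchecked, where a returned error code is neither saved, checked, handled, nor logged.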
>>: But how do you know it's not just [inaudible]? >> Haryadi Gunawi: Oh, for some -- I mean, going back to the file system research, we see that write failures really are being ignored: when we look at the I/O operations sent by the file system, we do not see any more I/O, so if the write fails, too bad. Okay. All right. I've also done a little bit of work on DBMSes and data corruption. The whole idea is that the file system is not the only system that manages data in the storage subsystem; a DBMS also stores its own internal metadata structures. So we injected corruptions into MySQL's on-disk data structures (a sketch of this kind of injection appears at the end of this section), and we wanted to see whether MySQL could tolerate such failures. It turns out that the results are damaging: the server crashes; sometimes wrong results are returned to the users, which leads to security problems; sometimes records are lost. And I think the more ridiculous part is that MySQL also has a repair utility, and this repair utility ignores the [inaudible], and sometimes it even corrupts the metadata further. It also crashes when it tries to do the repair. Okay? So -- >>: Did you try it with a [inaudible] kind of an academic tool? >> Haryadi Gunawi: MySQL and Postgres; Postgres has pretty much the same problems. Well, Jeff -- no, he didn't say that they [inaudible]. >>: [inaudible] commercial. >>: You should look at DB2, because that one's been subsumed. >> Haryadi Gunawi: Yeah. Jeff Naughton actually told me that he can run this on DB2, but he cannot tell me the results anyway. >>: [inaudible]. A license agreement might -- >> Haryadi Gunawi: Okay. Again, I've done some work on deconstructing [inaudible]; the whole idea is that I developed techniques to derive policies without looking at a single line of source code, and there are many benefits we can get from the results. I've also done a little bit of work on deploying network services at user level. The whole point is that there are lots of TCP variants out there, but you must convince the operating system developers to deploy your TCP variant in the kernel. So the question for us is how we can expose information and safe control to applications so that we can deploy network services at the user level, okay? And I'd love to talk about this offline if you want. Let me just conclude with some future work. But first, I'm going to show you some comments that I found in problematic cases in file systems. In ext3: "There's no way of reporting error to user space, so let's just ignore it." In XFS: "Just ignore errors at this point; there's nothing we can do except try to keep going," and "Should we pass any errors back?" And in the SCSI driver: "To do: I think we need to handle failure." Okay. So I think the whole point is that there are many reliability problems in today's file systems, and even in the storage drivers. That's why my roadmap is to measure reliability first and then to build powerful, reliable file systems with simplicity in their design. What I've done, pretty much in one sentence, is make today's file systems, including file system checkers, more reliable by removing unnecessary complexity from their design, but without sacrificing power and performance. So again, for the offline file system checker, I've shown you how we can simplify the construction of file system checkers with the use of a declarative query language, and for the file system itself, how we can simplify reliability management with the shepherding layer. Okay?
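A minimal sketch of the kind of corruption injection mentioned above: flip one byte at a chosen offset in a DBMS storage file, then rerun queries and the repair utility to observe how they behave. The path and offset are arbitrary illustrative inputs, not actual MySQL internals:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Flip all bits of one byte at 'offset' in 'path'.
     * Returns 0 on success, -1 on any error. */
    int inject_corruption(const char *path, off_t offset)
    {
        int fd = open(path, O_RDWR);
        if (fd < 0) { perror("open"); return -1; }

        unsigned char byte;
        if (pread(fd, &byte, 1, offset) != 1) { close(fd); return -1; }

        byte ^= 0xFF;                         /* corrupt the byte */

        if (pwrite(fd, &byte, 1, offset) != 1) { close(fd); return -1; }

        close(fd);
        return 0;   /* next step: restart the server, rerun the workload */
    }

In the study described above, corruptions like this were targeted at the DBMS's internal metadata structures specifically, which is what exposed the crashes and the misbehaving repair utility.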
In terms of my future work, my vision is twofold. The first part is to build highly reliable systems, and I believe there are several important requirements. First, I believe that failure management should be revisited. I found a major flaw in today's journalling file systems; I still found flaws in error code propagation in today's file systems; and there are still many questions to be answered. When I talk to some systems [inaudible], they admit that they are aware of some of these problems, but they just let them go; they do not have a real solution. So at least on my part, there are lots of results that I can use as a database of problems to [inaudible] design issues in file system reliability. The second requirement is about simplifying systems, because I believe that tomorrow's systems will be much more complex and much larger than today's. If we keep writing this large and important software with a low-level approach, it might be hard to manage the failures. So it's a great challenge to come up with a high-level approach to describing how large systems should operate. And if we can come up with a good high-level approach, maybe we can also verify and formalize the reliability of these large systems. Okay? The second part of the vision is to build highly available systems, because I believe that reliability cannot stand alone: if you have a reliable system but low availability or low performance, most people will not use it. This is a lesson I learned from the search industry: there was one case where they needed to take the file system offline for four hours just to cross-check a database of data, and the users were really mad. There is a lot of work that can be done in this area; one short-term project is how we can build fast online repair as part of the file system itself, removing the need for an offline checker. Okay? Let me just briefly describe the scope of my future work. I'd love to work across system areas. In my research you have seen how I steal ideas from other areas, [inaudible] and databases. Again, I'll always [inaudible] large critical systems beyond file systems: I've done a little work on DBMSes, and it would be interesting to look at other systems, like distributed systems and cloud computing, as well. I also really look forward to the opportunity to evaluate new trends; storage [inaudible] is one example, and I want to look at how systems are built on top of these new drives. And it's just fascinating that today's system is not just one machine or one operating system, so I also look forward to the opportunity to look into cloud computing. Okay? So let me just conclude with this quote from Tony Hoare. He said that the price of reliability is the pursuit of the utmost simplicity [inaudible]. But that doesn't mean that we should remove the features that we have in today's systems; rather, we must accept that tomorrow's systems will have more features and will be larger than today's. That's why in my research I basically try to solve this new challenge: how to build large, reliable systems with simplicity in their design. Okay? So that's it. I'll take more questions now. >>: What about how common partial failures are? >> Haryadi Gunawi: Okay. >>: Did those measurement studies that you talked about at the beginning document that partial failures are more common than whole-disk failures?
>>: You mean intermittent failures? >>: Like [inaudible]. >>: Well, for whatever class of partial errors is relevant to this talk, right? Either intermittent failures or corrupted data -- either things that get fixed by a retry or things that were data corruption on the disk. >> Haryadi Gunawi: It happens all the time, and it turns out it also depends on the type of storage. If you buy SATA drives, which are cheaper, you find that they have more partial failures than SCSI drives. And there's a kind of infant mortality: within the first year or two after you deploy the drives, you sometimes see more partial failures than when the disk has been around for three or four years. There's also a correlation: if you see one latent sector failure, you'll usually find more sector failures on that disk, because overall the disk has just gone bad, right? So things just happen. >>: But how does it compare to [inaudible]? >> Haryadi Gunawi: To all this -- >>: [inaudible]. >> Haryadi Gunawi: Right. I can look at the numbers, but I don't have them off the top of my head. The whole message from the storage industry, though, is that when they look at these things, they need to build new machinery to deal with these partial failures. >>: [inaudible] failures [inaudible] but not enough [inaudible] these problems are real, serious problems. [brief talking over]. >>: I'm trying to understand what is unique about file systems. People have tried to make other systems simple, and there are similar approaches that use [inaudible] query languages and carve things up into independent components. What do you think the unique challenges are in the file system where there are -- >> Haryadi Gunawi: The main -- >>: [inaudible] approaches. >> Haryadi Gunawi: The most challenging part is that it's stateful. What you store on the disk is non-volatile. Compare it to networking: networking has done a great job in dealing with [inaudible] and everything, but there everything is on the fly. I can argue that in a network, pretty much, you send packets, and if one doesn't come back, you basically just try again. But with a file system, you store data, and maybe you do not look at that data for a year; still, one year from now you expect the data to still be there on the disk, right? So I think that's the most unique challenge in file systems: it is a stateful system. >>: [inaudible]. >> Haryadi Gunawi: Right, agreed. So when we look at P2P, then the file system [inaudible] comes into that picture. >>: [inaudible] in the file system [inaudible]. >>: So if you look at cloud-based storage, [inaudible] data to some service provider, and there are, you know, service-level requirements that say [inaudible] guarantee a certain level. In that context, do you think file system [inaudible] is too [inaudible] for the end user? Because I can just, you know, [inaudible] to this service provider and not have to worry about this thing. >> Haryadi Gunawi: Let me just -- let me know whether I get this correct or not. Even when we talk about cloud computing, one of the challenging parts of cloud computing today, as stated by a Yahoo cloud architect and many other architects, is data management: how do you deal with failures at a large scale?
Nobody has looked into failures at a scale as large as a cloud computing framework; what we have looked at is single-drive failures and the like. So when we look at large-scale failures, the whole principle of reliability comes into the picture there too: how we should deal with replication and data management under such large-scale failures. But for the end user, I agree: in the end there will be some reliability guarantee that the infrastructure provides, and users are lazy in the sense that you do not want to expose these details to them. So, yeah, my research is about giving more power to the infrastructure developers to build more flexible and more powerful policies. >>: I want to [inaudible] -- again, I'm also not the [inaudible] file systems, but it seems to me that disk technology has basically been out for a long time, and people [inaudible] reliability 10 years ago, and I don't have a sense that disks are somehow less reliable today than they used to be. And my question is: do you think we're going to be working on disk reliability 10 years from now? Are we making progress on this problem, or -- I mean -- >> Haryadi Gunawi: At the end of the story, if you look at the disk, for example, the density keeps increasing, right? You have lots and lots of bits per square inch of your drive. And if you look at some articles, they mention that when you want to build firmware for those drives, it's very hard, it's very complex, because you deal with very tiny things. So that's where that kind of [inaudible] comes in: when you have more complex hardware to deal with, usually the firmware is not as perfect as you want. Originally this was just anecdote, but if you look at articles starting about two years ago, people describe how drive firmware can be [inaudible] loss. And that's not just because of the media itself, but because of the firmware. So it really depends on the technology of the full storage subsystem; it's not just the hardware itself. >>: It's actually tragic, because we didn't understand how disks failed until the last five years; nobody had 100,000 disks sitting around to study to see how they [inaudible]. Your disk has a million lines of source code, and they just keep stuffing more functionality into it because the chips keep getting bigger. >>: So the answer is that nobody looked 10 years ago to see if [inaudible] were doing stuff like this, so we don't know the answer to your question. But there is anecdotal evidence of databases finding [inaudible] writes and other behaviors that couldn't be explained by a correctly functioning stack. So maybe the disks were broken back then, and maybe it was the software and the database, and probably it was all [inaudible]. >>: It seems like if everything else keeps getting more reliable, then disks will become more and more the least reliable thing. They become the longest pole in the tent over time. So you sort of have to keep up with the rest of the stack. >>: So I have a question, which kind of [inaudible] what some other people asked: failure -- the lack of error handling, or inappropriate handling of errors -- is a problem common to [inaudible] every code base, not just disk code bases.
What have you learned that could be [inaudible]? I mean, let's say you decide tomorrow that you're going to start looking at [inaudible] compilers or kernels or [inaudible]. What have you learned through the use of -- or is everything you figured out, all the tricks that you figured out, specific to this domain? >> Haryadi Gunawi: I guess the tricks, yes. But the principle very much extends. Just take one example from cloud computing, okay? There's this work by Jeff Walter [phonetic] from UCSD, and he mentioned that if you look at the code of this kind of cloud computing software, the service management part and the data management part are very much tangled together. So if you want to express different kinds of policies for different kinds of data management, you cannot do that in that particular software. What they do is decouple those two, and how you decouple them will be very specific to the problem you want to solve. So pretty much I would say that the principle can be extended, but for the details, I believe you just need to find your own tricks, yeah. >> Ed Nightingale: All right. Thank you. >> Haryadi Gunawi: Thank you. [applause]