>> Ed Nightingale: I think we'll get started. Jay will probably get started. So it's
my great pleasure to introduce Haryadi Gunawi. He has done great work in a
wide variety of forums, in SOSP and OSDI and FAST and ISCA, giving us a
deep understanding of the way storage systems propagate and handle failures in
both hardware and software. So today he's going to be telling us about how we
can create more reliable storage systems. So thank you.
>> Haryadi Gunawi: All right. Thanks, Ed. And thank you all for coming. So
today I'm going to talk about my research contributions towards building more
reliable storage systems. So again, just like Ed said, my research is about file
systems, which are an important piece of the operating system. And if you look at
years of file system research, basically we will find many, you know, innovations
in many different aspects.
For example, people try to improve performance by improving allocation
policies, improving write performance, and many other aspects. Functionality has
also received a great deal of attention. People try to make file systems
scalable so that we can store millions of files. People try to incorporate a search
interface within the file system so that we can easily search our files.
Finally, reliability has also been a focus, simply because hardware
fails and no one wants to lose their data. So what I want to do in this talk is to
focus on the reliability aspects of file systems. Okay?
So since my research is about reliability, here's the roadmap of my reliability
research. So my research was driven by this unfortunate reality that storage
systems still fail today, and the failures that arise are quite diverse, as we will see
soon.
And the file system must tolerate all these kinds of failures. So the first question
that I ask is: how do we know that today's file systems are doing a good job in
dealing with such failures? The first part of my research is to measure the
reliability of today's file systems in dealing with storage failures. It turns out what I
have found is that in today's file systems, failure management is very complex,
which leads to many problems that I will show you later. Okay? So given this
second unfortunate reality, the second part of my research is to build more
reliable file systems that can deal with storage failures in a better way. Okay?
And in order to do that, I will adhere to the principle that complexity is the enemy
of reliability, so my hope is to simplify today's file systems so that we can build
tomorrow's file systems that are powerful and reliable, but with simplicity in their design.
Okay?
So given such a roadmap, here are my contributions in a more concrete way.
So since this is a file system research, there are two important components that I
have looked at. The first one is the file system checker, or in short FSCK; and
the second one is the file system itself. So when we talk about FSCK, this is a
very important utility that should repair any damaged, you know, file system and
bring your file system to a consistent and usable state.
But when I measured the reliability of today's file system checkers, I found they
are not really reliable. I found that some repairs are missing and some repairs
are actually corrupt, so the resulting file systems become unusable because they
bring more damage to your file systems. Okay? And I believe the reason why
this problem exists is the complex implementations of today's
checkers. All checkers that I know of are still written in low-level C code,
which is hard to reason about. Okay.
So to improve this situation, I will introduce you to the idea of SQCK, which is
a robust file system checker that uses a declarative query language such as
SQL, so that we can write hundreds of checks and repairs in a much more clear
and compact manner.
Now, when we look at the file system itself, again it turns out today's file systems
are not really reliable, because some storage failures are being ignored and the
existing policies are inflexible; they are very hard to change. So if you want to
add new policies, it's just hard to do in today's file systems. And I believe these
problems exist because the reliability code that deals with storage failures in
today's file systems is very scattered, and there are no good abstractions for dealing
with storage failures. So in order to improve this situation I will introduce you to
the idea of I/O shepherding, which is a novel reliability layer where we can
deploy centralized, flexible, and powerful policies. Okay.
And again, in providing these solutions my principle is simplicity. All right. So
here's the outline of my talk. So in the next section, I'm going to give kind of a
little background of my research. Why are we still doing reliability research
although file systems have been out there for more than two decades? Okay?
And I'll show you how storage subsystems fail today, and then I'll move on to my
two major contributions, which are improving the construction of file system
checkers with SQCK and simplifying reliability management with I/O
shepherding. And for the rest of the time I'm just going to briefly describe my
preliminary research in other areas -- software engineering, distributed
systems, and networking -- and then I'll [inaudible] work and conclusion.
All right. So why are we still doing reliability research? Because I believe
there are new failure models that haven't really been looked at
in much detail. But if you look at the old fault models, there are
two fault models that have been pretty much solved today. The first one is to
anticipate whole system crash: people realized that the whole
system can crash, and that's not good for a file system, because a file system
performs multiple updates, and if a crash happens in the middle, you can have
inconsistency on the disk. Okay? So basically file system researchers came up with the
idea of logging and journaling so that we can perform atomic updates to the
disk.
The second fault model is to anticipate whole disk failures. Again, people
realized that the whole disk can be lost, and if that's the case, then our data can be
lost permanently. So Patterson, Gibson, et al. basically suggested the idea: let's
just have more disks, and if we have more disks we can do simple mirroring like
this; or, to reduce the space overhead, we can do a parity
scheme. But the whole point is that if a disk is lost, we can still reconstruct the
data. Okay? But we will see that today's modern storage failures are much more
diverse, beyond whole system crash and whole disk failures; so the model of
failure has changed a little bit, and I will show you why.
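To make the parity idea concrete, here is a minimal C sketch of the scheme just
described; the function names and fixed block size are illustrative assumptions,
not code from any of the systems in this talk.

```c
#include <stddef.h>

#define BLOCK_SIZE 4096  /* illustrative block size */

/* Parity scheme sketch: the parity block is the XOR of all data
 * blocks in the stripe. */
void compute_parity(const unsigned char *blocks[], size_t nblocks,
                    unsigned char parity[BLOCK_SIZE])
{
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        unsigned char p = 0;
        for (size_t d = 0; d < nblocks; d++)
            p ^= blocks[d][i];
        parity[i] = p;
    }
}

/* If one disk is lost, its block is the XOR of the parity block and
 * all surviving data blocks -- the same loop with the roles swapped. */
void reconstruct_lost(const unsigned char *surviving[], size_t nsurviving,
                      const unsigned char parity[BLOCK_SIZE],
                      unsigned char lost[BLOCK_SIZE])
{
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        unsigned char b = parity[i];
        for (size_t d = 0; d < nsurviving; d++)
            b ^= surviving[d][i];
        lost[i] = b;
    }
}
```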
The reason is that what's underneath the file system is not just a simple
disk, but a stack of complex components. We have the device driver, we have
the controller, we have the firmware, [inaudible] components and media. And the
point is that all these components are not always reliable. People have seen
how these components can fail in practice.
So basically my point is that from the file system perspective there's a broad
range of failures that can arise. Not only do we have whole disk failures, but we
also have partial failures where some [inaudible] on the disk becomes
inaccessible.
Sometimes we have intermittent faults, sometimes we have permanent faults,
and sometimes we have these kind of crazy failure modes where a write is
suddenly misdirected, torn, or lost. Okay? So over time, because of all these
problems, our data can be lost and corrupted. Now, if you wonder how disk failures
happen in practice -- yes?
>>: I couldn't follow why are these new failures [inaudible] I mean [inaudible].
>> Haryadi Gunawi: Right. This -- right.
>>: You are seeing these failures [inaudible] exist before?
>> Haryadi Gunawi: Yeah. It existed before, but the community considered it
anecdotal. It's like they don't believe that disk failure happens unless --
>>: Not that the failures are new, it's that the [inaudible].
>> Haryadi Gunawi: Right. Yes, yes, yes.
>>: You just said that the failures are new.
>> Haryadi Gunawi: Right. I guess new in the sense that we need to
look at these kinds of failures much more than we have looked at them before. So I guess new in
that kind of sense. Right. Okay.
>>: [inaudible] fairly low rate failures but as you get bigger and bigger storage
systems the chance of their occurring gets larger.
>>: So as I understand it, much of what you said applies to disk, but nowadays
there is solid state [inaudible] flash starting to replace disks in many sites,
including wireless. Can you say a few words about that?
>> Haryadi Gunawi: Very good point. Well, first is that I don't think many people
have looked into the reliability aspects of flash devices. That's the first thing.
And if you look again at the storage stack, when we look at the
flash drive, it's just the lower level component that has been changed.
But in the middle you will still have this device driver for the drives, right? And
as flash drives are becoming more popular, usually the device
[inaudible] will be much more complex in dealing with any kind of things that are within
the device.
So when we talk about file systems, they need to deal with all the things
happening underneath, not just kind of the media stuff.
>>: [inaudible] going to be possibly applicable to flash [inaudible].
>> Haryadi Gunawi: Yes, I believe so. Thank you. All right. So as you can see,
these failures happen in practice, and I want to just focus on the last two, which
come from a large scale study done by my colleagues and people from the storage industry,
in this case Network Appliance. So they studied 1.5 million drives that they
sold to their customers, and within this population, four percent of the drives
actually exhibited latent sector errors, where some part of the drive becomes
inaccessible. And among this population, they found 400,000 blocks that had been
corrupted, so these blocks can still be read by the file system, but the content has
been corrupted. Okay?
All right. And these are not storage subsystems that we use on a daily basis;
these are kind of million dollar storage systems that they sell to customers, but
they still see these kinds of failures. Okay? Okay. So the whole point is that
storage subsystems fail in diverse ways. So we will look again at how today's
file system checkers should repair any inconsistency or any damage in
your file system, and we will also look at the file system component itself, which is
part of the operating system, and how it deals with storage failures.
And again, although my research has been within the core operating system, I
believe the principles and the techniques can be extended to other research areas,
from personal devices to kind of large scale cloud computing as well. Yes?
>>: Why don't you [inaudible] a need for an offline checker [inaudible] updates
[inaudible].
>> Haryadi Gunawi: All right. Because over time your data can be corrupted,
and not just because of a crash, right? Sometimes your media just wears out, or sometimes
you have a head crash that scratches a portion of your disk, so you will lose some
part of your disk. And in that case, your file system looks inconsistent. Right?
For example, your middle directory might be lost, so you will lose access to all
the lower directories. So you need file system checkers to repair
everything.
All right. Let's move on to my first contribution, which is SQCK. So what I'm
going to show you is the problems that I found in today's file
system checkers, and of course the solution after that, SQCK: how we can simplify the
construction of file system checkers. And then I'll evaluate SQCK.
So before showing you the problems we found, let me just briefly describe the file
system data structures. In this case I take the ext2 file system. So if we
[inaudible] -- an ext2 file system always starts with the super block, which has
information about the layout of the file system. An ext2 file system is divided
into groups. Each group is described by a group descriptor block. The group
descriptor block has information about the location of the inode table, and the inode table is
basically an array of inodes. An inode can represent a file, or it can also represent
a directory. An inode basically has pointers to data blocks, so if the
inode represents a file, then the data blocks are your user's file -- your data; but if
the inode is a directory, basically the data blocks contain directory entries. Okay?
And if the directory is large or the file is large, the inode points to an indirect block
which eventually points to data blocks. Okay? Now, when we talk about storage
failures, these can be like sector errors where some blocks become inaccessible;
so if you do not replicate your inodes, then you might lose some of your
directories or files. If you do not replicate the indirect block, for example, you
might lose all the pointers to the data blocks. Okay? So these are kind of the failures
in the file systems.
And other corruptions can also occur. For example, the inode has
pointers to the locations of its data and indirect blocks. A pointer can be
corrupted so that, for example, it suddenly points to the super block, okay? And the file
system checker needs to catch this kind of corruption and fix it. Okay?
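For readers following along, here is a heavily simplified C sketch of the on-disk
structures just described; the field names and widths are illustrative, not the
real definitions from the kernel's ext2 headers.

```c
/* Heavily simplified sketch of ext2's on-disk metadata; field names
 * and widths are illustrative, not the kernel's real ext2 headers. */
struct sqck_super_block {
    unsigned int inodes_count;     /* layout of the whole file system */
    unsigned int blocks_count;
    unsigned int blocks_per_group;
    unsigned int inodes_per_group;
};

struct sqck_group_desc {           /* one per block group */
    unsigned int block_bitmap;     /* should lie inside its group */
    unsigned int inode_bitmap;
    unsigned int inode_table;      /* location of the inode table */
};

struct sqck_inode {                /* a file or a directory */
    unsigned short mode;
    unsigned int   size;
    unsigned int   block[15];      /* 12 direct pointers, then
                                      (double, triple) indirect blocks */
};

struct sqck_dir_entry {            /* lives in a directory's data block */
    unsigned int   inode;          /* child's inode number */
    unsigned short rec_len;
    char           name[255];      /* entries 1 and 2 are "." and ".." */
};
```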
All right. So let's see the reliability of today's file system checkers. In this case, I
analyzed the ext2 file system checker. The task of E2FSCK is just that of a typical
FSCK: basically, it cross-checks all the internal metadata, finds any
inconsistency, and repairs that inconsistency. Okay? So for example, E2FSCK
needs to check that an indirect pointer does not point to the super block, because
an indirect pointer should point to an indirect block. It should also check that a
subdirectory should only be accessible from one directory. And basically
E2FSCK needs to do a total of 150 cross-checks. Okay?
So in order to understand the reliability of today's E2FSCK, basically I injected a
single corruption at a time. And we want to see how E2FSCK fixes the
corruption, how it repairs the corruption, okay? So I want to show you two
examples of this fault injection. In the first example, I corrupt an
indirect pointer so that it points to the super block, and we'll see how E2FSCK repairs this.
And in the second example, I corrupt a directory entry such that it points to another
directory. So let's see the first example.
So here's one problem that I found in today's E2FSCK. This is a problem of
inconsistent repair. What I mean by this is that the resulting file system actually
becomes more inconsistent and more unusable, okay? And the reason
is the out-of-order repair that we will see soon.
So what I'm going to show you here first is what an ideal FSCK should do. So
let's say we have an inode, and the inode has this indirect pointer, and the indirect
pointer is corrupted so that it points to the super block. What you want to do is to
check the validity of the pointer, right? And since this is a corrupt pointer, we
want to clear that pointer so that it points to nothing.
But if the pointer is correct, what we want to do next is to check the content of the
block that it is pointing to. For this indirect block, the check that we want to do is
to scan its entries; the entries are basically the locations of the actual data blocks.
And for each entry we want to find locations that fall outside the file system range.
So let's say the third entry looks corrupt; we will fix that entry to zero.
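As a sketch of the ideal ordering just described -- validate the pointer first, and
only then look at the block it points to -- consider the following C fragment; the
helper functions and the ENTRIES_PER_BLOCK constant are hypothetical, not
e2fsck's actual routines.

```c
/* Sketch of the ideal check ordering: validate the indirect pointer
 * BEFORE reading the block it points to. The helpers in_fs_range(),
 * is_metadata_block(), read_block(), and ENTRIES_PER_BLOCK are
 * hypothetical names for illustration. */
void check_indirect_block(unsigned int *ind_ptr)
{
    /* Step 1: is the pointer itself valid? A pointer into the super
     * block (or any other metadata region) is corrupt. */
    if (!in_fs_range(*ind_ptr) || is_metadata_block(*ind_ptr)) {
        *ind_ptr = 0;          /* clear it so it points to nothing */
        return;                /* and never touch the bogus block  */
    }

    /* Step 2: only now scan the indirect block's entries, zeroing
     * any entry that falls outside the file system range. */
    unsigned int *entry = read_block(*ind_ptr);
    for (int i = 0; i < ENTRIES_PER_BLOCK; i++)
        if (entry[i] != 0 && !in_fs_range(entry[i]))
            entry[i] = 0;
}
```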
Okay. So that's what an ideal FSCK should do. Let's see what E2FSCK does.
So what E2FSCK does -- and this is the whole idea of out-of-order repair -- is it
assumes that the super block is an indirect block. So it assumes this indirect
pointer is correct. So in that case, it will try to check the content of the indirect
block, which is actually the super block, and since the super block doesn't look like
a healthy indirect block, of course it accidentally clears some of the
fields in the super block. And only later does it check the validity of the pointer and, oops,
it turns out that the indirect pointer is a corrupt pointer, and it clears that. But
as you can see, the super block has already been corrupted. It's too late. Okay. That
is one problem, which we call out-of-order repair.
Let me show you the second -- yes?
>>: What is the goal, in fact, here? Is it creating a usable [inaudible] or
[inaudible]?
>> Haryadi Gunawi: Yes, two goals. Usable. And the second thing is to repair
the file system to match the original file system to the greatest extent possible.
And I'll show you in the second example here the resulting file system is usable,
but it doesn't match the original file system to the greatest extent possible. Okay?
>>: So format would also meet that test.
>> Haryadi Gunawi: I'm sorry?
>>: Format -- if I reformat the volume, I end up with a consistent file --
>> Haryadi Gunawi: Right. But [inaudible] right.
>>: It doesn't really do very well --
>>: It doesn't match [inaudible].
>> Haryadi Gunawi: Yeah. Thanks for bringing that up. So here's another
example, of incorrect repair. I will show you what an ideal FSCK should do.
So let's say we have a directory A1, A1 has a subdirectory A2, and a directory B1, and
B1 has a subdirectory B2. And let's say we have a corruption such that a
directory entry in A1 suddenly points to B2. Well, fortunately in the ext2 file
system, for each directory we maintain a backward pointer to its actual parent. So
everyone can know the true parent-child relationship, which is B1 and B2. And
we can identify the corrupt pointer, and A2 will be put in lost and found. Okay?
So that's what an ideal FSCK should do.
>>: [inaudible] point, too?
>> Haryadi Gunawi: Yeah. But you cannot -- the whole idea is: what kinds of
pointers do you trust? Since A2 has just one pointer pointing to A1, and A1 doesn't
claim that A2 is his child, I mean, in today's E2FSCK you cannot claim that that's
the true parent-child relationship. Okay?
>>: [inaudible] sort of lost and found.
>> Haryadi Gunawi: Lost and found is just a directory underneath the root
directory. So if there is any directory or any file that is not reachable from the
root directory when FSCK is run, those files and directories will be put inside the
lost and found directory. All right. So what happens in E2FSCK is brutal,
because it selects A1 as the actual parent, because coincidentally A1 has a
lower inode number than B1. So although B2 claims that B1 is the actual parent,
E2FSCK doesn't care; it forces B2 to accept A1 as the actual parent, and B1 is
just a sad parent because he just lost his kid. Okay. And so this is pretty much a
kidnapping problem that happens in E2FSCK.
But the whole point is that E2FSCK does not use all the available information to
perform a correct repair. Yes?
>>: Do you have any sense for how common these sorts of failures are
[inaudible]?
>> Haryadi Gunawi: Good. Well, we don't have kind of failure statistics at kind
of this [inaudible] at the high level. But as an offline file system checker, you
need to deal -- you need to handle the worst case of failure. Right? And this is a
kind of obvious case where you should not perform a repair like this, because you
have enough information with which you can repair things correctly.
All right. So as a summary of the problems, you have seen the first two, which are
inconsistent repair and, again, consistent but not correct repair, because the
resulting file system doesn't match the original file system to the greatest extent
possible; and there are also other problems that I can talk about offline. But the
whole point is that when we have these problems, what we want to say is: let's fix
these problems. Let's just eliminate these problems.
But if you want to do that in the current framework, it wouldn't be so easy. And
in fact, one FSCK [inaudible] the same thing to me: you might introduce more
problems. And the reason for that is the complex implementations of
today's file system checkers. All checkers that I know of are still written in
low-level C code, which is hard to reason about. So as a result, the
resulting implementation is large and complex. For example, E2FSCK needs to
do like 150 checks in sixteen thousand lines of code, while the ext2 file system itself
is less than 10,000 lines of code. Another FSCK needs to do 240 checks in 22,000
lines of code. And if you look at the code, it's just basically hundreds of cluttered
check statements. Okay? And there are several bad implications. It's difficult to
combine available information on the disk to perform correct repairs. It's difficult to
ensure correct ordering of repairs. Basically it's hard to find missing checks or
incorrect checks in the current framework because of the cluttered code. Okay?
>>: Is there a reason that FSCK is a harder piece of code to write than -- because
16,000 lines of code is not substantial. NTFS is 350,000, which [inaudible]
sometimes works. Right? So is there some particular reason to believe that
FSCK is harder than other --
>> Haryadi Gunawi: Right. So what I'll show you is that the basic task of FSCK
is to do hundreds of cross-checks. That's what it needs to do essentially,
right? But in terms of the implementation, what I'll show you is that we do not need to
implement things in C code, which is basically what happens in today's file system
checkers, where it clutters all the data traffic, all the loading of the data and
the cross-checking, right? So basically I'll provide a better framework for doing this.
>>: [inaudible] something fundamental about checking the file system that makes it
more difficult to reason about than building the file system itself?
>> Haryadi Gunawi: Yes, well, in a sense I believe yes. Because it needs to find
any inconsistency, right? And for each inconsistency sometimes you want to do a
certain kind of repair. That's why, when people ask me why I don't write like a
modeling language that kind of tells what's the truth about the file system, the
problem is that even if I use a modeling language, usually it's hard to express the
repair. Usually what FSCK does is find an inconsistency, and for that specific
inconsistency you want to do a specific repair. So that's where the hard part is.
And it has to find any possible damage in your file systems. Yes?
>>: So [inaudible] file system code is [inaudible] and all these are either caused
by --
>> Haryadi Gunawi: Oh, no. For file system checkers you need to anticipate
that the file system code can be buggy, which means that basically at the end what
you have is inconsistency on the disk. Right.
>>: So in a way [inaudible] you have two specifications, right? One is the file
system code itself that implements a specification, and then the checker [inaudible]
implements [inaudible] the same file system. So do you check whether these
two match or --
>> Haryadi Gunawi: Well, I wouldn't [inaudible], but it would be an interesting step to
do. Uh-huh. All right. So bottom line, what happens -- yes?
>>: [inaudible] talked about formulating this problem as an optimization problem
because [inaudible] recover some pointers that from a node to another node I'll
say [inaudible] address in the [inaudible] given all the [inaudible]. And you can
look at the whole thing globally and try to find something optimal globally.
>> Haryadi Gunawi: I think that's a good point. I haven't explored that. Well, I
believe that with the framework I will show later, it's probably easier to do that,
because all the checks are basically sequenced in a better way
than in kind of cluttered code. Yes. Okay.
So bottom line, what happens out there is that this FSCK code is untouchable:
because this is very crucial recovery code, it's so hard to fix. If you fix this
and you introduce more problems, then you are in bad shape. Okay? So
obviously we need new solutions. I'll talk about the SQCK architecture and how we
can write simple checkers.
So the whole point is we need to build a better file system checker framework.
And again, the point here is that the original task of FSCK is already complex,
because it needs to cross-check many things. And we do not want to combine it
with another bad design, which will lead to complexity and unreliability. Okay.
So again, my principle is that complexity is the enemy of reliability: I want to
simplify the framework without losing any power or sacrificing any performance.
So the whole idea of SQCK is a robust file system checker that uses a
declarative query language such as SQL so we can write hundreds of checks
and repairs in a very clear and compact manner. The whole point: if you look at
the nature of a check, it is "please find an inconsistency in my file system," and if you
look at the nature of a query, it is pretty much the same thing: "please find something
in the database." Yes?
>>: Well, are you assuming you can't make any changes to the file system to
add additional check -- additional information as it's being generated to help you
later in FSCK?
>> Haryadi Gunawi: Well, okay -- no, you can do that. Because when you do --
I'll show you how you can use SQCK. Basically, first we load file system data into
the database tables. So during that particular -- I will tell you more in the next
slide. Okay?
>>: Okay.
>> Haryadi Gunawi: All right. Thanks. All right. So there are lots of benefits.
Again, the high level intent of the checks can be clearly specified in queries, as I will
show you soon. Basically, with queries you just write fewer lines of code. And
basically it's easy to cross-check and repair by combining many pieces of information,
because that's what the SQL query language was built, kind of from day one, to do.
Okay?
All right. So let's see how we can use SQCK. We take the file system image and
basically we load the file system metadata into the database tables. And if you want to
add information while you are reading the image, to help the checks, you can do
this in this particular phase. Okay.
And then the whole point is that, since all the information that we want to
cross-check and repair is stored in the database tables, we just write all these checks
and repairs in a declarative manner with a query language, okay? And if there are
any modifications, we will flush those modifications to the file system image so
that the resulting file system is consistent. Yes?
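As a sketch of this load phase -- the talk doesn't show SQCK's actual schema, so
the table layout and the exec_query()/exec_queryf() helpers below are invented
for illustration -- scanning the inode table into a database might look like this
(struct sqck_inode is the simplified sketch from earlier):

```c
/* Hypothetical loader sketch: scan the file system image and insert
 * each inode as a row. exec_query(), exec_queryf(), and the table
 * layout are invented names, not SQCK's actual code. */
void exec_query(const char *sql);
void exec_queryf(const char *fmt, ...);

void load_inodes(struct sqck_inode *table, int ninodes)
{
    exec_query("CREATE TABLE InodeTable "
               "(ino INT, mode INT, size INT, blk0 INT)");

    /* One row per inode; extra columns (e.g., which group the inode
     * came from) could be added here to help later checks. */
    for (int ino = 1; ino <= ninodes; ino++) {
        struct sqck_inode *i = &table[ino - 1];
        exec_queryf("INSERT INTO InodeTable VALUES (%d, %d, %u, %u)",
                    ino, i->mode, i->size, i->block[0]);
    }
}
```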
>>: It's going to be [inaudible].
>> Haryadi Gunawi: Yes. This is an in-memory database. Well, so far, what we
have -- I haven't really looked into the limitations. If we need backing storage,
how are we going to do that? But even in today's file systems, with E2FSCK, for example,
if it doesn't fit in memory, E2FSCK [inaudible] that it cannot run. So I think
that's kind of another design issue when we're designing file system checkers.
Good point.
>>: [inaudible].
>> Haryadi Gunawi: Sorry?
>>: You can't have a storage system [inaudible] property that you can't run
FSCK on it if it's [inaudible] its metadata doesn't fit [inaudible].
>> Haryadi Gunawi: Yeah. So in today's E2FSCK, what they do is
sometimes they build a summary, and then, if they find
inconsistency in the summary, they will read the metadata again. So again, that's
another design issue. Well, for me, since we just run a database, as
long as the database can use backing storage, I believe it's just
another instantiation of backing storage for this case. Okay? But --
>>: [inaudible] that you could safely write to while you have an inconsistent file --
>> Haryadi Gunawi: Yeah, yeah. So that's kind of the vulnerability, because
when we run this, the storage itself might be broken at that particular time.
>>: That's something that you cared about.
>> Haryadi Gunawi: I'm sorry?
>>: You might find out that your database just [inaudible] you care about, but
you didn't know because the file system was inconsistent when you [inaudible].
>>: He did say [inaudible].
[brief talking over].
>> Haryadi Gunawi: Okay. All right. So in the next couple of slides I'm going to
show you how we can write simple checks with this SQCK. Yes?
>>: [inaudible] the loader is pretty simple, right? It's pretty straightforward.
>> Haryadi Gunawi: Yes.
>>: One of your enemies is complexity, and the loader knows what's going on in the file
system, and you hope that step is very simple.
>> Haryadi Gunawi: Yes.
>>: Right?
>> Haryadi Gunawi: Uh-huh.
>>: So is that true?
>> Haryadi Gunawi: Yes, very true. That's very true. Yeah. If you know the file
system structures you just write the file system structures. And when I compare
SQCK with the original FSCK I only compare this part. I do not compare the
loading -- the scanner part because it's very easy. There's not much complex
logic there.
All right. So here's how we can write simple checks. So this is one check that
E2FSCK needs to do: it needs to find a block bitmap that is not located within its
block group; this is a very simple range check. And here's what you get with
E2FSCK: the core logic of the check is hidden in the implementation details.
But here's what you get with SQCK, and it's very simple: the query will just return
block bitmaps that do not reside within the start and end blocks of the group. Okay?
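A hedged sketch of what such a range-check query might look like; the table and
column names are invented, since the talk doesn't show SQCK's actual schema.

```c
/* Hypothetical version of the block-bitmap range check; table and
 * column names are invented for illustration. The query returns the
 * violations: bitmaps lying outside their own block group. */
const char *block_bitmap_out_of_group =
    "SELECT G.groupNum, G.blockBitmap    "
    "FROM   GroupDescTable G             "
    "WHERE  G.blockBitmap < G.groupStart "
    "   OR  G.blockBitmap > G.groupEnd   ";
```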
I'm going to show you kind of a little bit more complex example. This is again the
idea of trying to find the false parents: directory entries that point to a
subdirectory that already belongs to another parent.
So in this case, we need to cross-check all directory entries, and as we noted, this
is wrongly implemented in E2FSCK, which leads to the kidnapping problem. So
here's what you get in E2FSCK. Well, no one will understand this code unless
you wrote this code. And anyway, this is the wrong implementation. Okay? So
let's just throw this away and I can introduce you to a new query which
fixes the check. So we do three simple selections.
First, we scan all child pointers. In this case, we omit entries number
one and two, because entries number one and two are the dot and dot-dot entries. Those
two entries should be checked in another query. So after this first
selection we will have P saying that C is his child. And we will do a second
selection where we scan all parent pointers -- so we scan entry
number two, which is just the dot-dot entry. So after the second selection we have C
saying that P is his parent, so we can establish the true parent-child relationship
here. And then we do a third selection where we find F, which is not
equal to P, but F also claims that C is his child, and we will just return
information about the false parents. Yes?
>>: [inaudible] sort of code complexity, it seems like there's still a lot of stuff on
this slide, right, as compared to the other slide. And so you're just expressing it
in this SQL way, which you sort of -- you find in your experience is simpler to
revisit? Because it still seems like you have a lot of conditions also and --
>> Haryadi Gunawi: Right. But if you look, a check is basically expressed in this
one simple query. If you look at the original E2FSCK, if you want to find a
particular check, you find a kind of C code where you need to know where this data
structure was read before and where it's stored and everything like that. So
the whole point is that the data transfer and the logic of the check are very
cluttered. But here you don't care about the data transfer,
because the database community has done a great job defining SQL, right? Here you just
express the logic of the check and it will do the transfer all by itself.
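As a sketch, the three selections he describes for the false-parent check might
look like the following query over a hypothetical DirEntryTable(entryNum,
fromInode, toInode); the schema is invented for illustration.

```c
/* Hypothetical false-parent query. Entries 1 and 2 of each directory
 * are "." and "..", so child pointers have entryNum > 2, and the
 * backward (parent) pointer is entry number 2. */
const char *false_parent_query =
    "SELECT F.fromInode AS falseParent, C.fromInode AS child "
    "FROM   DirEntryTable P, DirEntryTable C, DirEntryTable F "
    "WHERE  P.entryNum > 2 AND P.toInode = C.fromInode "  /* 1: P claims C      */
    "  AND  C.entryNum = 2 AND C.toInode = P.fromInode "  /* 2: C's '..' is P   */
    "  AND  F.entryNum > 2 AND F.toInode = C.fromInode "  /* 3: F also claims C */
    "  AND  F.fromInode <> P.fromInode";
```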
>>: I'm just wondering how much [inaudible] you get back compared to the C code. I
mean, if you just sat down and --
>> Haryadi Gunawi: Right. The -- in terms of the logic of the code, I'll show you
later that we can write much fewer lines of code with SQL statements. Okay?
>>: And it seems like, sort of, I don't know, maybe with lambda calculus you
could prove things, so that you sort of don't have one test here that's almost
undoing some other test, in a way that I don't think you could do when writing it by
hand in C. I mean, as you go to -- you said 340 checks for one file system, I mean,
how do I know that test 37 isn't undoing what test 97 just fixed?
>> Haryadi Gunawi: Right. So there's another question how do you check the
checker. This is kind of [inaudible].
>>: Right.
>> Haryadi Gunawi: So what I've done is to simplify the framework so that
hopefully we can have a better world by verifying the checker rather than
verifying the C code. Okay. All right. So that's SQCK. I have developed
SQCK, and there are four axes of evaluation.
>>: But if all you did so far is collect, find problems [inaudible].
>> Haryadi Gunawi: I'll show you. I have fixed the existing problems
and I have even introduced new repairs.
>>: I thought you were getting into evaluation already, and I didn't understand
how you --
>> Haryadi Gunawi: Yes, flexibility and reliability. Okay. Thanks. Okay. So the four
axes of evaluation are simplicity, reliability, flexibility, and performance. So in
terms of simplicity, I have written E2FSCK in the form of 150 queries, for a total of
just 1,000 lines of SQL statements. And we need C code to combine
all the SQL statements, but that C code is very simple; it just combines all the
SQL statements. And you can compare that with the 16,000 lines of the original
implementation in C code.
And reliability: I don't claim that I have tested everything, but so far we have
injected hundreds of corruption scenarios and SQCK has passed all of them. But the
point is that, again, if you find a check or repair is missing, you just add a
query. If you find a check or repair is buggy, you just simply fix the query. Okay?
>>: [inaudible] lines of code were written in the [inaudible] the C code? Or did
you -- because you [inaudible].
>> Haryadi Gunawi: MySQL. I just used MySQL.
>>: So I don't know too much about this area, but you said that you injected
several different fault or failure examples into the file system. [inaudible] but I
don't find that surprising, since it's you who wrote the tests.
>> Haryadi Gunawi: Yeah. That's why I mentioned that I don't claim that I have
100 percent coverage. What I have done is inject these hundreds of corruption
scenarios, and if I found that my checker didn't fix one, I added a query. But I
agree with you; the whole point is that there will be another issue of how we
can get, like, 100 percent coverage to test any corruption in my file
system. But my point is in these two bullet points: if
you find that you have missing repairs or missing checks, you just add a query.
You do not need to add code to E2FSCK in C.
>>: Are there any problems that your implementation discovered that the original
C code would not detect?
>> Haryadi Gunawi: I will show you later. There are some kinds of checks that
are missing. Okay.
>>: So I had a related question to that. So you said one of the goals was to also
detect bugs in the operating system code with [inaudible] twice to [inaudible]. So
why not inject [inaudible] in the file system and then see if you're able to fix
them?
>> Haryadi Gunawi: Right. That kind of relates to the testing part. I haven't really
done that. But it should be doable, basically. But I mean, the whole point is that
bugs in the file system code -- you can see those as kind of corruptions
underneath the storage. At the end, from the FSCK perspective, the file
system is inconsistent. So you need to find any inconsistency in the file system
and repair it.
>>: That would be a further test of your --
>> Haryadi Gunawi: Right, right, right. So that would be kind of in this testing realm
that I haven't fully explored. Yes?
>>: Can you give us some intuition for how many tries it took you to get things
right? The real question is the expressiveness of the language: you know,
if you hadn't had this test infrastructure to keep running until getting it right, how
many times would it have taken you?
>> Haryadi Gunawi: Oh, [inaudible] so two months before that, I had borrowed a
MySQL book from the library, so that shows you how we can -- well, I cannot tell
that to the program committee. But that's kind of the whole point, how simply we
can define things with these queries.
>>: The intuition is the number of iterations that you went through to check
[inaudible].
>> Haryadi Gunawi: Oh, yeah, yeah, yeah.
>>: Did you get it the first time, or did it take two --
>> Haryadi Gunawi: Yeah, it takes two, three iterations. But you just focus on
this one very localized query, right? You do not need to handle other, like,
other -- it's not cluttered, basically.
>>: You're 90 percent you got it right the first time?
>> Haryadi Gunawi: Depends on the cross-check. I mean, I don't have the slide,
but there are cross-checks that you must do with multiple instances of multiple
different structures. For those particular cross-checks it's hard. But there are
some cross-checks that you can do across some fields within one
structure, and that's simple. You can do one iteration and you get it right. So it
depends on the type of cross-check that you do.
>>: If some corruption scenario is not being handled, is it [inaudible] to figure out
which [inaudible] to fix? Or, like, there are 150 queries, so is it --
>> Haryadi Gunawi: Right. So the next thing you can do -- this gets
into the whole idea, going back to how you check the checker, right? It would be
nice to have a formal model of what the checker should do, and compare the
model to the sequence of queries that I run.
>>: [inaudible] in my mind it looks like -- I mean, you have 150 queries, and you see
some corruption is not being fixed. How do you go back from that
observation to which [inaudible]?
>> Haryadi Gunawi: Right. So the checker, the model, is basically defined in
phases, okay? So in the first phase, for example, you check the super block, and you
check the group descriptor block. So pretty much just the very minimal structures
that you check.
So if your corruption basically deals with those few structures, then you handle it
in the first phase. But if your corruption relates to, like, a link [inaudible] or a
different, more complex check, usually you do that in a different phase. Right? So
you must get a sense of what the phases are within today's file system checkers,
and if you find a corruption that relates to a certain phase, you try to look at the
queries for that particular phase. Does that answer?
>>: What's the most complicated check in the 150 -- like, how many blocks or
inodes does a query have to check? Because in the example you showed, like a
[inaudible], right, A1, B1, B2, you only need the [inaudible]. What's the most
complicated -- does it need [inaudible] blocks or inodes [inaudible]?
>> Haryadi Gunawi: Yeah, well, that one is actually -- I mean, if you look at the C
code it's complex, but it turns out if you look, the queries are not that complex.
Sometimes you do, like, complex checks, like link counts. But the point is that
the complex checks are cross-checks that you must do across multiple
structures and across many multiple instances of --
>>: That's kind of like SQL anyway, right? I mean [inaudible].
>>: Since he's only halfway through his talk, can we wait a little bit and let him --
>> Haryadi Gunawi: Okay. In terms of flexibility, the point is that I have
improved E2FSCK by [inaudible] the wrong repairs and adding new repairs, so these are
the types of repairs that you cannot find in today's [inaudible]. But the whole
take-away point is that you can add these new repairs in a few lines of code.
Okay?
And performance, just quickly here. So we compare against E2FSCK on four different file
systems of different sizes. Each file system is half full. The Y axis shows
the runtime normalized to E2FSCK. So here's E2FSCK's time in
seconds, and here's SQCK's time relative to E2FSCK. The whole point is that SQCK
stays within 1.5 times the original E2FSCK performance, and there are many
optimizations that we can introduce.
>>: [inaudible].
>> Haryadi Gunawi: Right. So for this one, we have like 300,000 files, a mix of
small files and large files.
All right. So what we have seen is the problems in today's file system checkers
and how we can have a better world with SQCK. So I'm going to move on to the
file system part. Again, I'm going to show you the problems in today's file
systems; I'm just a co-author on this particular paper. It was led by
[inaudible], who is a researcher at Microsoft Silicon Valley.
But I'm going to solve these problems with I/O shepherding, and then we will
evaluate I/O shepherding. Okay? So let's see the problems that we found
together with [inaudible] and other colleagues here. So here basically we
[inaudible] reliability, and the way we do that is we inject read and write faults
for different block types. And when we inject these faults, we run different file
system operations. Okay. So the result that I'm going to show you is that in today's
file systems we have this problem of inconsistent, incomplete, scattered, and
inflexible policies. And I'll show you the progression of how we come up
with this conclusion.
Okay. So basically we want to measure the reliability of the read recovery policies,
so we run a bunch of workloads: open, read, write, and so on. And when we run
these workloads, we inject read failures on different block types, okay? So
basically all the gray boxes mean not applicable [inaudible], and we found four
recovery strategies in ext3's read recovery policies.
So let's just take one simple box here. So for this particular [inaudible], what we
did is we ran a [inaudible] workload, like cd and the path, and during this
[inaudible] workload we injected read failures on the indirect block. And the
policy that we observe is that the failure is detected and propagated to the
application, so we mark this with the horizontal bar that means propagate,
okay? But for the take-away point here, you can just focus on this red area. For the
same read failure on the indirect block, the ext3 file system takes different
recovery responses. Sometimes it propagates, sometimes it retries the operation,
sometimes it ignores the failure, and sometimes it stops the file system. Okay?
So that's how we can conclude that this is inconsistent.
And these are the write recovery policies. Just quickly, you can focus on all
these circles. They mean that all write failures happening when you run these
workloads, on all these block types, are being ignored by the file system. So if
this happens in your case, then your data or your file will be lost silently.
Okay? So this is what I call incomplete policies. Okay. So basically, how should
we deal with storage failures? Well, there are lots of techniques out there, and
they have performance and reliability trade-offs.
The point here that I want to make is that sometimes what you want to do is
deploy a particular set of policies. Let's say you want to add mirroring and
checksums in ext3, okay? In principle, we can just do that, and
what I'm going to show you is that if we do that in the current framework, it's not
so easy. And the reason for that is the scattered reliability code in the ext3
file system. Okay? So the X axis shows the source files in ext3 and the
Y axis shows the line numbers that contain storage failure management code in this file
system. So as you can see, basically there are hundreds of places. So if you
want to introduce new policies in this file system, you must do it, like, everywhere.
So that's why it's inflexible.
>>: What do you mean by reliability code?
>> Haryadi Gunawi: Well, basically the file system needs to handle read failures and
write failures, right? And the file system does that in many different locations;
what the file system tries to do is handle each fault at each I/O location. So that's why
everything is scattered. Does that answer your question? It's basically the handling of
read and write faults for different block types.
Okay. So we have this problem, and it's not only happening in ext3 but also
in [inaudible]. When I talk informally to people in the storage industry, they
also mention that these million dollar proprietary file systems have the same
problems, but of course we cannot really say much about that. So it means that
we need a new reliability framework for dealing with storage failures. Okay? So
I'm going to show you the idea of I/O shepherding: the goals and the architecture. So
the whole point with the shepherd is this. This is a layer underneath the file
system. It's pretty much integrated with the file system, and it locally takes care
of the reliability of the requests. The file system will send reads and writes, and they
will be intercepted by the shepherd. And we'll see later how we can add
reliability within the shepherd.
So there are three important goals. The first one: as you can see, the original
policies are very much scattered throughout the file system code. What we
want to do is localize everything within the shepherd, okay? Because the whole
idea is that with localized policies, we can have more correct, less buggy,
and much simpler reliability management. And probably you can do
verification within that layer, too.
The second thing is that, since everything is localized, it's very easy to achieve
flexibility. Okay? We can deploy different kinds of policies for different
environments or requirements. For example, you might want to add mirroring
to protect an archival volume, or you might want to add checksums to protect against
corruption of scientific data. But the whole point is different policies for different
requirements and/or environments. Okay? And the third goal is that we can
combine one or more basic policies to form more powerful policies. Okay.
So to achieve this [inaudible] yes?
>>: Can you [inaudible] what layer the shepherd is at? Does it think about
blocks, does it think --
>> Haryadi Gunawi: Blocks. Blocks, yes. But it's pretty much integrated with the
file system, because sometimes what you want to do within the shepherd is to
understand the block types within the file system. So you need to understand
the ext3 block types. If you want to deploy shepherding for Windows NTFS, for
example, you need to know the Windows NTFS block types within the shepherd.
Yes?
>>: So how does the low level subsystem know whether it's archival, or are you
assuming --
>> Haryadi Gunawi: I assume file system administrators. Yes. So basically I
assume that it is the file system administrators who will write these kinds of policies.
I mean, so far, when we have a file system, it's usually the file system
developers who write the policy, right, and file system administrators need to understand the
file system code in order to change the policy. But the whole point is that
within the shepherding layer we allow file system administrators to
compose different policies, because everything will be simple in this layer.
All right. So to achieve these goals, I'm going to show -- describe a little bit about
the four important components in the I/O shepherd: the policy
table, the policy code, the policy primitives, and the policy metadata. Okay? So basically we want to
build a new reliability framework in file systems, so we ask the first question:
how do we specify reliability policies? Because we do not want to introduce
another ad hoc solution. So basically I start from two facts.
The first fact is that file systems usually have different block types,
and different block types have different levels of importance.
For example, the super block might be more important than an inode block, or an inode
block is more important than a data block, okay? And the second fact is, again, that
different volumes usually require different reliability levels. If you have an
archival volume, usually you want more protection than for a temporary volume.
So what we need here is the ability to specify fine grained policies, and we can
achieve that with the policy table. So the first benefit of the policy table is that we can
deploy different policies for different block types. So, for example, here, if the
shepherd sees a write to the super block, we say we want to
execute this mirroring code, with three-way mirroring for the super block; but for an
inode or [inaudible] block we say that we want to add checksum and
parity. For data we do not want to add redundancy, so we just do a simple
retry. Okay?
The second benefit is that we allow file system administrators to
construct different policy tables for different volumes; again, for an archival volume you
can have a policy table that employs more protection than the policy table used in
a temporary volume. Okay?
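A minimal C sketch of such a per-block-type policy table; the type and policy
names are illustrative assumptions, not the actual I/O shepherd code.

```c
/* Illustrative policy table: one policy function per block type.
 * All names here are invented, not the actual I/O shepherd code. */
struct io_request;  /* block number, buffer, read/write flag, ... */
typedef int (*policy_fn)(struct io_request *req);

extern int mirror3_policy(struct io_request *);         /* 3-way mirror  */
extern int checksum_parity_policy(struct io_request *);
extern int retry_policy(struct io_request *);           /* no redundancy */

struct policy_table {
    policy_fn super_block;
    policy_fn inode_block;
    policy_fn data_block;
};

/* A protective table for an archival volume ... */
struct policy_table archival_table = {
    .super_block = mirror3_policy,
    .inode_block = checksum_parity_policy,
    .data_block  = checksum_parity_policy,
};

/* ... and a cheaper one for a temporary volume. */
struct policy_table temp_table = {
    .super_block = mirror3_policy,
    .inode_block = retry_policy,
    .data_block  = retry_policy,
};
```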
So the second question in building this architecture is how we can write simple
but powerful policies. In short, basically we provide lots of primitives that are
very reusable across different policies. The whole idea is that we try to hide the
complexity of handling these failures behind these primitives. And since that's
the case, the job of the policy writer, like the file system administrator, is
just to compose these primitives into a fully formed policy code, okay? So the
policy code in general should be very simple.
>>: Do they [inaudible] --
>> Haryadi Gunawi: C code, basically. But it's just simplified. Okay? All right.
So let's put it all together. So normally, if you have a request, you just send it to
the storage subsystem. With the shepherd, we can modify the request or add more
I/Os to add reliability. So for example, if we see a write to data block D, we
can specify a policy table that says that for data blocks we want to mirror the
data.
So every time the shepherd sees a write to a data block, it will call this policy code,
and the policy code should be simple; it just calls a bunch of primitives. So, for
example, it will look up the mirror map that is provided by the shepherd. So --
let's say this is a new data block; there is no replica yet. So the policy
code will call another primitive, back to the file system, asking for a new block for
the replica block, and then it will update the mirror map, indicating that D is
mirrored to R. And then we can write the code that sends these two blocks to the
disk.
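A sketch of that data-block mirroring policy, composed from hypothetical
shepherd primitives; the names lookup_mirror_map, fs_allocate_block,
update_mirror_map, and write_blocks are invented to match the walkthrough.

```c
/* Hypothetical mirroring policy code, composing shepherd primitives.
 * All names are invented to match the walkthrough above. */
typedef unsigned int block_t;
#define NO_BLOCK ((block_t)0)

struct io_request { block_t block; void *buffer; };

block_t lookup_mirror_map(block_t blk);          /* D -> R, or NO_BLOCK */
block_t fs_allocate_block(void);                 /* call back into FS   */
void    update_mirror_map(block_t d, block_t r);
int     write_blocks(const block_t *blks, int n, void *buf);

int mirror_write_policy(struct io_request *req)
{
    block_t replica = lookup_mirror_map(req->block);

    if (replica == NO_BLOCK) {
        /* New data block: ask the file system for a replica block
         * and record the mapping D -> R in the mirror map. */
        replica = fs_allocate_block();
        update_mirror_map(req->block, replica);
    }

    /* Send both the original and the replica to disk. */
    block_t targets[2] = { req->block, replica };
    return write_blocks(targets, 2, req->buffer);
}
```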
>>: This map is held [inaudible] within the file system?
>> Haryadi Gunawi: The mirror map?
>>: The map allocate function? Like, are you calling each --
>> Haryadi Gunawi: No, this one is calling back to the file system, because we
must always appear to --
>>: So the shepherd is linked into the file system apparently.
>> Haryadi Gunawi: Yes, yes, right. [inaudible]. And the mirror map will be
updated by a background primitive. So the whole point is that we can interpose on
I/Os and we can add reliability in a very simple way. Okay?
So in terms of implementation, the shepherd infrastructure is around 3,500 lines
of code, and much of this code is reusable for other file systems. Some of this
code is very much integrated with the file system, like the block type
allocations and everything. And in terms of integration into the operating
system, I needed to modify ext3, but so far the modification is just 900 lines of
code. But I do want to emphasize that the most challenging integration was
integrating I/O shepherding with the consistency management in ext3 --
in this case, the journaling layer.
And my apologies, I cannot show the problem here, because it needs 30 to 45 minutes
to establish the problem and the solution. But the whole point is that the
shepherd adds new data and metadata to add reliability, and that additional
data and metadata needs to be consistent in the presence of crashes. So when I
wanted to integrate that with the journaling, I found that there is a major flaw in
the journaling layer; that is, it cannot react to checkpoint failures. And I can
talk about that offline. But the whole point is that due to this major flaw,
you see that all write failures are being ignored in ext3, IBM JFS, and XFS.
That's because of this major flaw.
>>: So [inaudible] is this flaw present even without your I/O shepherding [inaudible]?
>> Haryadi Gunawi: The flaw -- if I want to formally define the flaw, it is that when
we look at individual block failures, the journaling cannot handle them. The
journaling layer only handles consistency in the presence of crashes. Basically,
the whole point of journaling is that you put your blocks in the journal area
first and then you checkpoint them to their final locations. When you checkpoint them to
the final locations, the current journaling layer says that you cannot
change a transaction that has been committed. So if you want to do
remapping or mirroring or something like that, where you change the metadata
that has been committed in the transaction, it's too late.
>>: So if I understand your answer, this flaw shows up any time you try to extend the
file system. Is that right?
>> Haryadi Gunawi: No. Even -- let's not take I/O shepherding;
let's say you just want to add a remapping policy in the ext3 file system. You cannot do that.
>>: Okay.
>> Haryadi Gunawi: You cannot do that and achieve consistency in the
presence of crashes, because of this major flaw.
>>: So [inaudible] pretty much analogous to logging in databases. So do you
have any idea what commercial databases do when they see write failures
when they're trying to do installs?
>> Haryadi Gunawi: So in databases they have this idea of compensating
transactions. I mean, the whole idea with databases is that they kind of log the
old data also, along with the new data. But in terms of dealing with individual
block failures, I haven't really looked into how database management systems
handle that.
>>: [inaudible] compensating transactions are for dealing with [inaudible] and things
that are higher level rather than media failure.
>> Haryadi Gunawi: Right. Yeah.
>>: You must have faced this problem, because it's exactly the same problem.
>>: [inaudible] already decided the transaction --
>>: If you committed the transaction, right, you've got the record in the log and
the checkpointer is running along, trying to install stuff from the log.
>> Haryadi Gunawi: Right.
>>: Into the volume and -- I mean, maybe it just blows up the volume. But that's
harsh.
>>: It pops up a message: buy a better disk.
>>: Well, I mean, yeah, we have the [inaudible] and you could have a delayed write back
fail, right, which is not very helpful.
>> Haryadi Gunawi: I think the idea about this is that when we bring
individual block failures into the picture, and we look at how today's
systems try to handle that, it's not really complete in the sense --
>>: [inaudible].
>> Haryadi Gunawi: Yeah, I don't know.
>>: Usually they're ahead of us, so --
>> Haryadi Gunawi: Yeah. So, yeah, so one of the [inaudible] contributions is that I
added this whole idea of chained transactions, and I can talk about that more offline.
>>: So I have a question [inaudible].
>> Haryadi Gunawi: I'm sorry?
>>: Not [inaudible].
>> Haryadi Gunawi: Pretty much it's like -- it's simple changes. Like, for
example, the shepherd needs to call back to the file system for giving a new
block to --
>>: [inaudible] how many files for these 900 lines [inaudible]? Was it like five lines in
every file, or was it mostly all in a couple of places?
>> Haryadi Gunawi: No, mostly localized. Like for example, allocation policies.
The reason why I needed to introduce more code is that the original file system code
is so messy that I could not reuse the original file system code; I took
part of the code and wrote this new code. Yup. And part of this is the
consistency.
>>: [inaudible] mentally add up the sizes of all the files in ext3 that you had, I
came up with about 10,000. Is that how big it is?
>> Haryadi Gunawi: The ext3 file system? 20,000.
>>: That's what I thought.
>> Haryadi Gunawi: Yeah.
>>: Okay. So you're still adding [inaudible].
>> Haryadi Gunawi: Yeah. Well, part of this is this new chained transaction,
right, and it's just one of the things that I must do to preserve consistency.
>>: Does the 20,000 include JBD [inaudible]?
>> Haryadi Gunawi: Yes. The full [inaudible]
>>: [inaudible] transactions that were [inaudible] because it seems like --
>> Haryadi Gunawi: Yeah, yeah, yeah, you can. All right. So let's [inaudible] with I/O shepherding briefly here. I'll show you how we can achieve flexibility, [inaudible], and simplicity. So the whole point again: these are the original [inaudible] policies, which can be summarized in this picture. Now let's say we want to throw away all these policies and add a simple retry policy. We just install a policy table that looks like this: for a read to any block type, it will call this policy code. The code reads the block; if the read fails, it retries; and if the retries keep failing, it just returns the failure. Okay?
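To make the retry idea concrete, here is a minimal C sketch of such a policy. The shep_block type, the shepherd_read stub, and the constants are hypothetical illustrations of the idea, not the actual I/O shepherding API:

    /* Hedged sketch of a shepherd retry policy; all names are invented. */
    #include <errno.h>
    #include <stdio.h>

    #define MAX_RETRIES 3

    struct shep_block { int id; int fail_count; };

    /* Stub low-level read that fails a couple of times, for demonstration. */
    static int shepherd_read(struct shep_block *b)
    {
        return (b->fail_count-- > 0) ? -EIO : 0;
    }

    /* Policy code installed for reads of any block type: retry, and if
       the retries keep failing, just return the failure to the caller. */
    static int retry_read_policy(struct shep_block *b)
    {
        int attempt;
        for (attempt = 0; attempt < MAX_RETRIES; attempt++)
            if (shepherd_read(b) == 0)
                return 0;    /* read eventually succeeded */
        return -EIO;         /* retries exhausted: propagate the failure */
    }

    int main(void)
    {
        struct shep_block b = { 7, 2 };  /* fails twice, then succeeds */
        printf("retry policy returned %d\n", retry_read_policy(&b));
        return 0;
    }

The point of the design is that the entire recovery behavior lives in one small policy function, rather than being scattered through the file system code.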
So I ran all the workloads again and injected the faults again, and basically, as you can see, we get a different kind of recovery model. Okay. And I can show you how we can do fine-grained policies. Say we want to change the ext3 [inaudible] policies to custom policies, like a combination of different policies depending on the block type. The takeaway point here is that instead of the overall behavior, you can just focus on the recovery: if it is a write failure on a data block, it does what I specify in the policy table, which is retry and propagate, and the same for journal and metadata blocks. Okay?
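A fine-grained table like the one just described might be sketched as follows, again with hypothetical names; the real shepherd's table and policy primitives are richer than these stubs:

    /* Hedged sketch: one policy per (block type, operation); the shepherd
       consults this table on every I/O instead of scattering recovery
       code through the file system. All names are illustrative. */
    #include <stdio.h>

    enum blk_type { BLK_DATA, BLK_JOURNAL, BLK_METADATA, BLK_NTYPES };
    enum blk_op   { OP_READ, OP_WRITE, OP_NOPS };

    struct shep_block { int id; };
    typedef int (*policy_fn)(struct shep_block *);

    /* Stub policies; real ones would retry, mirror, checksum, and so on. */
    static int retry_then_propagate(struct shep_block *b) { (void)b; return 0; }
    static int mirror_write(struct shep_block *b)         { (void)b; return 0; }

    static policy_fn policy_table[BLK_NTYPES][OP_NOPS] = {
        [BLK_DATA]     = { retry_then_propagate, retry_then_propagate },
        [BLK_JOURNAL]  = { retry_then_propagate, mirror_write },
        [BLK_METADATA] = { retry_then_propagate, mirror_write },
    };

    /* Every I/O is dispatched through the table. */
    static int dispatch(enum blk_type t, enum blk_op op, struct shep_block *b)
    {
        return policy_table[t][op](b);
    }

    int main(void)
    {
        struct shep_block b = { 42 };
        printf("data write -> %d\n", dispatch(BLK_DATA, OP_WRITE, &b));
        return 0;
    }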
And writing the policies is overall simple: so far I've written eight policies, and the most complex one can be written in 80 lines of code. Okay. All right. So let me just briefly describe -- yes?
>>: [inaudible] how is this philosophically different from something like software RAID, where I would explicitly try to do something like -- doing something like mirroring in software, right, at the block level? That has been done before, so --
>> Haryadi Gunawi: Right. It goes back to that slide that I kind of skimmed a little bit fast: there are lots of techniques out there, and lots of techniques are not about RAID. Sometimes you want to do a checksum, and sometimes you want to do a parental checksum -- you want to store the checksum not next to the block but in some other area, right? Sometimes you want to do read-after-write just to make sure that your writes are not lost. So basically the point is that there are different kinds of policies, and the shepherd just allows you to deploy those kinds of policies.
>>: Okay. But is one of your other [inaudible] results that a better software architecture is to put all of these at the block level rather than scatter them throughout the file system level?
>> Haryadi Gunawi: Yes.
>>: To move many of these policies that you found scattered throughout.
>> Haryadi Gunawi: Yes, policies that deal with the block level --
>>: They need to go lower down? Okay. And this was not being done in something like the ext3 file system?
>> Haryadi Gunawi: No.
>>: So [inaudible] how is it that [inaudible] you have a set of aspect [inaudible].
Looks like you are taking [inaudible] recovery aspects of [inaudible].
>> Haryadi Gunawi: Good question. Honestly, I haven't really explored it. What I have read is one paper in which some researchers mention that AOP doesn't work for failure handling, and I've talked to one professor at Northwestern who says that when you do AOP -- when we talk about failure handling and aspect orientation -- there are too many cross-cutting concerns. But to be honest, I don't know all the details. People who have looked into AOP and failure handling say that it does not really [inaudible], okay? But I'd love to talk about that later if you want. Okay?
All right. So I will not give you too many technical details. Again, I just have a couple of pictures that I want to show, so you can just sit back and relax. The first one is static analysis of error propagation. The problem definition is simple: to do one operation you call a bunch of functions, and these functions call a bunch of other functions as well. If there is a low-level failure, an error code such as EIO -- which stands for I/O error -- is returned, and this error code must be propagated as long as the error has not been handled, or at least there should be a printk statement, for example. At the very least you should not have silent failures.

But as you can see, these calls basically ignore the error codes returned by the called functions, okay? This is what we call bad calls, because failures are being ignored. So even though there are failures, they look like [inaudible] success to the caller. Okay? Which is bad.
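To make the notion of a bad call concrete, here is a small self-contained C illustration; read_block and the two callers are invented for this example and are not actual kernel code:

    /* Hedged illustration of a "bad call": the caller drops the error
       code, so a low-level EIO never propagates and the failure is silent. */
    #include <errno.h>
    #include <stdio.h>

    /* Pretend low-level read that fails. */
    static int read_block(void)
    {
        return -EIO;                 /* low-level I/O failure */
    }

    /* Bad call: return value ignored, so the caller sees apparent success. */
    static void bad_caller(void)
    {
        read_block();                /* failure silently dropped */
    }

    /* Good call: check the error and propagate it upward. */
    static int good_caller(void)
    {
        int err = read_block();
        if (err)
            return err;              /* propagate EIO to our caller */
        return 0;
    }

    int main(void)
    {
        bad_caller();                                 /* nothing reported */
        printf("good_caller -> %d\n", good_caller()); /* -5 (EIO) on Linux */
        return 0;
    }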
So, to understand the magnitude of this problem, I wrote this static analysis, and here's the result for the ext3 file system. All these nodes represent functions, all the edges represent function calls, error codes propagate upward, and the functions and function calls that appear in this [inaudible] are just the ones touched by the error code. So in reality there are many more functions and function calls, okay?

Again, you can focus on the [inaudible], because those are the places where failures are being ignored. So here's our [inaudible]: pretty much 35 places where failures are ignored. IBM JFS, 61; the NFS client, 54; and, behold, XFS, 105 places, okay -- and this is the most complex file system. Okay. So when I did the [inaudible] analysis, basically what I found is that this is not a corner case; there's somewhat of a pattern. For example, write failures are ignored much more often than read failures. Okay? So I believe this is another hint of these kinds of reliability design issues in today's file systems. Maybe it's hard to recover from failure; maybe it's hard to roll back in the middle of an operation. And you can look at the comments that I will show you later that kind of hint at all these problems. So there are many questions still to be answered in this particular research.
>>: [inaudible] to have more [inaudible].
>> Haryadi Gunawi: Because there are silent failures. So there are failures, but the failures are not handled, and they're not even printed.
>>: So but how do you know it's not just [inaudible].
>> Haryadi Gunawi: Oh, for some -- I mean, going back to the file system research, we see that write failures really are being ignored. When we look at the I/O operations sent to the file system, we do not see any further I/O, so it means that if the write fails, too bad. Okay.
All right. I've also done a little bit of work on DBMSs and data corruption. The whole idea is that the file system is not the only system that manages data in the storage subsystem; a DBMS also stores its own internal metadata structures. So basically we injected corruptions into MySQL on these data structures, and we wanted to see whether MySQL could tolerate such failures. It turns out that the results are damaging: the server crashes. Sometimes wrong results are returned to the users, which leads to security problems. Sometimes records are lost. And I think the more ridiculous part is that MySQL also has a repair utility, and this repair utility ignores the [inaudible] -- sometimes it even corrupts metadata further. So it also crashes when it tries to do the repair. Okay?
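As a rough illustration of this kind of fault injection, a minimal sketch might flip one byte of on-disk DBMS metadata and then observe how the server and its repair utility react. The path and offset below are made up for illustration, and the real experiments were type-aware rather than arbitrary:

    /* Hedged sketch: flip one byte at a chosen offset in a (hypothetical)
       MyISAM index file, then restart MySQL and watch its behavior.
       Path and offset are invented; do not run against real data. */
    #include <stdio.h>

    int main(void)
    {
        const char *path = "/var/lib/mysql/test/t1.MYI"; /* hypothetical */
        long offset = 0x10;                    /* hypothetical metadata field */
        FILE *f = fopen(path, "r+b");
        int c;

        if (!f) { perror("fopen"); return 1; }
        if (fseek(f, offset, SEEK_SET) != 0 || (c = fgetc(f)) == EOF) {
            fclose(f);
            return 1;
        }
        fseek(f, offset, SEEK_SET);  /* reposition between read and write */
        fputc(c ^ 0xFF, f);          /* flip every bit of one metadata byte */
        fclose(f);
        return 0;
    }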
So --
>>: Did you try it with a [inaudible] kind of an academic tool?
>> Haryadi Gunawi: MySQL and Postgres; Postgres has pretty much the same problems. Well, Jeff -- no, he didn't say that they [inaudible].
>>: [inaudible] commercial.
>>: You should look at DB2 because that one's been subsumed.
>> Haryadi Gunawi: Yeah. Jeff actually told me that he can run this on DB2, but he cannot tell me the results anyway. So.
>>: [inaudible]. A license agreement might --
>> Haryadi Gunawi: Okay.
>> Haryadi Gunawi: Again, I've done some work on constructing kind of [inaudible] -- the whole idea is that I develop techniques and policies without looking at a single line of source code, and there are many benefits we can get from the results. I've also done a little bit of work on deploying network services at the user level. The whole point is that there are lots of TCP variants out there, but you must convince the operating system developers to deploy your TCP in the kernel. So the question for us is how we can expose information and safe control to the applications so that we can deploy network services at the user level, okay, and I'd love to talk about this offline if you want.
Let me just conclude with some future work. I'm going to show you some comments that I found in kind of problematic cases in file systems. In ext3: "There's no way of reporting error to user space, so let's just ignore it." In XFS: "Just ignore errors at this point. There's nothing we can do except to try to keep going." "Should we pass any errors back?" And in the SCSI driver: "To do: I think we need to handle failure." Okay. So I think the whole point is that there are many reliability problems in today's file systems, and even in the storage drivers. That's why my roadmap is about the need to measure reliability first and then to build powerful, reliable file systems with simplicity in their design.
What I've done, pretty much in one sentence, is to make today's file systems, including file system checkers, more reliable by removing unnecessary complexity in their design, but without sacrificing power and performance.

So again, with the offline file system checker, I've shown how we can simplify the construction of file system checkers with the use of the [inaudible] query language, and for file systems, how we can simplify reliability management with the shepherding layer. Okay? In terms of my future work, my vision is twofold. The first one is to build highly reliable systems, and I believe there are three important requirements.
The first one: I believe that failure management should be revisited. I found a major flaw in today's journaling file systems. I still found flaws in error code propagation in today's file systems, and there are still many questions to be answered. And when I talk to some systems [inaudible], they admit that they are aware of some of these problems, but they just let them go; they do not have a real solution. So at least on my part, there are lots of results that I can use as a database of problems to [inaudible] design issues in file system reliability.
The second direction is about simplifying systems, because I believe that tomorrow's systems will be much more complex and much larger than today's systems. And if we keep writing this large and important software in a low-level approach, it might be hard to manage the failures. So it's a great challenge to come up with high-level approaches to describe how large systems should operate. And maybe, if we can come up with a good high-level approach, we can verify and formalize the reliability of these large systems. Okay?
And the second vision is to build highly available systems, simply because I believe that reliability cannot stand alone. If you have a reliable system but low availability or low performance, most people will not use it. And this is a lesson I learned from the search industry: there was one case where they needed to take the file system offline for four hours just to cross-check a database of data, and the users were really mad. So there is a lot of work that can be done in this area. One kind of short-term project is how we can build fast online repair as part of the file system itself, removing the need for an offline checker. Okay?
Let me just briefly describe the scope of my future work. I love to work across system areas. In my research you have seen how I steal ideas from other areas, [inaudible] and databases. Again, I'll always [inaudible] large critical systems beyond file systems. I've done a little work on DBMSs; it will be interesting to look at other systems, like distributed systems and cloud computing, as well.
And again, I really look forward to the opportunity to evaluate these new trends. Storage [inaudible] is one example; I want to look at how systems are built on top of these new drives. And it's just fascinating that today's systems are not just about one machine or one operating system. So I also look forward to the opportunity to look into how to deal with cloud computing. Okay?
So let me just conclude with this quote from Tony Hoare. He said that the price of reliability is the pursuit of the utmost simplicity [inaudible], but that doesn't mean that today we should remove the features that we have in today's systems. We must accept that tomorrow's systems will have more features and will be larger than today's systems. So that's why in my research I basically try to solve this new challenge: how to build large, reliable systems with simplicity in their design. Okay? So that's it.

I'll take more questions now.
>>: What about how common partial failures are?
>> Haryadi Gunawi: Okay.
>>: And so did those measurement studies that you talked about at the beginning document that partial failures are more common than whole-disk failures?
>>: You mean intermittent failures?
>>: Like [inaudible].
>>: Well, for whatever class of partial errors is relevant to this talk, right? Either intermittent failures or corrupted data -- either things that get fixed by retry or things that were data corruption on the disk.
>> Haryadi Gunawi: It happens all the time, and it depends on the -- it turns out it depends on the type of storage also. If you buy SATA drives, which are cheaper, they find that they have more partial failures than SCSI drives. And depending on the kind of infant mortality, if you deploy the drives within the first year or two, sometimes you see more partial failures than if the disk has been around for three or four years.

And there's also a correlation: if you see a latent sector failure, usually you'll find more sector failures on that disk because, I mean, overall the disk has just gone bad, right? So things just happen.
>>: But how does it compare to [inaudible].
>> Haryadi Gunawi: To all this --
>>: [inaudible].
>> Haryadi Gunawi: Right. I mean, I can look at the numbers, but I don't have them off the top of my head. But the whole message from the storage industry is that when they look at these things, they need to build new stuff to deal with these partial failures.
>>: [inaudible] failures [inaudible] but not enough [inaudible] these problems are
real serious problems.
[brief talking over].
>>: I'm trying to understand what is unique about file systems. People have tried to make, you know, other systems simple, and there are similar approaches of using [inaudible] query languages and [inaudible] independent components. What do you think is -- what do you think the unique challenges are there in the file system where there are --
>> Haryadi Gunawi: The main --
>>: [inaudible] approaches.
>> Haryadi Gunawi: The most challenging part is that it's stateful. What you store on the disk is non-volatile. Compare it to networking: networking has done a great job in dealing with [inaudible] and everything, but everything there is on the fly. I can argue that with a network, pretty much you send packets, and if one doesn't come back, you basically retry. But with a file system, maybe you do not look at that data for a year, yet a year from now you expect the data to still be there on the disk, right? So that's, I think, the most unique challenge in file systems: the whole idea that they are stateful systems.
>>: [inaudible].
>> Haryadi Gunawi: Right. So, I mean, agreed. When we look at P2P, then the file system [inaudible] comes into that picture.
>>: [inaudible] in the file system [inaudible].
>>: So if you look at cloud-based storage, [inaudible] data to some service provider, and there are, you know, service-level requirements that say [inaudible] guarantee a certain level. So in that context, do you think file system [inaudible] is too [inaudible] for the end user? Because I can just, you know, [inaudible] this service provider and not have to worry about this thing.
>> Haryadi Gunawi: So let me just -- just let me know if I get this correct or not. Even when we talk about cloud computing, one of the challenging parts in cloud computing today, as stated by a Yahoo cloud architect and many other architects, is data management: how do you deal with failures at a large scale? Nobody has looked into failures at a scale as large as a cloud computing framework; what we have looked at is single-drive failures and so on. So when we look at large-scale failures, the whole principle of reliability comes into the picture also: how we should deal with replication and data management under such large-scale failures. But for the end user, I agree; in the end there will be a certain reliability guarantee that the infrastructure provides, and the user -- I mean, usually users are lazy, in the sense that you do not want to push these details onto the users. So it will be -- yeah, my research is about giving more power to these infrastructure developers to build more flexible and more powerful policies.
>>: I want to [inaudible] -- I still -- again, I'm also not the [inaudible] file systems, but it seems to me that disk technology has basically been out for a long time, and people [inaudible] reliability 10 years ago, and I don't have a sense that disks are somehow less reliable today than they used to be. And my question is kind of -- [inaudible] do you think we're going to be working on reliability for the disk 10 years from now? Like, are we making progress on this problem, or is it -- I mean --
>> Haryadi Gunawi: At the end of the story, if you look at the disk, for example, the density of the disk becomes more compact, right? You have lots of bits per square inch of your drive. And if you look at some articles, they mention that when you want to build firmware for those particular drives, it's very hard, it's very complex, because you deal with very tiny little things. So that's where [inaudible] comes in: when you have more complex hardware to deal with, usually the firmware is not as perfect as you want. That's why -- I mean, originally it was just anecdote, but if you look at articles starting about two years ago, people describe how this drive firmware can be [inaudible] loss. And that's not just because of the media itself but because of the firmware.

So it really depends on the technology of the full storage subsystem; it's not just the hardware itself.
>>: It's actually tragic, because we didn't understand how disks failed until the last five years: nobody had 100,000 disks sitting around to study to see how they [inaudible]. Your disk has a million lines of source code, and they just keep stuffing more functionality into it because the chips keep getting bigger.
>>: So the answer is that nobody looked 10 years ago to see if [inaudible] were doing stuff like this, so we don't know the answer to your question, but there's anecdotal evidence of databases finding [inaudible] writes and other behaviors that couldn't be explained by a correctly functioning stack. So maybe the disks were broken back then, and maybe it was the software and the database, and probably it was all [inaudible].
>>: It seems like if everything else keeps getting more reliable, then disks will become more and more the least reliable thing. They become the longest pole in the tent over time. So you sort of have to keep up with the rest of the stack.
>>: So I have a question, which kind of [inaudible] what some other people asked: lack of error handling, or inappropriate handling of errors, is a common problem in [inaudible] every code base, not just disk code bases. What have you learned that could be [inaudible]? I mean, let's say you decide tomorrow you're going to start looking at [inaudible] compilers or kernels or [inaudible]. What have you learned through the use of -- or is everything you learned -- everything you figured out, all the tricks that you figured out, are they specific to this?
>> Haryadi Gunawi: I guess the tricks, yes. But the principle, again, very much extends. Just take one example from cloud computing, okay? There's this work by Jeff Walter [phonetic] from UCSD, and he mentioned that if you look at the code of this kind of cloud computing software, the service management part and the data management part are very cluttered together. So if you want to express different kinds of policies for different kinds of data management, you cannot do that in that particular software. So what they do is decouple those two, and how you decouple those two will be specific to the problems you want to solve. So pretty much I would say the principle can be extended, but for the details, I believe you just need to find your own tricks, yeah.
>> Ed Nightingale: All right. Thank you.
>> Haryadi Gunawi: Thank you.
[applause]