>> Chris Hawblitzel: All right. Well, it's my pleasure to welcome Mona
Attariyan here. Mona is getting her Ph.D. at the University of Michigan, working under Jason Flinn. And many of you probably know Mona already from her
recent internship here. So I'll let Mona go ahead and start her talk and tell
us about troubleshooting and information flow.
>> Mona Attariyan: Okay. Thank you very much. It's a pleasure to be here.
It's great to see all these familiar faces. So today, I'm going to talk about
software configuration troubleshooting, and I'm going to tell you how information flow analysis may be used to improve this problem.
So software systems are very complex. I want my software to run faster, to give me more features. Most importantly, I want to be able to personalize my software. So we're constantly pushing our software to be faster and bigger and better, and that has made our software fundamentally complex.
So now the problem is that when something goes wrong, the troubleshooting
becomes very difficult. So the troubleshooting, it's very time consuming.
It's very tedious. It usually requires a lot of expertise and it's also very
costly to corporations. So let's see what causes software to have problems in
the first place.
So here, I'll show you a study that was published in 1985. It's the classic study by Gray, and he looked at the root causes of outages in
a commercial system. So I want to draw your attention to this big red part.
About 42% of the outages were due to administration problems and that's
basically configuration maintenance.
So 26 years later, this is another study that was published just last year, SOSP 2011. And again, they looked at the root causes of severe problems reported by the customers of a company that provides storage systems, and again, you can see that the biggest root cause is configuration, about 31%. So there are many other studies during these years that actually suggest the same
idea, that misconfigurations are the dominant cause of problems in deployed
systems. And these are severe problems. These are problems that lead to
performance degradation. They leave your system partially or fully
unavailable.
Let me give you two more stories on the impact of misconfiguration problems.
Facebook went down a couple years ago, went down for about two and a half
hours, and it wasn't reachable to millions of users. So many people didn't
know how to waste their time any more. The problem turned out to be an
incorrect configuration value that got propagated.
Another story. The entire .se domain for the country of Sweden went down for about
one hour. It affected thousands of hosts and millions of users and the problem
turned out to be a DNS misconfiguration. So these kind of problems happen a
lot in software systems. And when they happen, they're very difficult to fix,
and as I mentioned, they're also very costly. So reports show that, for
instance, technical support contributes about 17% of the total cost of
ownership for today's desktop computers, and the majority of that is just troubleshooting configuration problems. And there are many other studies that
suggest the same result.
So in the past couple of years, I've been focusing on the problem of improving
configuration troubleshooting. I broke it down into several different projects and have developed several different tools. I worked on a project called AutoBash,
it was published in SOSP '07, and it basically provides a set of tools to the
users to help them fix their configuration problems more easily.
I developed a tool called SigConf, that diagnoses misconfigurations using a bug
database, and my most recent tools are ConfAid and X-ray, and they both
diagnose misconfigurations that come from configuration files. ConfAid does
that for misconfigurations that lead to failures and incorrect output and X-ray
does that for performance misconfigurations. And I'm going to talk about these
two today.
All right. So the goal of my research is to help two types of users. First is
end users who might just be having a problem with an application on their
personal computer. And the second is the administrators who might be in charge
of maintaining the system. These users don't necessarily -- you know, necessarily have access to the source code of the application, or they might
just simply not be interested in looking at the source code, or might not have
the expertise to look at the source code.
So what happens is when they have a problem, they usually try different things
that they know, try to fix a problem on their own, and if that doesn't help,
the next step is usually to go online, look at the forums and look at the
manuals and try to see if other people have similar problems. Basically, using a trial and error process, they would just try a bunch of different things that people suggested and see if that helps, if that fixes the problem. Otherwise,
they'll try something else.
I personally find this to be very frustrating and very tedious, and I think
this is why people hate computers so much, because when something goes wrong,
it's just so difficult to fix it.
So wouldn't it be great if we had a fix-it button? And what it does is that
when your program's not working, you just say fix it, and it would magically
start working. So we're not there yet. I'm not going to give you that. But
we can still do much better than we are doing right now.
So how about I give you an easy button? And what it does is that when your
program's not working, it would give you a list of potential root causes. And
what if the first couple of root causes that it gives you are actually accurate and actually correct? That's exactly what ConfAid and
X-ray do. ConfAid gives you this list for misconfigurations that lead to
failures. And X-ray does that for misconfigurations that lead to performance
problems. Yes?
>>: Is this equivalent to when you type the problem into a search engine and
you get a list of root causes?
>> Mona Attariyan: Yes, we have to go through, and most of the time it's, you know, you read this, it's completely incorrect. You try it, your system is even worse than what it was --
>>: This can be incorrect too, right?
>> Mona Attariyan: I'll show you how we ranked them.
>>: Have you ranked them statistically or --
>> Mona Attariyan: It's better than Google result.
>>: Are you going to prove that?
>> Mona Attariyan: Practically.
>>: Okay.
>> Mona Attariyan: But the thing is that, you know, the problem with searching for a problem like that is that it's all, you know, it's all in English. You
basically come up with a description of your problem, and then hopefully
someone else described the problem the same way that you did and hopefully it
will show up. You know, most of the time you just read the forum to the end and nobody really came up with a solution.
So it's a lot, it's a lot more difficult in that sense. What we try to do is to, you know, give you: okay, this one and this one. Go look at these and these, and hopefully those are the answers.
So okay. So let me tell you a little bit about the core idea behind ConfAid
and X-ray. Here is our observation. We have a program. It reads from
configuration sources. It does something, and then it generates an outcome
that's incorrect. So the program is like a black box; we don't know how it uses the configuration sources to generate the outcome.
However, the application itself knows how it got there. So if we open up this
black box and if we analyze how the application is running -- yes, go ahead.
>>: Typically, programs have other dependencies besides their own
configuration. Are you going to take those into account or not?
>> Mona Attariyan: What do you mean? Do you mean like the input or --
>>: I mean things like DNS on the computer, for instance, or interaction -- somebody using -- a program using some other libraries and those libraries not being compatible. So it's not that program's configuration. It's compatibility issues between programs.
>> Mona Attariyan: That's a good point. So in general, most of that actually goes under the general term of configuration. Like using different libraries, maybe.
So that is part of the configuration. I don't specifically look at the problem
of not, you know, using the wrong [indiscernible] or not being compatible. But
in general, that falls into the category of configuration. So we just say, okay, this is the configuration of the system and this is the actual input of the system.
So these are the main two inputs that go into the system. I look specifically
at the configuration of the application, but that's definitely another source
of, I'd say, configuration again.
Okay. So if we open up this black box and we look at how the application runs,
how it uses these sources and how it generates -- uses these sources to
generate the outcome, then we might be able to infer which one of these
configuration sources are causing the outcome to be incorrect. So basically,
you know, if you analyze the program as it runs, then we might be able to infer
what's going on.
So that's the main idea behind both ConfAid and X-ray. Okay. So this is the
outline of my talk. I'm going to first talk about ConfAid. I'm going to give
you some details on the algorithms that we use to do the analysis that I just
described. And I'm going to talk about some of the heuristics we use to make
it more practical, and then I'm going to switch gears and talk about X-ray and
then I'll spend some time and talk about the research directions I would like
to pursue in the future and I'll conclude.
So let's say we have an application. It reads something from a configuration file [indiscernible] and an error happens. We would like to know what
parameter in the configuration file is causing the error to happen at the end.
So let's take a look at a very simple piece of code. The application reads the token.
The token is equal to ExecCGI, for instance, in this case. Therefore, the
variable ExecCGI is going to be set to 1. Later on, because the variable
ExecCGI is equal to 1, the error happens. So as you can see, there are causal relationships in the execution that basically connect the ExecCGI variable to the
error that's happening at the end. And ConfAid is basically interested in
these kind of causal relationships.
And we use taint tracking, which is a common technique used in security to find
these causal relationships. So here's what we do. Whenever a token, for
instance, here ExecCGI, is read into the application, we assign a specific
taint or mark to that. And then as the application runs, we use data flow and
control flow to propagate this taint. When we get to the error, we can use
this taint to link it back to the configuration token that caused it.
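To make this concrete, here is a minimal sketch in Python of the causal chain just described. All names are illustrative, and ConfAid itself performs this analysis on binaries via dynamic instrumentation rather than on source code.

```python
# Minimal taint-tracking sketch: follow a config token to an error.
# Illustrative only; ConfAid works on binaries, not source.

taint = {}  # variable name -> set of config tokens it depends on

def read_config_token(name, value):
    """Reading a token assigns it a unique taint mark."""
    taint[name] = {name}
    return value

# 1. The application reads the token "ExecCGI" from its config file.
exec_cgi = read_config_token("ExecCGI", 1)

# 2. Data flow: a variable computed from exec_cgi inherits its taint.
option_flags = exec_cgi
taint["option_flags"] = taint["ExecCGI"]

# 3. Control flow: an error raised under a tainted branch condition
#    is linked back to the tokens that taint the condition.
if option_flags == 1:
    print("error! root-cause candidates:", taint["option_flags"])
```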
So the goal here is to avoid the error that you're seeing at the end and also
not lead to new errors. So the goal is to find a successful path. So here is
a simple example.
We have an application. It has three if conditionals. Each of them is
dependent on a configuration parameter. And we have an error that's happening
at the end. We would like to know which one of these configuration parameters,
blue or red or green, can be the root cause of the problem that you're seeing
at the end.
So the blue option cannot be the root cause, because even if you change that, if you manage to get the application to take the other path, it would still merge before the error. So we get to the same error at the end. So it cannot
be the root cause. The red option cannot be the root cause either, because if
you change it and you get the application to go the other way, you would avoid
the original error, but then you would lead to new errors on the other path.
So the red option cannot be the root cause either.
The green option, however, can potentially be the root cause, because if you
change it, you would avoid the original error, and then it seems to be
successfully continuing and not leading to new errors. And that's exactly what
ConfAid returns: a list of configuration options that, if changed, would avoid the original error and wouldn't themselves lead to new errors.
All right. So now I'm going to talk about the algorithms that we use for taint
tracking. Yes, please?
>>: Question. You said green might be the root cause, but couldn't it be like a combination of red plus green that would lead to the path?
>> Mona Attariyan: Yeah, so this is a very simple case. For instance, let's
say if you have an option here that's dependent on both of them, of course.
But this is a very simple case where you're assuming that this one is only
dependent on green. Of course, you can have cases where it's green and red and
we can tell you that. But this is just very simplified.
>>: But you [indiscernible] to go right on the branch, right, in the second?
>> Mona Attariyan: Yeah.
>>: If it goes left, then you wouldn't trigger it, right?
>> Mona Attariyan: Yeah, the idea is if you change red and it goes left, then it's not good, because you'd see a new problem. And that's not what you want. So if you change red, then you won't see that error that you were seeing, but you see a new error, and that's not good either.
>>: [indiscernible] say A has three values and there's a third branch that doesn't lead to the error.
>> Mona Attariyan: So if it's possible that it doesn't go here, then we consider that.
>>: Okay.
>> Mona Attariyan: This is a case where we know it would go there.
>>: Okay.
>> Mona Attariyan: Okay. So I'm going to talk about the algorithms that we
use for taint tracking now. So before I get to the details, I'm going to talk
about why we decided to do taint tracking. So information flow analysis in
general can be implemented via multiple different ways. You can do it
statically. You can do it dynamically. You can use symbolic execution, just
to name a few. Why did we decide to go with taint tracking?
So we had several design principles in mind that kind of led us to this decision. First of all, we thought that a practical tool cannot rely on the
source code of the application simply because for many of the applications that
we use every day, we don't have the source code. So it has to rely only on the
binary.
The second point is that a practical tool has to be able to analyze complex
applications. Because these are the kind of applications that we usually have
problems with, so we have to be able to analyze complex data structures, and it has to support multithreading, inter-process communication and things like that.
And the third point is that we need to have reasonable performance. So the good thing about troubleshooting is that we are competing against humans, so we
don't have to be extremely fast. You probably won't mind waiting a minute or
two for the troubleshooting to finish, but you probably do mind waiting 20
hours. You might as well just go ahead and Google, as Ed said, and find the
answer. So we need to have reasonable performance.
Other implementations of information flow analysis that we had at the time fall short in meeting at least one of these criteria. So we decided to go with
taint tracking. And taint tracking, as I mentioned, is actually pretty popular
in security.
So here I want to suggest that it actually might be a better fit for the troubleshooting problem compared to security. And here are a couple of
reasons.
First of all, our environment is not adversarial. The developer of the
application, unlike the hacker, does not have an incentive to bypass our
system. At worst, they're going to be agnostic to your system. And at best,
they're going to write the program in such a way that lends itself better to
the type of heuristics that we use.
And also, I think that performance is probably less of an issue for us because,
you know, again, a couple minutes might be okay for troubleshooting. But if
you have to wait a couple minutes every time you want to load the web page,
that's probably a problem.
So for these reasons, I think taint tracking might be a better fit for our problem compared to security. Security people, if they object, they can -- okay.
So let's get to details here. So let me introduce a notation. TF(X) is the taint set of variable X, and it includes all the configuration tokens that, if changed, might change the value of X. And I used colored triangles here to show different configuration tokens.
Taint propagates via data flow and control flow. Here's a simple example of
data flow. We have X equals Y plus Z. The taint of Y is the red and blue configuration tokens and the taint of Z is green and blue. So, of course, if any of these tokens change, potentially the value of X might change as well. So the taint of X is going to have all of them, the union of the two sets.
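As a rough sketch of that data-flow rule, assuming one taint set per variable (the real system tracks taint per register and memory location):

```python
# Data-flow propagation: for X = Y + Z, TF(X) = TF(Y) | TF(Z).
tf_y = {"red", "blue"}    # config tokens that could change Y
tf_z = {"green", "blue"}  # config tokens that could change Z

# Changing any token in either set may change X.
tf_x = tf_y | tf_z
print(tf_x)  # {'red', 'blue', 'green'}
```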
Taint also propagates via control flow. And many of the systems that implement taint tracking actually ignore control flow because it's expensive and it's more difficult. However, we realized that for our purposes it's actually pretty essential, because most of our taint gets propagated via control flow.
So here's a simple example. We have, you know, a condition that's tainted, here's C. And we would like to know what could cause the value of X to be different at the end. So, of course, the value of X could be different because A is different, and that's via data flow, as I just explained.
The value of X could be different because of the value of C. Because if you
change C, you might get the program to not run this, and then therefore, the
value of X could be different.
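A minimal sketch of that control-flow rule, using source-level names for clarity (the actual analysis happens at the instruction level): a variable assigned under a tainted branch picks up the branch condition's taint in addition to data-flow taint.

```python
# Control-flow propagation sketch for: if C: X = A
# X depends on A (data flow) and on C (control flow), because
# changing C could make the program skip this assignment.

tf_a = {"green"}   # taint of A
tf_c = {"red"}     # taint of the branch condition C

def assign_under_branch(rhs_taint, branch_taint):
    """Taint of the assigned variable = data flow | control flow."""
    return rhs_taint | branch_taint

tf_x = assign_under_branch(tf_a, tf_c)
print(tf_x)  # {'green', 'red'}
```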
There's another subtle way that we could change the value of X, and that's by changing the previous value of X and at the same time changing C to make the application take the else part, this other part. But note that both of these need to change at the same time to give us a
different value for X. So let me introduce to you our first heuristic.
ConfAid currently does not follow joint configuration -- joint root causes.
Basically, it won't tell you that these two need to change at the same time.
It will tell you that these are both potential root causes, but it won't tell
you that you have to change them at the same time so we're basically not
following this final term.
>>:
By which you mean that will just be a blue triangle?
>> Mona Attariyan: No, we would just not have that. We basically say that,
okay, either value of X is going to change because of A or we would change C
and the value of X is going to be what it was before. So if the actual root
cause is the blue triangle what you do is that first you'll see red and green.
You change green and then you'd see blue in the next round, because you would
then run this and then the blue would show up in the next round.
So you might have to run your application multiple times to see all the
possible root causes. It really depends on the structure of the program,
though.
Okay. There is another subtle way that taint can propagate via control flow, and that's via the code that doesn't run. So let's take a look at this example. Here C is tainted and the application takes [indiscernible]. The
value of Y is technically dependent on the value of C, though. Because if the
value of C is different, then the application could potentially take the else
part and that would change the value of Y.
ConfAid is interested in finding these kinds of dependencies as well. However, as I explained, ConfAid does the analysis as the application runs, and the application doesn't run the else part, so how do we find out about these kinds of dependencies?
So here's what we do. When we run the application, if we see a conditional that's tainted, we take a checkpoint, we flip the conditional, we make the application artificially go the other way, we run it and we find out about assignments like Y equals A on the -- we call this the alternate path. When it merges, we basically roll back everything we did, restore the checkpoint, and we continue.
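Here is a small, self-contained sketch of that checkpoint-flip-roll-back procedure for a single tainted branch. The simulated program and names are invented, and as described next, the real system also bounds the alternate path by an instruction budget.

```python
import copy

def taint_tainted_branch(state, taint, c_value):
    """Simulates: if C: Y = 1 else: Y = 2, where C is tainted."""
    checkpoint = (copy.deepcopy(state), copy.deepcopy(taint))

    # 1. Flip the conditional and run the path that did NOT execute,
    #    to discover assignments like "Y = ..." on the alternate path.
    if not c_value:
        state["Y"] = 1
    else:
        state["Y"] = 2
    # A variable assigned there depends on C via control flow.
    alternate_taint = taint.get("Y", set()) | taint["C"]

    # 2. Roll back everything the alternate path did...
    state, taint = checkpoint
    # ...but keep the control-flow dependency we learned.
    taint["Y"] = taint.get("Y", set()) | alternate_taint

    # 3. Continue on the real path.
    state["Y"] = 1 if c_value else 2
    return state, taint

print(taint_tainted_branch({}, {"C": {"opt"}}, True))
# -> ({'Y': 1}, {'C': {'opt'}, 'Y': {'opt'}})
```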
So let me tell you about our second heuristic. We only run the alternate path up to a certain threshold, and that's basically to prevent ConfAid from being stuck
in a really long alternate path. So what happens is that ConfAid runs the
alternate path. If it hits the maximum number of instructions, it will just
say okay, I couldn't see the merge point. I've run enough. I'm just going to
roll back and continue.
So this may cause false positives and false negatives, but it has big performance gains for us. So we decided to do this.
All right. So usually at this point, people ask, well, how about false positives? Do we get a lot of false positives? Don't you see a case where the error is somehow dependent on all configuration options? The answer is actually yes, we did see something like that. And the problem was that we basically treated all kinds of taint propagation equally. So data flow was basically equal to control flow, and also we treated taint like a binary value. A variable is either tainted by an option or not tainted by an option.
And we realized that that's not actually sufficient. For instance, we saw that the conditionals that are closer to the error are usually much more relevant to the error compared to the conditionals that are very far from the error. And also, we saw that data flow, as I introduced it, is a stronger dependency compared to control flow. However, we couldn't capture this with the regular taint analysis.
So we introduced our next heuristic, which we call the weighting heuristic. And what it does is that it assigns weights to the taints as they propagate, and these weights are basically numerical values that indicate how strong ConfAid thinks that taint is.
So the way that we assign these weights is that the conditional that's closer to the error gets to propagate a bigger weight compared to the conditional that's farther away, and the taint that's coming from a data flow is going to have a bigger weight compared to the one that's coming from a control flow.
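A rough sketch of the weighting idea; the decay constants below are invented for illustration, not ConfAid's actual values. Taint carries a numeric weight that is attenuated more by control flow than by data flow, so a token reaching the error through many distant conditionals ends up with a small weight.

```python
# Weighted taint sketch: taint is a map {config_option: weight}.
DATA_FLOW_DECAY = 1.0      # data flow keeps the weight (assumed)
CONTROL_FLOW_DECAY = 0.5   # control flow weakens it (assumed)

def propagate(taint, decay):
    return {opt: w * decay for opt, w in taint.items()}

def merge(*taints):
    out = {}
    for t in taints:
        for opt, w in t.items():
            out[opt] = max(out.get(opt, 0.0), w)
    return out

# Error tainted via one data-flow step from option A, and via two
# control-flow steps (two conditionals) from option B:
a = propagate({"A": 1.0}, DATA_FLOW_DECAY)
b = propagate(propagate({"B": 1.0}, CONTROL_FLOW_DECAY),
              CONTROL_FLOW_DECAY)

# Rank root causes by weight: A (1.0) outranks B (0.25).
for opt, w in sorted(merge(a, b).items(), key=lambda kv: -kv[1]):
    print(opt, w)
```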
So now, with these weights, ConfAid's able to actually rank the root causes for you. That's how we get the ranked list. The configuration options that get higher weights are ranked first, and then it goes from there. So that's how we get the ranked list. Yes?
>>: So now that there are false positives, how do you deal with tainted pointers?
>> Mona Attariyan: What do you mean, like a tainted pointer to like a --
>>: [indiscernible] but not the content is tainted. So there are two ways to deal with this, right. So for some [indiscernible] looks up stuff, if the index is tainted, then, you know, although the content is not tainted, for some applications you treat that content as if the taint should propagate. For some other cases, that would cause a lot of false positives so you don't want to do that.
>> Mona Attariyan: Currently, we do. If the address is tainted, we take the taint. Currently, that's how we do it.
>>: You take the taint?
>> Mona Attariyan: We do take the taint. So we basically say if your address
is tainted, whatever you're taking is going to have that taint as well. So,
for instance, if, let's say, you are looking at, you know, you're traversing an array and the index is tainted, that's going to taint whatever you read.
>>: Right, but if the base of the table is tainted, right, the address, right, then you may cause a lot of false positives, right?
>> Mona Attariyan: Yes, but if you don't do that -- if you don't do that, it
causes false negatives. So what we did was that we basically felt that, okay,
if we can deal with the false positive part, we better do that than have a list
that does not contain the actual root cause. So we actually had cases where we
didn't do that and we saw false negatives so we just decided to do that.
It causes false positives, but the good thing is that if you're doing the weighting, it might just fade away, and that means it just disappears.
>>: But in that case, does the weight help to reduce false positives caused by putting this address --
>> Mona Attariyan: The taint from the address?
>>: Right.
>> Mona Attariyan: So we did it, and then we did the weighting, so I'm not quite sure which one of the cases, if we didn't have the weight, would cause that false positive because of the address. I don't know specifically which one of the cases would be worse. But if we don't have the weighting, we're going to have a lot of false positives in general.
Okay. So the analysis that I just described is actually pretty expensive. The slowdown is in the order of two, you know, two orders of magnitude. So it's actually a pretty expensive analysis. That might be okay for, you know, the
application maybe if you're just running your application on a desktop and
every now and then you need to troubleshoot. But it's not okay if you want to
troubleshoot maybe a server in a production environment.
And also, sometimes the symptoms of the problem are time dependent. So if you
are perturbing the timing a lot, you might not see the symptoms again.
Symptoms might change. And also, we are kind of relying here on the user to reproduce the problem for us. So you see a problem, and now you want to analyze it. You reproduce the problem and then we would analyze it for you.
However, some problems, especially performance problems, are really hard to reproduce. You might not be able to, you know, right away create it again. To
address all of these problems, we decided to develop a very lightweight
deterministic record and replay system and we run the applications all on top
of this. This is all internal, and the deterministic record and replay system
basically, what it does is that when the application is running, it records all the non-deterministic events. For instance, return values of system calls, signals and all the timing information. And it records all of that in a log and later
we use this log to recreate the exact same execution and then we run all the
analysis on the replay of the execution.
So basically, we get rid of all this overhead on the online system. So as I mentioned, we use a log; we replay the execution later and then we run the heavy analysis on the replay.
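A toy sketch of that record/replay idea, greatly simplified (the real system logs at the system call boundary and also preserves timing): record every nondeterministic result in a log online, then feed the same results back so the instrumented replay sees an identical execution.

```python
import random

def run(app, nondet, log=None, recording=True):
    """Record nondeterministic results, or replay them from a log."""
    if recording:
        log = []
        def source():
            v = nondet()      # e.g. a system call's return value
            log.append(v)     # record it in the log
            return v
    else:
        it = iter(log)
        def source():
            return next(it)   # replay the recorded value
    return app(source), log

def app(source):
    # Stand-in for code consuming syscall results, signals, timing.
    return [source() for _ in range(3)]

out1, log = run(app, lambda: random.randint(0, 99), recording=True)
out2, _ = run(app, None, log=log, recording=False)
assert out1 == out2  # the replay recreates the exact same execution
```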
I'm not going to go into too much detail on this system, but the main difference between our system and other deterministic record and replay projects out there is in the fidelity of our replay.
So the fidelity of our replay needs to be strict enough to recreate the same execution as the recording. However, because we do analysis on the replay, we basically instrument the replay and then we do the analysis inside it. Our fidelity should be loose enough to allow this extra code to run within it. And we achieve this via careful code design. Our replay system is instrumentation [indiscernible]. It can differentiate between the replay code and the analysis code that we are running.
So I'm not going to talk too much about this. I'd be happy to talk to you offline about this if you're interested.
All right. So let me show you some results. So we used ConfAid to
troubleshoot three applications. OpenSSH, Apache web server and Postfix mail
server. We looked online, looked at the forums and manuals and found 18
misconfigurations that people reported for these three applications. We
recreated these and then we ran them in ConfAid to see if it could find out the
correct root cause.
And ConfAid was very successful. It correctly found the root cause, ranking it first or second in all of these cases. And these are the total
number of configuration options that were available in the configuration files.
So in 72% of the cases, the correct root cause was ranked first. In five
cases, it was ranked second, and we never ranked it worse than second. Yes?
>>: Would you talk a little bit more about the variety? You know, are these shallow configuration bugs, or bugs that are deeply nested in configuration files? And does that make a difference on the complexity for ConfAid?
>> Mona Attariyan: It does make a difference. I'm going to show you another
set of evaluation too right after this. These are mostly very deep
configuration problems. Where people actually tried a couple different things.
They couldn't figure out what it was and they actually posted in a forum,
waited a couple days.
So actually, when you look at it in the code, it's actually pretty deep. It usually takes a while.
I have another data set I'm going to show you after this that creates more
shallow cases. And definitely for the shallow ones, it's easier and, you know,
the results are actually going to show that as well.
And there was another question?
>>: So if you're a user using ConfAid, how do you specify the failure point?
>> Mona Attariyan: Oh, that's a very good question. So there are different
types of failure. There are some failures that are obvious like, you know, a
search or crash or something like that. There are failures that are not
obvious and we're relying on the user to tell us. For instance, you see a
message and you just don't like it. You tell us that this is an error. Or it might not even be a message like that. You might, you know, for instance, run Apache and get a file from Apache and just tell us that this is wrong.
So we basically rely on the user to tell us what is wrong and what is right.
Right now, we have a very simple way of doing that. So the user tells us: whenever you're printing something with this message on the screen, or you're giving me something with this content over the network, this is wrong.
>>: But, I mean, it may not be easy to use because you assume the user doesn't look at the source code and doesn't know the source code.
>> Mona Attariyan: You don't really require the source code for that, because
you only see the outcome. So you have a way of figuring out that something is
wrong. So you see a message, you say okay, this is an error to me. You see a
content that seems wrong to you.
So you just, you don't need to know about the source code. Just see what, you know, the application is printing or giving you. So if you just specify
that to us, that would be sufficient.
>>: [indiscernible] relying on the user is that multiple root causes might generate the same user-visible error.
>> Mona Attariyan: Sure. Sure.
>>: Then how does ConfAid know right from wrong --
>> Mona Attariyan: So we are looking at that specific execution that caused
that message. So, of course, there could be other ways that could lead to the
same problem. But we are analyzing the specific execution that happened. See
what I'm saying? So there might be multiple ways to get there. We're not
analyzing those. We're only analyzing the actual execution that you saw that
led to that error.
So we are recreating the error. We are using the record and replay to record
the exact same error and we are analyzing that execution path and then we see
which one of the options are affecting that execution path.
Does that answer your question?
>>: I think so. I think there are some gaps in understanding, but that's fine.
>> Mona Attariyan: Okay. Yes?
>>: When you were building the algorithm that prioritizes which configuration
settings, what was your training set like? Do you have bugs from these three
applications, or were they from different applications?
>> Mona Attariyan: We did not have bugs from -- so we tried OpenSSH first, and we saw that there were a lot of false positives. And then we saw that most of the time, the conditional that's closer is relevant, and sometimes we were reporting something that's very far. So we realized that that might be something that we should be looking at. And then we added that to the code and then we ran Apache and Postfix and later [indiscernible] and they all seemed to be pretty good afterwards.
>>: So you used these bugs. Did you use them to train?
>> Mona Attariyan: I used OpenSSH. So at first, we did OpenSSH. And then it ran fine for a couple of the bugs and then for a couple more, we saw a lot of false positives. And then we decided to fix the false positive problem. And then we introduced the weighting heuristic. But then afterwards --
>>: So you're training and testing on the same bugs?
>> Mona Attariyan: There's not much training. It just gave us the idea that maybe we should have a way of specifying which conditionals are more important, which taints are more important. We're not really using any statistical method to figure that out. We basically have this simple heuristic that says the conditional that's closer is more important.
>>: But you came up with a heuristic based on these bugs?
>> Mona Attariyan: Based on the OpenSSH files.
>>: Okay.
>> Mona Attariyan: Yes. And then we used that and we ran Apache and Postfix
and they both ran great and then we ran the other set that I'm going to show
you after this and they ran great.
>>: So you didn't change your heuristics at all after OpenSSH?
>> Mona Attariyan: No.
>>: You didn't touch anything after Apache or Postfix?
>> Mona Attariyan: No. And then we also did X-ray and that was fine too.
>>: Okay.
>>: So behind the [inaudible] therefore INS is completed, which has happened
to me. Does that show up as a token here, as a system call failure, or how
does that show up in your system?
>> Mona Attariyan: I think it's permissions. Is it kind of, you know, something close maybe to Unix file permissions?
>>: [indiscernible] create files [indiscernible].
>> Mona Attariyan: So here, I'm looking at the parameters in the configuration file so it won't show as a problem that I'm looking at, but this technique can be extended to also follow those kinds of configuration values as well. So it can be easily extended -- actually, we're doing it, extending it to also --
>>: [indiscernible] on a Windows machine, because the permissions check is in the kernel. It's not in my IIS process.
>> Mona Attariyan: Let me try to understand. So you're trying to read
something and it says that you don't have the permission, right? That's pretty
much enough for us, because we say, okay, the permission is wrong here. So for instance, for Unix, you know, you perform a system call. It gives you a code that says permission denied, something like that. So we just use that and say, okay, the permission was wrong for this file. So we don't really need to go into the kernel and see how exactly it does it. We just use the result.
>>: At this point, you didn't have the information flow from the actual permission setting on the file through to the failure of the call?
>> Mona Attariyan: No. We have it -- we just see the end of it that says, okay, you don't have permission. We certainly don't follow the kernel as well. When we have a system call, we don't go into the kernel.
>>: Thank you.
>> Mona Attariyan: Andrew?
So the object of this analysis is the name of the code figuration variable
18
that is incorrect. But do you know what the right value is, or do you know how
to fix the problem?
>> Mona Attariyan: We don't. So we basically tell you that these are the
options that are most likely causing your problem. We don't tell you how to
fix it.
>>: So is it easy, in these cases, like was it easy to fix the problem?
>> Mona Attariyan: So here's the thing. Here's why we don't tell you how to
fix it. Sometimes changing that option is not necessarily the right fix. For
instance, you see that, okay, I can't access this because of authentication
problem. You don't want to remove authentication. You don't want to change
authentication, right.
So we basically suggest to the user, this is causing your problem. It's up to
the user to decide whether they want to add something, whether they want to
change the value. Of course you can change it and run the tool again to see if
there are any new root causes. But we don't necessarily make that decision of
whether we should change it or not. Andrew.
>>: I guess another way [indiscernible] is that potentially, the configuration of the system is not just a configuration file, but the permissions on the file and --
>> Mona Attariyan: That's correct.
>>: How does this scale when the size of the configuration set gets to be huge?
>> Mona Attariyan: That's a very good question. You know, configuration in
general is, you know, actually pretty fuzzy. It can be a lot of things. As I mentioned earlier, you know, your libraries, any file, you know, in your system, the environment variables, all of these are considered to be configuration. And also, the configuration file itself can maybe be extremely
huge.
We didn't really try; you know, what we tried was on the order of a hundred configuration tokens. But I can imagine systems that might have thousands. How does it scale? I can't tell you for sure, because I didn't do it, but there is overhead in terms of the amount of memory that we use, first of all, and also, as you're running it, the way that we're doing it, we're copying taint for these configuration options over and over in memory. So as your taint set gets larger and larger, you need to do more when you're doing the analysis. Does that make sense?
>>: I guess my concern is that if everything in the system is configuration, which it potentially is, then everything is going to get tainted.
>> Mona Attariyan: So your application, everything in the system can be
configuration, but your application might not read everything in the system.
>>: Everything [indiscernible] reads is sort of totally external [indiscernible] the application -- potential sources of configuration.
>> Mona Attariyan: Sure. So your application might start with reading a lot
of things from the system. The good thing is that it doesn't use all of that
to go down a certain path. It might use all of that to go down all the paths
in its lifetime, but you're only looking at one single execution path, and it
doesn't use all of that for making decisions for one single execution path.
That's the good thing about our system is that when we are writing the code, we
are not using all of these little pieces. There are actually studies that show that usually, there's only one or two options that are causing a problem. It's not like ten million different options that are
causing your problem.
>>: So a follow-up question. So how big is your [indiscernible]? So is it like eight bits or 16 bits or 32 bits?
>> Mona Attariyan: So right now, we have one byte per configuration option that we're tracking. And this gets a little bit into too much detail, but the way that we do it is that it shows us the weight. So we have that much maximal weight. So each configuration option gets some bytes, and if you increase the number of configuration options for each variable, this is going to increase. But each configuration option gets one byte. So if a variable has, like, 50 different configuration options, that variable is going to have 50 bytes, one for each configuration option.
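A sketch of that representation, with illustrative sizes: each tainted variable or memory location carries one byte of weight per tracked configuration option, held in a sparse shadow map so untainted memory costs nothing.

```python
import array

NUM_OPTIONS = 50  # tracked configuration options (illustrative)

def new_taint():
    # One unsigned byte (0..255) of weight per option.
    return array.array("B", [0] * NUM_OPTIONS)

shadow = {}  # address -> taint bytes (sparse: most memory untainted)

t = new_taint()
t[7] = 200            # option #7 taints this location, weight 200
shadow[0x1000] = t
print(len(shadow[0x1000]), shadow[0x1000][7])  # 50 200
```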
>>: Is there a shadow memory that gives [indiscernible]?
>> Mona Attariyan: Yes. So the good thing is that not all your memory becomes
dependent on configuration options. So, for instance, you run Apache and at
the end maybe like 70K of it was dependent on configuration option. So you
kind of, there's a big overhead in terms of what we keep. But the good thing
is that not all of your memory needs to have that much overhead.
So now, I can imagine cases where all of a sudden you have giant pieces of memory become dependent on a lot of taint. We haven't seen that case yet, but what we need to do is kind of maybe get rid of some of this taint, maybe make it smaller, fade away some of the taint, maybe make a compromise to keep more. But there's a memory overhead.
>>: So you have a lookup table for each variable that points you to this
taint?
>> Mona Attariyan: Yes. It's a three-level lookup table, kind of like a page table type of structure. Yes?
>>: Do you use [indiscernible] or binary instrumentation to run the taint analysis?
>> Mona Attariyan: We do it dynamically if that's what you mean. We use
binary instrumentation and we add all the analysis as the binary runs.
>>: So it feels like this could be very useful for developers. Actually, another approach is to use [indiscernible] project, this kind of approach, to do fuzzing and to maybe focus on the configuration knobs and maybe those [indiscernible].
>> Mona Attariyan: Definitely. It can definitely be useful for developers,
although we try to kind of -- we try to not use the source code so it's also
useful for end users and others. But if you have the source code, it's actually
going to be much easier. So it can also be used by developers as well. Yes?
>>: So I was curious about the need to execute multiple times. So if there
are dependencies, you said you were going to maybe change some parameters?
>> Mona Attariyan: So let's say you have a -- let's say you have a problem where you need to change two things to get it fixed. Depending on the structure of the application, we might give you both of them. It really depends on how the application checks for them. We might give you both of
them, like first and second. And then you fix the first one. We don't tell
you you have to change them at the same time. You fix the first one. It
doesn't go away. You run again and you see the second one coming up and you
fix the second one.
There are cases where we might miss the second one. And you see the first one,
you change that, you run again, and then you see the second one. So we may or may not show you all the configuration options. It really depends on how the application checks for this.
So for instance, if you change this and now you go to the alternate path, we
kind of, when we were exploring the alternate path, we maybe aborted early, we
didn't see the second one. Then we might miss it like that.
>>: So I wonder, if you're going to sort of allow multiple runs in your experiment and observe the outcomes, then are there other approaches so you can -- and it's sort of analogous to SAGE, but you don't need symbolic execution. I mean, you have your input file or your configuration file and you have a hypothesis that some byte is the cause, so you change that to some other value and then you run again and you observe, and you do things like code coverage or [indiscernible], things that are very cheap approximations to taint. But instead of tracking taint, what you say is, hm, most likely, if there's a change to this one byte and I look at the code coverage before and after, the differences in the code coverage are only -- if everything else is deterministic, right, then if that's the only change I've made, then if I see these differences in the code coverage, those are likely very related to the change.
>> Mona Attariyan: So I think that approach works very well for fuzz testing, because you change something and you see it. Here, we don't know what to change, necessarily. You have a hundred different options; which one are you going to change? That's the problem here.
What we are trying to give you is that we tell you, okay, these three, maybe, are the most important ones. So maybe you want to go use something like that, like change them a little bit and see how it goes. But it narrows down what you need to look at a whole lot. Simply by --
>>: So you're saying it's sort of complementary?
>> Mona Attariyan: Yeah, once you do that, then you might be -- so let's say you want to fix your problem. You say that, okay, this option is my problem. You want to automatically fix it. Now maybe you can go change it, you know, bit by bit and see where it goes. Yes?
>>: Follow-up question. So since you know the kind of the failure, have you
done kind of backwards slicing to see what options are in that code to see
whether your thing is now -- can narrow that much farther than the simple
backwards slicing?
>> Mona Attariyan: So we didn't do backwards slicing. The main reason is that backwards slicing for really long execution paths is not very successful. So here, we have cases where it usually happens that you read configuration at the very beginning. You run a long path. Sometimes you go through processes, and then you get to the error.
For Postfix, for example, there are five processes before you get to the error.
So the configuration is in this process. The error is in another process. Backwards slicing cannot really go that far and kind of has problems going back for really long executions.
So that's one of the reasons that we decided to not do program slicing in the
first place.
>>:
[indiscernible] very interesting [indiscernible] that's my question.
>> Mona Attariyan: Okay. So this is the other data set that James wanted to see. So yeah, we used a tool called ConfErr. It was developed at EPFL. And what it does is it randomly generates bugs in configuration files that look like human errors. Why do you laugh, Andrew?
>>: Do you need a tool for that?
>> Mona Attariyan: I developed a tool. It's actually pretty good for testing. If you want to see whether your application fails horribly or dies gracefully if a configuration value is wrong, that's the tool that you use. So it was very useful for us, because then we generated 60 bugs using it
and we didn't change any of our heuristics, as Stuart asked, and we ran ConfAid
again. And in 55 cases, ConfAid was able to rank the correct root cause first
or second.
So again, in 85% of the cases, it was actually ranked first. And in 7%, it was ranked second. And there were five cases where we didn't rank well, worse than second. So three of these cases -- yes, Ed?
>>: Go ahead and finish.
>> Mona Attariyan: In three cases in Postfix, the correct root cause was a
missing configuration option, and that's something that ConfAid doesn't
currently support. So you had to add a new configuration option to fix the
problem. So that was the three Postfix ones. For the Apache one, the correct configuration option was ranked ninth, and that was a direct result of the weighting heuristic. And the OpenSSH one didn't finish. We needed some more support for a system call, so it didn't complete. Yes?
>>: [inaudible].
>> Mona Attariyan: So quickly, I'm going to show you some performance results. For the real world bugs, the average time is one minute and 32 seconds for the troubleshooting to finish. Again, I want to emphasize that this runs on the replay execution and not online. And for the randomly generated ones, it took 23 seconds.
Going back to Ed's question, these turned out to be shallower bugs, you know.
For instance, you had, you know, a configuration option that only accepted one
to ten, and you gave it 12. Of course, you know, right away it failed.
There were ones that were kind of deeper, but usually they were kind of easier.
The real world bugs turned out to be much harder.
Okay. So a kind of final note on ConfAid. People usually ask, you know, why is ConfAid successful? And here is my thought on it. So usually, the
configuration problems that we see, usually once you find the root cause, it's
kind of obvious. Most of the time, there is one or two configuration options
that are causing the problem and, of course, we can have a case where you have
20 different configuration options causing a problem, but that's very rare.
There are actually studies that are published that actually support this. That
usually one or two configuration options are causing a problem.
So yes?
>>: Quickly, this seems like a big claim. I wonder if you actually talked to
administrators or something to ask [indiscernible] and then they could actually
go and fix it?
>> Mona Attariyan: So you're asking whether the answers are correct, or whether the one or two configuration option claim is?
>>: The output is useful.
>> Mona Attariyan: Oh, okay. So I think the way that we evaluated whether the output is useful or not is just by looking at whether it was the correct configuration option that was causing the problem. Whether you go and change it and then that would fix the problem is a different question, I think.
So we recreated the problems and we saw that, okay, this is telling me that
this is your root cause and this is the correct root cause that it's telling
me. We actually use it a couple times for our problems as well. So we found
it useful. We didn't ask any administrators to use it, though. The problem
mostly was they didn't want to run our deterministic record and replay in their
kernel. So we need to convince them to do that.
But I think it is useful. In all the cases that we tried, we found it to be able to narrow down the options a lot. We found that very useful.
>>: I think the output, I mean [indiscernible], like, say the output up here is very different to somebody who is building the tool versus somebody who is [indiscernible] versus somebody who is just editing and running that program.
>> Mona Attariyan:
Correct.
>>: So it would be kind of interesting to see if you take this output to admins and show them the symptoms and the output, see if they actually can fix it.
>> Mona Attariyan: We actually feel that our tool is probably most useful for people who may not have written the code, may not be familiar with the code and they're just using it. Because what it gives you is very general, like a very high level result. It's not going to tell you about variable X. It's going to tell you about this configuration option that you actually have access to and can change.
>>: There's a missing part in the argument where you haven't closed the loop.
>> Mona Attariyan: Whether it's useful in --
>>: Yeah, for admins --
>> Mona Attariyan: Any system, if you can give it to people and they come back and tell you that we used it and it was great, of course it's going to be awesome. We unfortunately didn't have the time to do that. Within our group, we used it and found it interesting. We are making it
available, actually, the source code and it would be interesting to see if
people find it useful as well.
>>: I have a comment. I think the user -- the programs users run on desktops are more complicated than the server programs, like OpenSSH. And would they have -- particularly on Windows, right. If I just open a program, there's a large number of [indiscernible] to be accessed and many files being opened, and they involve many DLLs.
>>: Do you think [indiscernible] is more complicated than [indiscernible]?
>>: She evaluated OpenSSH, the three --
>> Mona Attariyan: Have you seen Postfix? So I believe that there are many applications on the desktop --
>>: So let me reformulate my question. How large is the taint source?
>> Mona Attariyan: How large is the taint source? How big is the configuration file?
>>: Yes.
>> Mona Attariyan: On the order of hundreds of configuration tokens.
>>: Hundreds. I mean, look at a user program and look at how many registry keys they read.
>> Mona Attariyan: Sure. I don't think that is necessarily going to, you know, translate into bad results. Of course, if you have a much larger set, it might. It certainly affects performance. It would be interesting to see if it's going to result in, like, worse output too. I don't necessarily think that it translates directly into worse results and more false positives. I do believe that some of the server applications are pretty complex. Postfix is a nightmare. We also did PostgreSQL; obviously it's a database, and that also was pretty complex. So we didn't just try simple applications.
Yes?
>>: There's actually [indiscernible] outsources IT, hiring someone to fix your
computer over the web. Just thinking you might want to be able to look at this
in context and say you've got X. He's going to have to go through
[indiscernible] today. Maybe this would cut down the time per call.
>> Mona Attariyan: Sure, that would be interesting. All right. So just to finish my thought here, finding a root cause is basically like finding a needle in a haystack. There's a lot of work that you need to do. But once you find it, it's obvious. And the good news is that computers are actually good at finding needles in haystacks, and I think that's why ConfAid turned out to be so successful.
All right. So moving on quickly, I'm going to talk about X-ray. So far, I
talked about configuration problems that lead to incorrect output. And there is another big category of configuration problems that cause performance issues and don't necessarily cause incorrect outcomes. And X-ray deals with those kinds of misconfigurations. So what do you do when you have a performance problem? Usually, people use monitoring tools -- profilers, tracing, logging -- to see what's going on in the system.
The problem with all these tools is that they tell you what events are
happening in your system. What you really want to know is why those events are
happening in your system so now you need to manually infer why and that's the
part that needs a lot of expertise.
So wouldn't it be great if you could automatically infer why as well? And I mean, it would be even greater if you could have a ranked list of root causes. I see a smile. And that's exactly what X-ray tries to do.
So X-ray currently analyzes latency, CPU, disk and network. You can use X-ray
to analyze at the granularity of one single request. For instance, for
applications like servers that handle requests. Or you can analyze over a time
interval. And X-ray also gives you this powerful tool where you can analyze two or multiple different requests that you think should have similar performance, but don't.
So here are a bunch of questions that you can ask X-ray. For instance, you can say: I have a server; why is this request being handled so slowly? Or why is CPU usage high over this time interval? Or: I have these two different requests, I think they should be similar, why are they different?
So let's talk about the idea of X-ray. We call it performance summarization. In ConfAid, as I explained, we were basically interested in
finding out why a certain piece of code, for instance, an error, ran. This bad
red block of code, why did that run? In X-ray, the problem is we don't know
where this red block is, but nothing really prevents us from treating the
entire code like red blocks of code and determining why all events in the code
ran. That's exactly what we do.
So from a really high level, this is how X-ray works. We assign a cost, basically a performance cost, to different events of the execution. Those are instructions and system calls. And then we determine, using a ConfAid-like
analysis, why each of those ran. And then we associate this performance cost with the root causes that we just determined, and then we aggregate over the
entire execution and then we rank the results.
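From that description, the core loop might look like the following sketch. The event format and the even cost-splitting across root causes are illustrative assumptions, not X-ray's actual accounting.

```python
from collections import defaultdict

# Performance summarization sketch. Each event carries a cost
# (latency, CPU, disk or network) and the set of root causes the
# ConfAid-like "why did this run?" analysis attributed to it.
events = [
    {"cost_us": 10,  "causes": {"A"}},       # small block, due to A
    {"cost_us": 100, "causes": {"B"}},       # long syscall, due to B
    {"cost_us": 30,  "causes": {"A", "B"}},  # depends on both
]

totals = defaultdict(float)
for e in events:
    share = e["cost_us"] / len(e["causes"])  # assumed even split
    for cause in e["causes"]:
        totals[cause] += share

# Rank: B (115.0) outranks A (25.0) as a performance contributor.
for cause, cost in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(cause, cost)
```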
So I'm going to use one of the examples that I told you in the last slide. I'm
going to walk you through X-ray, tell you exactly how it works. So let's say
we have a server. It's handling requests and one of the requests is
particularly slow. We want to know why.
So first step, as I mentioned, X-ray analyzes execution. But here, we are
interested in the execution that's related to that single request. Not the
entire execution. And that is not always straightforward. Sometimes you have applications that use multiple processes to handle requests. You know,
the request comes in, it runs for a while in one process. It then goes to
another process and it continues. We are basically interested in all these blue pieces; all of these are relevant to our request.
That's exactly what X-ray does. As the request travels between processes, it
collects all these executions that are relevant and then once it's done, it
basically says okay, these are all of the execution pieces within all of these
process that I care about. So once we have that, then we do the cost
assignment.
As I mentioned, we assign a cost to the events, and the events are instructions
and system calls. Here, we want to see why a certain request is slow so we
want to look at latency. And the latency for system calls is basically the execution time of the system call, which we collect online as part of the recording. And for instructions, we approximate the execution time of each instruction and [indiscernible] to each instruction. Yes, Andrew?
>>: [indiscernible] takes a very long time because of a context switch to somebody else. How does this play into this?
>> Mona Attariyan: Good point. If you analyze that single request, it might
be misleading, because that single request wasn't running. It was just
sitting. What you want to do is look at a time interval, because that would
include other processes that were actually running at the same time.
So here is the point: we give you a bunch of different tools, and, you know,
you're running this on replay, so you can do it multiple times with different
types of analysis. You can analyze a request. You can analyze over a time
interval. You can do different things to figure out what's going on in the
system.
This is something that we kind of rely on you as the admin to figure out. To
basically use the tools the best you can.
Okay. And yes, the timings are all collected online, so the analysis is not
going to perturb the timings. And then once we have the cost, so, for
instance, let's say we have a small block of code, we assign maybe ten
microseconds to it, and then we have a long block of code, maybe it has a
really long system call, and it has a cost of a hundred microseconds.
Then we determine why each of those ran. For instance, very simple case, maybe
the first one ran because of configuration option A, and the second one ran
because of B. We assign the cost to the root causes, and then we aggregate
over the entire execution and we rank.
What does this mean? It tells us that X-ray thinks B is a bigger contributor
to the performance problem than A. So if you're the admin and you want to see
why it was slow, go look at B first, because that's causing a larger
performance cost for you, and then go look at A. So maybe you can't remove B,
but this tells you why it was happening.
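Plugging the numbers from this example into the sketch above (the block and
option names are made up for illustration):

    # short block charged to option A (10 us), long block to option B (100 us)
    events = [("short_block", 10.0), ("long_block", 100.0)]
    causes = {"short_block": "option_A", "long_block": "option_B"}
    print(summarize(events, causes.get))
    # [('option_B', 100.0), ('option_A', 10.0)]  ->  B ranks above A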
Okay. So as I mentioned, X-ray also gives you this powerful tool where you can
compare two different requests and see why the performance of these two
requests differs. We call it differential performance summarization. Here's
how it works. We have two requests. We extract the execution pieces of both
of them, the way that I mentioned. And then we compare them and find the
points where the executions diverge. We call them divergence points.
And then what we do is calculate the cost for each part of the execution, and
the difference is the cost of the divergence point. Then we basically do the
same thing: we find out why the divergence point happened. If it's, I don't
know, maybe an if conditional, one of them took the if part and one of them
took the else part, and that is the difference in cost. We assign the cost to
the root cause, and we do that for all the divergence points.
Finally, we give you the list. It tells you that A is the biggest contributor
to the divergence between the two requests. Not necessarily to the performance
of each one, but to the divergence between the two.
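A rough sketch of differential summarization, under a simplifying assumption:
the two executions are given as already-aligned lists of (branch, cost) steps,
so a divergence is just a step where the branches differ. X-ray's real
alignment of execution pieces is more involved than this.

    from collections import defaultdict

    def differential_summary(exec_a, exec_b, attribute_root_cause):
        # exec_a, exec_b: aligned lists of (branch_id, cost) steps
        totals = defaultdict(float)
        for (br_a, cost_a), (br_b, cost_b) in zip(exec_a, exec_b):
            if br_a != br_b:                  # a divergence point
                delta = abs(cost_a - cost_b)  # cost of the divergence
                cause = attribute_root_cause(br_a, br_b)
                totals[cause] += delta        # charge the difference
        # biggest contributors to the divergence between the two requests
        return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)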
So now, you might ask, okay, I have thousands of requests, how do I know which
two? It's a hard thing to do. So we decided to do that for multiple requests
as well, so you can tell us, okay, I have hundreds of requests. Tell me why
these are having different performance, what is causing the difference, and
what is the cost.
So what we do is that we kind of compare all of them. We find the shortest
path from the beginning to the end, and note that the shortest path is not
necessarily a single request; it's going to be a combination of some of the
requests. And then we find all of the divergences from the shortest path, we
determine the root causes, and we basically give you a visual kind of
explanation: these are the divergence points, these are the causes, yes.
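For the multi-request case, here is a sketch under a strong simplification:
every request is a list of per-step costs of the same length, so the "shortest
path" is just the per-step minimum over all requests, possibly mixing steps
from different requests, as in the talk.

    def shortest_path(requests):
        # per-step minimum cost; may combine steps of different requests
        return [min(costs) for costs in zip(*requests)]

    def divergences(requests):
        base = shortest_path(requests)
        # for each request, the steps where it costs more than the base path
        return [[(i, r[i] - base[i])
                 for i in range(len(base)) if r[i] > base[i]]
                for r in requests]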
>>: Divergence points, are you assuming that [indiscernible] in control flow?
Because you have B previously on the slide.
>> Mona Attariyan: Good point. Give me -- what do you have in mind?
>>: For example, here's a request that comes in that would apply to a file
named A, but it's on a LAN, and a file named B, but it's at Iron Mountain in
Utah.
>> Mona Attariyan: Yeah, yeah. So there are cases, especially when you get to
the system call part, where, because we don't follow that part, the input to
the system call can cause divergence outside of it. That we currently don't
follow, but we should; we have that as a future kind of direction.
But because we don't follow the kernel -- if we were following the kernel, we
would see that eventually as a divergence in control.
>>: There could be, like, you know, the network is congested. That's why it's
taking longer. Would you capture that?
>> Mona Attariyan: We are trying to look at the configuration reasons. Of
course, we can have, you know, my disk is slow because my disk is broken, or my
network is slow. So what we are kind of expecting here is that you as the
admin look at the different potential problems: my hardware not being correct,
my network being slow, or I'm having a configuration problem.
The good thing is that for finding out that, for instance, you know, my
hardware is broken or my network is congested for some reason, there are many
good tools that allow you to explore that and find out about it. We tried to
focus on the configuration side, where we thought there are not that many good
tools.
>>: Do you compare to common performance profiling tools, the kind that tells
you this function is [indiscernible] 20 percent of the time, and so on for
other functions? What's the advantage your tool can provide compared to that
kind of performance profiling tool?
>> Mona Attariyan: That's a great question. Our tool gives you a much higher
level idea of what you can do. So say I'm a user, I'm using an application,
and you tell me that this function is called a lot of times. I can't do
anything. If I'm not a developer, if I'm not looking at the source code,
telling me that this function specifically is running a lot is not helping me
toward the overall, you know, solution of the problem.
What we are trying to give you here is, okay, this option that you can go and
change is causing you trouble. See what I'm saying? So if you're the
developer, that might be a good thing, because then you can go to that option,
that function, and do something. But if you're just using it, giving you a low
level detail of what is going on is going to be useless to you as the user.
>>: I'll wait to see your evaluation.
>> Mona Attariyan: Sure. That goes right here. So this is actually a work in
progress; we're still doing some more evaluation, but these are the
preliminary results. We did Apache, Postfix and PostgreSQL. We found 14 test
cases of performance problems that people found online and, you know,
reported. We recreated them and we ran X-ray. In 12 cases, the first option
that X-ray returned was actually the biggest contributor to the performance
problem. In two cases, it was the third option that it returned.
Yes, Andrew?
>>: Give me an example of what a test case is?
>> Mona Attariyan: Sure. Okay. So let me give you an example, maybe the
PostgreSQL one. PostgreSQL, for instance, has a write-ahead log: as it does
transactions, it writes them into a log and then later commits. And then it
basically does snapshots, or checkpoints, of the log so that if you crash, you
can come back. So if your system is under a lot of load and you do a lot of
checkpoints, it's going to put even more load on your disk.
So the problem that that person described was, you know, my disk is under a
lot of load. And then people suggested, okay, go look at how frequently
you're doing your checkpoints. And the person came back and said, okay, maybe
I'm doing checkpoints too often, or something like that.
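For reference, checkpoint frequency in PostgreSQL of that era was governed by
options like these in postgresql.conf; the values shown are the documented
defaults of the time, not the ones from the reported problem:

    # postgresql.conf (illustrative; defaults, not the reported values)
    checkpoint_segments = 3     # WAL segments between automatic checkpoints
    checkpoint_timeout = 5min   # maximum time between automatic checkpoints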
>>: In the case of a thing like that, you would say here's one request that
went normally. Here's one request that went really slowly because it had to
make a checkpoint while it was doing the request?
>> Mona Attariyan: Okay. So let me first mention that of these 14 cases, for
some of them we did per-request analysis, for some of them we did time
interval analysis, and for some of them we did comparison. For the one that I
just described, we did the time interval analysis, where we said, okay, we
look from this minute to this minute, and then we saw that there's a lot of
disk usage, and then we see that the checkpoint interval option that is in the
PostgreSQL configuration file is causing a lot of that disk usage.
Now, for Postfix and Apache, we had cases where we looked at requests
specifically. For instance, for Apache, there was this request that was
specifically long, and then we figured out that it was doing, you know, extra
DNS lookups.
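The transcript doesn't name the Apache option, but one classic directive that
triggers a reverse DNS lookup for every request is HostnameLookups in
httpd.conf; it is shown here only as a plausible example of this failure mode:

    # httpd.conf (illustrative)
    HostnameLookups On    # forces a reverse DNS lookup per request, for logs
    # the default, HostnameLookups Off, avoids the per-request lookup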
>>: How does the user interact with this information?
>> Mona Attariyan: With the result, or how does it do the actual analysis?
>>: Like, if you run your tool over a time interval and something is
[indiscernible] disk access, what do I as the user give to the system, or what
does it give back to you?
>> Mona Attariyan: So you tell the system, I want to look at disk over this
time interval. And the system gives you --
>>: This is disk access.
>> Mona Attariyan: So as a user, you'd say, okay, my disk seems to have a lot
of load, my network seems to have a lot of load. So the thing is, you detect
a problem as a user first, and then you tell us what you want to look at. You
can, of course, say, okay, over this time interval, tell me about my disk,
tell me about my network, tell me about my latency. You can do that as well.
The good thing is you can run multiple times on the replay and you're all
fine.
But as the user, you need to first detect a problem and then try to diagnose
the problem. The point is that we don't tell you, okay, your disk is doing
that.
>>: You root caused it back to a configuration file?
>> Mona Attariyan: Yes. It might be that your disk is broken. Then, you
know, you're not really looking at a configuration problem. But if you are,
then it tells you.
Okay. So we have a few minutes. I'm going to talk about some of the future
directions that I would like to pursue. With software systems becoming more
and more complex, the problem of software reliability and troubleshooting
seems to just be getting more challenging. I believe that software
reliability is going to be one of the most important research topics in the
future, and I would very much like to pursue a couple of different directions
in this field.
More specifically, I like the problem of troubleshooting software that runs at
larger scale, and also troubleshooting software that runs on platforms with
limited hardware resources.
So, large scale analysis. Today, we have software that runs at scales larger
than ever. We have very complicated distributed systems, and troubleshooting
is especially difficult in these environments. Even before you get to
diagnosis and a solution, you need to detect, as I was explaining in answer to
Andrew's previous question, you need to detect that the problem exists. And
detecting abnormality is not very straightforward in these cases.
Usually, today, it's left to the admin, and the way they do it is that they
basically look at the logs and try to see if they find any abnormality. And
this is really difficult, because they keep the logs to a minimum. So the
question I'd like to answer is, is it possible to automatically find these
kinds of abnormalities in the system? And once you find them, maybe you can
collect more diagnostic information, and then you can do better
troubleshooting analysis in the future.
I would also like to look at troubleshooting for software that runs on
platforms with limited hardware. Mobile computing is bigger than ever; we
have smart appliances everywhere. These platforms run very complex
applications, but they are still very limited in terms of computational
resources and in terms of energy and battery life. So when we're designing
troubleshooting solutions for these kinds of environments, we should take into
consideration all the constraints that they have.
So, for instance, is it possible to maybe offload some of this troubleshooting
to the cloud in a safe, secure and efficient manner, so that we're not using
that much of the precious resources that we have on the platform?
And also, for desktop computers: we have done a good job of making our
applications more user friendly, but when it comes to troubleshooting, we
still have a long way to go. And for desktop computers, any impact that we
have on troubleshooting is going to be huge, simply because of the number of
people who are going to be affected by it.
I have a bunch of ideas about what we can do to make troubleshooting easier on
desktop computers, and I'm going to share two of them with you. The first one
is that configuration state is usually shared; for instance, you have the
Windows registry, things like that. And when you configure one application,
that means you might be breaking another application.
So is it possible to detect that perhaps this thing that you're doing is going
to break something else, and then let the user know, so they're aware of the
consequences of the actions that they're taking?
Another idea is that, you know, usually when you're configuring something new,
a new feature, you might need to change and modify multiple different things.
What people usually do is they do half of it, and then there's a problem at
the end. Is it possible to automatically figure out all the configuration
options that you need to change at the same time, and then tell the user so
they can configure the system correctly?
All right. So, conclusion. Problems, unfortunately, are inevitable in complex
software systems. I showed you that misconfigurations are the dominant cause
of problems in deployed systems these days, and I showed you that execution
analysis can greatly improve diagnosis of these kinds of problems.
I talked about ConfAid and X-ray. They both use dynamic information flow
analysis to do this, and I showed you that they can actually be pretty
successful. And that concludes my talk, and I'd be happy to take more
questions.
>>: I think we have time for one more question. All right.