>> Chris Hawblitzel: All right. Well, it's my pleasure to welcome Mona
Attariyan here. Mona is getting her Ph.D. at the University of Michigan, working under Jason Flinn. And many of you probably know Mona already from her
recent internship here. So I'll let Mona go ahead and start her talk and tell
us about troubleshooting and information flow.
>> Mona Attariyan: Okay. Thank you very much. It's a pleasure to be here.
It's great to see all these familiar faces. So today, I'm going to talk about
software configuration troubleshooting, and I'm going to tell you how information flow analysis may be used to improve this problem.
So software systems are very complex. I want my software to run faster, to give me more features. Most importantly, I want to be able to personalize my software. So we're constantly pushing our software to be faster and bigger and better, and that has made our software fundamentally complex.
So now the problem is that when something goes wrong, the troubleshooting
becomes very difficult. So the troubleshooting, it's very time consuming.
It's very tedious. It usually requires a lot of expertise and it's also very
costly to corporations. So let's see what causes software to have problems in
the first place.
So here, I'll show you a study that was published in 1985. It's the classic study by Gray, and he looked at the root causes of outages in
a commercial system. So I want to draw your attention to this big red part.
About 42% of the outages were due to administration problems and that's
basically configuration maintenance.
So 26 years later, this is another study that was published just last year, SOSP 2011. And again, they looked at the root causes of severe problems reported by the customers of a company that provides storage systems, and again, you can see that the biggest root cause is configuration, about 31%. So there are many other studies during these years that actually suggest the same
idea, that misconfigurations are the dominant cause of problems in deployed
systems. And these are severe problems. These are problems that lead to
performance degradation. They leave your system partially or fully
unavailable.
Let me give you two more stories on the impact of misconfiguration problems.
Facebook went down a couple years ago, went down for about two and a half
hours, and it wasn't reachable to millions of users. So many people didn't
know how to waste their time any more. The problem turned out to be an
incorrect configuration value that got propagated.
Another story. The entire .se domain for the country of Sweden went down for about
one hour. It affected thousands of hosts and millions of users and the problem
turned out to be a DNS misconfiguration. So these kind of problems happen a
lot in software systems. And when they happen, they're very difficult to fix,
and as I mentioned, they're also very costly. So reports show that, for
instance, technical support contributes about 17% of the total cost of
ownership for today's desktop computers, and the majority of that is just troubleshooting configuration problems. And there are many other studies that
suggest the same result.
So in the past couple of years, I've been focusing on the problem of improving
configuration troubleshooting. I broke it down into several different projects and have developed several different tools. I worked on a project called AutoBash,
it was published in SOSP '07, and it basically provides a set of tools to the
users to help them fix their configuration problems more easily.
I developed a tool called SigConf, that diagnoses misconfigurations using a bug
database, and my most recent tools are ConfAid and X-ray, and they both
diagnose misconfigurations that come from configuration files. ConfAid does
that for misconfigurations that lead to failures and incorrect output and X-ray
does that for performance misconfigurations. And I'm going to talk about these
two today.
All right. So the goal of my research is to help two types of users. First is
end users who might just be having a problem with an application on their
personal computer. And the second is the administrators who might be in charge
of maintaining the system. These users don't necessarily -- you know, necessarily have access to the source code of the application, or they might
just simply not be interested in looking at the source code, or might not have
the expertise to look at the source code.
So what happens is when they have a problem, they usually try different things
that they know, try to fix a problem on their own, and if that doesn't help,
the next step is usually to go online, look at the forums and look at the
manuals and try to see if other people have similar problems. Basically, using a trial and error process, they would just try a bunch of different things that people suggested and see if that helps, if that fixes the problem. Otherwise,
they'll try something else.
I personally find this to be very frustrating and very tedious, and I think
this is why people hate computers so much, because when something goes wrong,
it's just so difficult to fix it.
So wouldn't it be great if we had a fix-it button? And what it does is that
when your program's not working, you just say fix it, and it would magically
start working. So we're not there yet. I'm not going to give you that. But
we can still do much better than we are doing right now.
So how about I give you an easy button? And what it does is that when your
program's not working, it would give you a list of potential root causes. And
what if the first couple of root causes that it gives you are actually accurate and actually correct? That's exactly what ConfAid and
X-ray do. ConfAid gives you this list for misconfigurations that lead to
failures. And X-ray does that for misconfigurations that lead to performance
problems. Yes?
>>: Is this equivalent to when you type the problem into a search engine and
you get a list of root causes?
>> Mona Attariyan: Yes, we have to go through, and most of the time it's, you know, you read this, it's completely incorrect. You try it, your system is even worse than what it was --
>>: This can be incorrect too, right?
>> Mona Attariyan: I'll show you how we ranked them.
>>: Have you ranked them statistically or --
>> Mona Attariyan: It's better than Google result.
>>: Are you going to prove that?
>> Mona Attariyan: Practically.
>>: Okay.
>> Mona Attariyan: But the thing is that, you know, the problem with searching for a problem like that is that it's all, you know, it's all in English. You
basically come up with a description of your problem, and then hopefully
someone else described the problem the same way that you did and hopefully it
will show up. You know, most of the time you just read the forum to the end and nobody really came up with a solution.
So it's a lot, it's a lot more difficult in that sense. What we try to do is to, you know, give you: okay, this one and this one. Go look at these and these, and hopefully those are the answers.
So okay. So let me tell you a little bit about the core idea behind ConfAid
and X-ray. Here is our observation. We have a program. It reads from
configuration sources. It does something, and then it generates an outcome
that's incorrect. So the program is like a black box; we don't know how it uses the configuration sources to generate the outcome.
However, the application itself knows how it got there. So if we open up this
black box and if we analyze how the application is running -- yes, go ahead.
>>: Typically, programs have other dependencies besides their own
configuration. Are you going to take those into account or not?
>> Mona Attariyan: What do you mean? Do you mean like the input or --
>>: I mean things like DNS on the computer, for instance, or interaction -- somebody using -- a program using some other libraries and those libraries not being compatible. So it's not that program's configuration. It's compatibility issues between programs.
>> Mona Attariyan: That's a good point. So in general, most of that actually goes under the general term of configuration. Like using different libraries, maybe.
So that is part of the configuration. I don't specifically look at the problem
of not, you know, using the wrong [indiscernible] or not being compatible. But
in general, that falls into the category of configuration. So we just say, okay, this is the configuration of the system and this is the actual input of the system.
So these are the main two inputs that go into the system. I look specifically
at the configuration of the application, but that's definitely another source
of, I'd say, configuration again.
Okay. So if we open up this black box and we look at how the application runs,
how it uses these sources and how it generates -- uses these sources to
generate the outcome, then we might be able to infer which one of these
configuration sources are causing the outcome to be incorrect. So basically,
you know, if you analyze the program as it runs, then we might be able to infer
what's going on.
So that's the main idea behind both ConfAid and X-ray. Okay. So this is the
outline of my talk. I'm going to first talk about ConfAid. I'm going to give
you some details on the algorithms that we use to do the analysis that I just
described. And I'm going to talk about some of the heuristics we use to make
it more practical, and then I'm going to switch gears and talk about X-ray and
then I'll spend some time and talk about the research directions I would like
to pursue in the future and I'll conclude.
So let's say we have an application. It reads something from a configuration file [indiscernible] and an error happens. We would like to know what
parameter in the configuration file is causing the error to happen at the end.
So let's take a look at a very simple piece of code. The application reads the token.
The token is equal to ExecCGI, for instance, in this case. Therefore, the
variable ExecCGI is going to be set to 1. Later on, because the variable
ExecCGI is equal to 1, the error happens. So as you can see, there are causal relationships in the execution that basically connect the ExecCGI variable to the
error that's happening at the end. And ConfAid is basically interested in
these kind of causal relationships.
And we use taint tracking, which is a common technique used in security to find
these causal relationships. So here's what we do. Whenever a token, for
instance, here ExecCGI, is read into the application, we assign a specific
taint or mark to that. And then as the application runs, we use data flow and
control flow to propagate this taint. When we get to the error, we can use
this taint to link it back to the configuration token that caused it.
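To make this concrete, here is a minimal sketch in Python of the causal chain just described. All names are illustrative, and ConfAid itself performs this analysis on binaries via dynamic instrumentation rather than on source code.

```python
# Minimal taint-tracking sketch: follow a config token to an error.
# Illustrative only; ConfAid works on binaries, not source.

taint = {}  # variable name -> set of config tokens it depends on

def read_config_token(name, value):
    """Reading a token assigns it a unique taint mark."""
    taint[name] = {name}
    return value

# 1. The application reads the token "ExecCGI" from its config file.
exec_cgi = read_config_token("ExecCGI", 1)

# 2. Data flow: a variable computed from exec_cgi inherits its taint.
option_flags = exec_cgi
taint["option_flags"] = taint["ExecCGI"]

# 3. Control flow: an error raised under a tainted branch condition
#    is linked back to the tokens that taint the condition.
if option_flags == 1:
    print("error! root-cause candidates:", taint["option_flags"])
```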
So the goal here is to avoid the error that you're seeing at the end and also
not lead to new errors. So the goal is to find a successful path. So here is
a simple example.
We have an application. It has three if conditionals. Each of them is
dependent on a configuration parameter. And we have an error that's happening
at the end. We would like to know which one of these configuration parameters,
blue or red or green, can be the root cause of the problem that you're seeing
at the end.
So the blue option cannot be the root cause, because even if you change that, if you manage to get the application to take the other path, it would still merge before the error. So we get to the same error at the end. So it cannot
be the root cause. The red option cannot be the root cause either, because if
you change it and you get the application to go the other way, you would avoid
the original error, but then you would lead to new errors on the other path.
So the red option cannot be the root cause either.
The green option, however, can potentially be the root cause, because if you
change it, you would avoid the original error, and then it seems to be
successfully continuing and not leading to new errors. And that's exactly what
ConfAid returns: a list of configuration options that, if changed, would avoid the original error and wouldn't themselves lead to new errors.
All right. So now I'm going to talk about the algorithms that we use for taint
tracking. Yes, please?
>>: Question. You said green might be the root cause, but couldn't it be like a combination of red plus green that would lead to the path?
>> Mona Attariyan: Yeah, so this is a very simple case. For instance, let's
say if you have an option here that's dependent on both of them, of course.
But this is a very simple case where you're assuming that this one is only
dependent on green. Of course, you can have cases where it's green and red and
we can tell you that. But this is just very simplified.
>>: But you [indiscernible] to go right on the branch, right, in the second?
>> Mona Attariyan: Yeah.
>>: If it goes left, then you wouldn't trigger it, right?
>> Mona Attariyan: Yeah, the idea is if you change red and it goes left, then it's not good, because you'd see a new problem. And that's not what you want. So if you change red, then you won't see that error that you were seeing, but you see a new error, and that's not good either.
>>: [indiscernible] say A has three values and there's a third branch that doesn't lead to the error.
>> Mona Attariyan: So if it's possible that it doesn't go here, then we consider that.
>>: Okay.
>> Mona Attariyan: This is a case where we know it would go there.
>>: Okay.
>> Mona Attariyan: Okay. So I'm going to talk about the algorithms that we
use for taint tracking now. So before I get to the details, I'm going to talk
about why we decided to do taint tracking. So information flow analysis in
general can be implemented via multiple different ways. You can do it
statically. You can do it dynamically. You can use symbolic execution, just
to name a few. Why did we decide to go with taint tracking?
So we had several design principles in mind that kind of led us to this decision. First of all, we thought that a practical tool cannot rely on the
source code of the application simply because for many of the applications that
we use every day, we don't have the source code. So it has to rely only on the
binary.
The second point is that a practical tool has to be able to analyze complex
applications. Because these are the kind of applications that we usually have
problems with, so we have to be able to analyze complex data structures, and it has to support multithreading, inter-process communication and things like that.
And the third point is that we need to have reasonable performance. So the good thing about troubleshooting is that we are competing against humans, so we
don't have to be extremely fast. You probably won't mind waiting a minute or
two for the troubleshooting to finish, but you probably do mind waiting 20
hours. You might as well just go ahead and Google, as Ed said, and find the
answer. So we need to have reasonable performance.
Other implementations of information flow analysis that we had at the time fall short in meeting at least one of these criteria. So we decided to go with
taint tracking. And taint tracking, as I mentioned, is actually pretty popular
in security.
So here I want to suggest that it actually might be a better fit for the troubleshooting problem compared to security. And here are a couple of
reasons.
First of all, our environment is not adversarial. The developer of the
application, unlike the hacker, does not have an incentive to bypass our
system. At worst, they're going to be agnostic to your system. And at best,
they're going to write the program in such a way that lends itself better to
the type of heuristics that we use.
And also, I think that performance is probably less of an issue for us because,
you know, again, a couple minutes might be okay for troubleshooting. But if
you have to wait a couple minutes every time you want to load the web page,
that's probably a problem.
So for these reasons, I think taint tracking might be a better fit for our problem compared to security. Security people, if they object, they can -- okay.
So let's get to details here. So let me introduce a notation. TF(X) is the taint set of variable X, and it includes all the configuration tokens that, if changed, might change the value of X. And I used colored triangles here to show different configuration tokens.
Taint propagates via data flow and control flow. Here's a simple example of
data flow. We have X equals Y plus Z. The taint of Y is the red and blue configuration tokens and the taint of Z is green and blue. So, of course, if any of these tokens change, potentially the value of X might change as well. So the taint of X is going to have all of them, the union of the two sets.
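As a rough sketch of that data-flow rule, assuming one taint set per variable (the real system tracks taint per register and memory location):

```python
# Data-flow propagation: for X = Y + Z, TF(X) = TF(Y) | TF(Z).
tf_y = {"red", "blue"}    # config tokens that could change Y
tf_z = {"green", "blue"}  # config tokens that could change Z

# Changing any token in either set may change X.
tf_x = tf_y | tf_z
print(tf_x)  # {'red', 'blue', 'green'}
```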
Taint also propagates via control flow. And many of the systems that implement taint tracking actually ignore control flow because it's expensive and it's more difficult. However, we realized that for our purposes it's actually pretty essential, because most of our taint gets propagated via control flow.
So here's a simple example. We have, you know, a condition that's tainted, here's C. And we would like to know what could cause the value of X to be different at the end. So, of course, the value of X could be different because A is different, and that's via data flow, as I just explained.
The value of X could be different because of the value of C. Because if you
change C, you might get the program to not run this, and then therefore, the
value of X could be different.
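A minimal sketch of that control-flow rule, using source-level names for clarity (the actual analysis happens at the instruction level): a variable assigned under a tainted branch picks up the branch condition's taint in addition to data-flow taint.

```python
# Control-flow propagation sketch for: if C: X = A
# X depends on A (data flow) and on C (control flow), because
# changing C could make the program skip this assignment.

tf_a = {"green"}   # taint of A
tf_c = {"red"}     # taint of the branch condition C

def assign_under_branch(rhs_taint, branch_taint):
    """Taint of the assigned variable = data flow | control flow."""
    return rhs_taint | branch_taint

tf_x = assign_under_branch(tf_a, tf_c)
print(tf_x)  # {'green', 'red'}
```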
There's another subtle way that we could change the value of X, and that's by changing the previous value of X and at the same time changing C to make the application take the else part, this other part. But note that both of these need to change at the same time to give us a
different value for X. So let me introduce to you our first heuristic.
ConfAid currently does not follow joint configuration -- joint root causes.
Basically, it won't tell you that these two need to change at the same time.
It will tell you that these are both potential root causes, but it won't tell
you that you have to change them at the same time so we're basically not
following this final term.
>>:
By which you mean that will just be a blue triangle?
>> Mona Attariyan: No, we would just not have that. We basically say that,
okay, either value of X is going to change because of A or we would change C
and the value of X is going to be what it was before. So if the actual root
cause is the blue triangle what you do is that first you'll see red and green.
You change green and then you'd see blue in the next round, because you would
then run this and then the blue would show up in the next round.
So you might have to run your application multiple times to see all the
possible root causes. It really depends on the structure of the program,
though.
Okay. There is another subtle way that taint can propagate via control flow, and that's via the code that doesn't run. So let's take a look at this example. Here C is tainted and the application takes [indiscernible]. The
value of Y is technically dependent on the value of C, though. Because if the
value of C is different, then the application could potentially take the else
part and that would change the value of Y.
ConfAid is interested in finding these kinds of dependencies as well. However, as I explained, ConfAid does the analysis as the application runs, and the application doesn't run the else part, so how do we find out about these kinds of dependencies?
So here's what we do. When we run the application, if we see a conditional that's tainted, we take a checkpoint, we flip the conditional, we make the application artificially go the other way, we run it and we find out about assignments like Y equals A on the -- we call this the alternate path. When it merges, we basically roll back everything we did, restore the checkpoint, and we continue.
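Here is a small, self-contained sketch of that checkpoint-flip-roll-back procedure for a single tainted branch. The simulated program and names are invented, and as described next, the real system also bounds the alternate path by an instruction budget.

```python
import copy

def taint_tainted_branch(state, taint, c_value):
    """Simulates: if C: Y = 1 else: Y = 2, where C is tainted."""
    checkpoint = (copy.deepcopy(state), copy.deepcopy(taint))

    # 1. Flip the conditional and run the path that did NOT execute,
    #    to discover assignments like "Y = ..." on the alternate path.
    if not c_value:
        state["Y"] = 1
    else:
        state["Y"] = 2
    # A variable assigned there depends on C via control flow.
    alternate_taint = taint.get("Y", set()) | taint["C"]

    # 2. Roll back everything the alternate path did...
    state, taint = checkpoint
    # ...but keep the control-flow dependency we learned.
    taint["Y"] = taint.get("Y", set()) | alternate_taint

    # 3. Continue on the real path.
    state["Y"] = 1 if c_value else 2
    return state, taint

print(taint_tainted_branch({}, {"C": {"opt"}}, True))
# -> ({'Y': 1}, {'C': {'opt'}, 'Y': {'opt'}})
```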
So let me tell you about our second heuristic. We only run the alternate path up to a certain threshold, and that's basically to prevent ConfAid from being stuck
in a really long alternate path. So what happens is that ConfAid runs the
alternate path. If it hits the maximum number of instructions, it will just
say okay, I couldn't see the merge point. I've run enough. I'm just going to
roll back and continue.
So this may cause false positives and false negatives, but it has big performance gains for us. So we decided to do this.
All right. So usually at this point, people ask, well, how about false positives? Do we get a lot of false positives? Don't you see a case where the error is somehow dependent on all configuration options? The answer is actually yes, we did see something like that. And the problem was that we basically treated all kinds of taint propagation equally. So data flow was basically equal to control flow, and also we treated taint like a binary value. A variable is either tainted by an option or not tainted by an option.
And we realized that that's not actually sufficient. For instance, we saw that the conditionals that are closer to the error are usually much more relevant to the error compared to the conditionals that are very far from the error. And also, we saw that data flow, as I introduced it, is a stronger dependency compared to control flow. However, we couldn't capture this with the regular taint analysis.
So we introduced our next heuristic, which we call the weighting heuristic. And what it does is that it assigns weights to the taints as they propagate, and these weights are basically numerical values that indicate how strong ConfAid thinks that taint is.
So the way that we assign these weights is that the conditional that's closer to the error gets to propagate a bigger weight compared to the conditional that's farther away, and the taint that's coming from a data flow is going to have a bigger weight compared to the one that's coming from a control flow.
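A rough sketch of the weighting idea; the decay constants below are invented for illustration, not ConfAid's actual values. Taint carries a numeric weight that is attenuated more by control flow than by data flow, so a token reaching the error through many distant conditionals ends up with a small weight.

```python
# Weighted taint sketch: taint is a map {config_option: weight}.
DATA_FLOW_DECAY = 1.0      # data flow keeps the weight (assumed)
CONTROL_FLOW_DECAY = 0.5   # control flow weakens it (assumed)

def propagate(taint, decay):
    return {opt: w * decay for opt, w in taint.items()}

def merge(*taints):
    out = {}
    for t in taints:
        for opt, w in t.items():
            out[opt] = max(out.get(opt, 0.0), w)
    return out

# Error tainted via one data-flow step from option A, and via two
# control-flow steps (two conditionals) from option B:
a = propagate({"A": 1.0}, DATA_FLOW_DECAY)
b = propagate(propagate({"B": 1.0}, CONTROL_FLOW_DECAY),
              CONTROL_FLOW_DECAY)

# Rank root causes by weight: A (1.0) outranks B (0.25).
for opt, w in sorted(merge(a, b).items(), key=lambda kv: -kv[1]):
    print(opt, w)
```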
So now, with these weights, ConfAid's able to actually rank the root causes for you. That's how we get the ranked list. The configuration options that get higher weights are ranked first, and then it goes from there. So that's how we get the ranked list. Yes?
>>: So now that there are false positives, how do you deal with tainted pointers?
>> Mona Attariyan: What do you mean, like a tainted pointer to like a --
>>: [indiscernible] but not the content is tainted. So there are two ways to deal with this, right. So for some [indiscernible] looks up stuff, if the index is tainted, then, you know, although the content is not tainted, for some applications you treat that content as if the taint should propagate. For some other cases, that would cause a lot of false positives so you don't want to do that.
>> Mona Attariyan: Currently, we do. If the address is tainted, we take the taint. Currently, that's how we do it.
>>: You take the taint?
>> Mona Attariyan: We do take the taint. So we basically say if your address
is tainted, whatever you're taking is going to have that taint as well. So,
for instance, if, let's say, you are looking at, you know, you're traversing an array and the index is tainted, that's going to taint whatever you read.
>>: Right, but if the base of the table is tainted, right, the address, right, then you may cause a lot of false positives, right?
>> Mona Attariyan: Yes, but if you don't do that -- if you don't do that, it
causes false negatives. So what we did was that we basically felt that, okay,
if we can deal with the false positive part, we better do that than have a list
that does not contain the actual root cause. So we actually had cases where we
didn't do that and we saw false negatives so we just decided to do that.
It causes false positives, but the good thing is that if you're doing the weighting, it might just fade away, and that means it just disappears.
>>: But in that case, does the weight help to reduce false positives caused by putting this address --
>> Mona Attariyan: The taint from the address?
>>: Right.
>> Mona Attariyan: So we did it, and then we did the weighting, so I'm not quite sure which one of the cases, if we didn't have the weight, would cause that false positive because of the address. I don't know specifically which one of the cases would be worse. But if we don't have the weighting, we're going to have a lot of false positives in general.
Okay. So the analysis that I just described is actually pretty expensive. The slowdown is in the order of two, you know, two orders of magnitude. So it's actually a pretty expensive analysis. That might be okay for, you know, the
application maybe if you're just running your application on a desktop and
every now and then you need to troubleshoot. But it's not okay if you want to
troubleshoot maybe a server in a production environment.
And also, sometimes the symptoms of the problem are time dependent. So if you
are perturbing the timing a lot, you might not see the symptoms again.
Symptoms might change. And also, we are kind of relying here on the user to reproduce the problem for us. So you see a problem, and now you want to analyze it. You reproduce the problem and then we would analyze it for you.
However, some problems, especially performance problems, are really hard to reproduce. You might not be able to, you know, right away create it again. To
address all of these problems, we decided to develop a very lightweight
deterministic record and replay system and we run the applications all on top
of this. This is all internal, and the deterministic record and replay system
basically, what it does is that when the application is running, it records all the non-deterministic events. For instance, return values of system calls, signals and all the timing information. And it records all of that in a log and later
we use this log to recreate the exact same execution and then we run all the
analysis on the replay of the execution.
So basically, we get rid of all this overhead on the online system. So as I mentioned, we use a log; we replay the execution later and then we run the heavy analysis on the replay.
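A toy sketch of that record/replay idea, greatly simplified (the real system logs at the system call boundary and also preserves timing): record every nondeterministic result in a log online, then feed the same results back so the instrumented replay sees an identical execution.

```python
import random

def run(app, nondet, log=None, recording=True):
    """Record nondeterministic results, or replay them from a log."""
    if recording:
        log = []
        def source():
            v = nondet()      # e.g. a system call's return value
            log.append(v)     # record it in the log
            return v
    else:
        it = iter(log)
        def source():
            return next(it)   # replay the recorded value
    return app(source), log

def app(source):
    # Stand-in for code consuming syscall results, signals, timing.
    return [source() for _ in range(3)]

out1, log = run(app, lambda: random.randint(0, 99), recording=True)
out2, _ = run(app, None, log=log, recording=False)
assert out1 == out2  # the replay recreates the exact same execution
```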
I'm not going to go into too much detail on this system, but the main difference between our system and other deterministic record and replay projects out there is in the fidelity of our replay.
So the fidelity of our replay needs to be strict enough to recreate the same execution as the recording. However, because we do analysis on the replay, we basically instrument the replay and then we do the analysis inside it. Our fidelity should be loose enough to allow this extra code to run within it. And we achieve this via careful code design. Our replay system is instrumentation [indiscernible]. It can differentiate between the replay code and the analysis code that we are running.
So I'm not going to talk too much about this. I'd be happy to talk to you offline about this if you're interested.
All right. So let me show you some results. So we used ConfAid to
troubleshoot three applications. OpenSSH, Apache web server and Postfix mail
server. We looked online, looked at the forums and manuals and found 18
misconfigurations that people reported for these three applications. We
recreated these and then we ran them in ConfAid to see if it could find out the
correct root cause.
And ConfAid was very successful. It correctly found the root cause, ranking it first or second in all of these cases. And these are the total
number of configuration options that were available in the configuration files.
So in 72% of the cases, the correct root cause was ranked first. In five
cases, it was ranked second, and we never ranked it worse than second. Yes?
>>: Would you talk a little bit more about the variety? You know, are these shallow configuration bugs, or bugs that are deeply nested in configuration files? And does that make a difference on the complexity for ConfAid?
>> Mona Attariyan: It does make a difference. I'm going to show you another
set of evaluation too right after this. These are mostly very deep
configuration problems. Where people actually tried a couple different things.
They couldn't figure out what it was and they actually posted in a forum,
waited a couple days.
So actually, when you look at it in the code, it's actually pretty deep. It usually takes a while.
I have another data set I'm going to show you after this that creates more
shallow cases. And definitely for the shallow ones, it's easier and, you know,
the results are actually going to show that as well.
And there was another question?
>>: So if you're a user using ConfAid, how do you specify the failure point?
>> Mona Attariyan: Oh, that's a very good question. So there are different
types of failure. There are some failures that are obvious like, you know, a
search or crash or something like that. There are failures that are not
obvious and we're relying on the user to tell us. For instance, you see a
message and you just don't like it. You tell us that this is an error. Or it might not even be a message like that. You might, you know, for instance, run Apache and get a file from Apache and just tell us that this is wrong.
So we basically rely on the user to tell us what is wrong and what is right.
Right now, we have a very simple way of doing that. So the user tells us: whenever you're printing something with this message on the screen, or you're giving me something with this content over the network, this is wrong.
>>: But, I mean, it may not be easy to use because you assume the user doesn't look at the source code and doesn't know the source code.
>> Mona Attariyan: You don't really require the source code for that, because
you only see the outcome. So you have a way of figuring out that something is
wrong. So you see a message, you say okay, this is an error to me. You see a
content that seems wrong to you.
So you just, you don't need to know about the source code. Just see what, you know, the application is printing or giving you. So if you just specify
that to us, that would be sufficient.
>>: [indiscernible] relying on the user is that multiple root causes might generate the same user-visible error.
>> Mona Attariyan: Sure. Sure.
>>: Then how does ConfAid know right from wrong --
>> Mona Attariyan: So we are looking at that specific execution that caused
that message. So, of course, there could be other ways that could lead to the
same problem. But we are analyzing the specific execution that happened. See
what I'm saying? So there might be multiple ways to get there. We're not
analyzing those. We're only analyzing the actual execution that you saw that
led to that error.
So we are recreating the error. We are using the record and replay to record
the exact same error and we are analyzing that execution path and then we see
which one of the options are affecting that execution path.
Does that answer your question?
>>: I think so. I think there are some gaps in understanding, but that's fine.
>> Mona Attariyan: Okay. Yes?
>>: When you were building the algorithm that prioritizes which configuration
settings, what was your training set like? Do you have bugs from these three
applications, or were they from different applications?
>> Mona Attariyan: We did not have bugs from -- so we tried OpenSSH first, and we saw that there were a lot of false positives. And then we saw that most of the time, the conditional that's closer is relevant, and sometimes we were reporting something that's very far. So we realized that that might be something that we should be looking at. And then we added that to the code and then we ran Apache and Postfix and later [indiscernible] and they all seemed to be pretty good afterwards.
>>: So you used these bugs. Did you use them to train?
>> Mona Attariyan: I used OpenSSH. So at first, we did OpenSSH. And then it ran fine for a couple of the bugs and then for a couple more, we saw a lot of false positives. And then we decided to fix the false positive problem. And then we introduced the weighting heuristic. But then afterwards --
>>: So you're training and testing on the same bugs?
>> Mona Attariyan: There's not much training. It just gave us the idea that maybe we should have a way of specifying which conditionals are more important, which taints are more important. We're not really using any statistical method to figure that out. We basically have this simple heuristic that says the conditional that's closer is more important.
>>: But you came up with a heuristic based on these bugs?
>> Mona Attariyan: Based on the OpenSSH files.
>>: Okay.
>> Mona Attariyan: Yes. And then we used that and we ran Apache and Postfix
and they both ran great and then we ran the other set that I'm going to show
you after this and they ran great.
>>: So you didn't change your heuristics at all after OpenSSH?
>> Mona Attariyan: No.
>>: You didn't touch anything after Apache or Postfix?
>> Mona Attariyan: No. And then we also did X-ray and that was fine too.
>>: Okay.
>>: So behind the [inaudible] therefore INS is completed, which has happened
to me. Does that show up as a token here, as a system call failure, or how
does that show up in your system?
>> Mona Attariyan: I think it's permissions. Is it kind of, you know, something close maybe to Unix file permissions?
>>: [indiscernible] create files [indiscernible].
>> Mona Attariyan: So here, I'm looking at the parameters in the configuration file so it won't show as a problem that I'm looking at, but this technique can be extended to also follow those kinds of configuration values as well. So it can be easily extended -- actually, we're doing it, extending it to also --
>>: [indiscernible] on a Windows machine, because the permissions check is in the kernel. It's not in my IIS process.
>> Mona Attariyan: Let me try to understand. So you're trying to read
something and it says that you don't have the permission, right? That's pretty
much enough for us, because we say, okay, the permission is wrong here. So for instance, for Unix, you know, you perform a system call. It gives you a code that says permission denied, something like that. So we just use that and say, okay, the permission was wrong for this file. So we don't really need to go into the kernel and see how exactly it does it. We just use the result.
>>: At this point, you didn't have the information flow from the actual permission setting on the file through to the failure of the call?
>> Mona Attariyan: No. We have it -- we just see the end of it that says, okay, you don't have permission. We certainly don't follow the kernel as well. When we have a system call, we don't go into the kernel.
>>: Thank you.
>> Mona Attariyan: Andrew?
So the object of this analysis is the name of the code figuration variable
18
that is incorrect. But do you know what the right value is, or do you know how
to fix the problem?
>> Mona Attariyan: We don't. So we basically tell you that these are the
options that are most likely causing your problem. We don't tell you how to
fix it.
>>: So is it easy, in these cases, like was it easy to fix the problem?
>> Mona Attariyan: So here's the thing. Here's why we don't tell you how to
fix it. Sometimes changing that option is not necessarily the right fix. For
instance, you see that, okay, I can't access this because of authentication
problem. You don't want to remove authentication. You don't want to change
authentication, right.
So we basically suggest to the user, this is causing your problem. It's up to
the user to decide whether they want to add something, whether they want to
change the value. Of course you can change it and run the tool again to see if
there are any new root causes. But we don't necessarily make that decision of
whether we should change it or not. Andrew.
>>: I guess another way [indiscernible] is that potentially, the configuration of the system is not just a configuration file, but the permissions on the file and --
>> Mona Attariyan: That's correct.
>>: How does this scale when the size of the configuration set gets to be huge?
>> Mona Attariyan: That's a very good question. You know, configuration in
general is, you know, actually pretty fuzzy. It can be a lot of things. As I mentioned earlier, you know, your libraries, any file, you know, in your system, the environment variables, all of these are considered to be configuration. And also, the configuration file itself can maybe be extremely
huge.
We didn't really try; you know, what we tried was on the order of a hundred configuration tokens. But I can imagine systems that might have thousands. How does it scale? I can't tell you for sure, because I didn't do it, but there is overhead in terms of the amount of memory that we use, first of all, and also, as you're running it, the way that we're doing it, we're copying taint for these configuration options over and over in memory. So as your taint set gets larger and larger, you need to do more when you're doing the analysis. Does that make sense?
>>: I guess my concern is that if everything in the system is configuration, which it potentially is, then everything is going to get tainted.
>> Mona Attariyan: So your application, everything in the system can be
configuration, but your application might not read everything in the system.
>>: Everything [indiscernible] reads is sort of totally external [indiscernible] the application -- potential sources of configuration.
>> Mona Attariyan: Sure. So your application might start with reading a lot
of things from the system. The good thing is that it doesn't use all of that
to go down a certain path. It might use all of that to go down all the paths
in its lifetime, but you're only looking at one single execution path, and it
doesn't use all of that for making decisions for one single execution path.
That's the good thing about our system is that when we are writing the code, we
are not using all of these little pieces. There are actually studies that show that usually, there's only one or two options that are causing a problem. It's not like ten million different options that are
causing your problem.
>>: So a follow-up question. So how big is your [indiscernible]? So is it like eight bits or 16 bits or 32 bits?
>> Mona Attariyan: So right now, we have one byte per configuration option that we're tracking. And this gets a little bit into too much detail, but the way that we do it is that it shows us the weight. So we have that much maximal weight. So each configuration option gets some bytes, and if you increase the number of configuration options for each variable, this is going to increase. But each configuration option gets one byte. So if a variable has, like, 50 different configuration options, that variable is going to have 50 bytes, one for each configuration option.
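A sketch of that representation, with illustrative sizes: each tainted variable or memory location carries one byte of weight per tracked configuration option, held in a sparse shadow map so untainted memory costs nothing.

```python
import array

NUM_OPTIONS = 50  # tracked configuration options (illustrative)

def new_taint():
    # One unsigned byte (0..255) of weight per option.
    return array.array("B", [0] * NUM_OPTIONS)

shadow = {}  # address -> taint bytes (sparse: most memory untainted)

t = new_taint()
t[7] = 200            # option #7 taints this location, weight 200
shadow[0x1000] = t
print(len(shadow[0x1000]), shadow[0x1000][7])  # 50 200
```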
>>: Is there a shadow memory that gives [indiscernible]?
>> Mona Attariyan: Yes. So the good thing is that not all your memory becomes
dependent on configuration options. So, for instance, you run Apache and at
the end maybe like 70K of it was dependent on configuration option. So you
kind of, there's a big overhead in terms of what we keep. But the good thing
is that not all of your memory needs to have that much overhead.
So now, I can imagine cases where all of a sudden you have giant pieces of memory become dependent on a lot of taint. We haven't seen that case yet, but what we need to do is kind of maybe get rid of some of this taint, maybe make it smaller, fade away some of the taint, maybe make a compromise to keep more. But there's a memory overhead.
>>: So you have a lookup table for each variable that points you to this
taint?
>> Mona Attariyan: Yes. It's a three-level lookup table, kind of like a page table type of structure. Yes?
>>: Do you use [indiscernible] or binary instrumentation to run the taint analysis?
>> Mona Attariyan: We do it dynamically if that's what you mean. We use
binary instrumentation and we add all the analysis as the binary runs.
>>: So it feels like this could be very useful for developers. Actually, another approach is to use [indiscernible] project, this kind of approach, to do fuzzing and to maybe focus on the configuration knobs and maybe those [indiscernible].
>> Mona Attariyan: Definitely. It can definitely be useful for developers,
although we try to kind of -- we try to not use the source code so it's also
useful for end users and others. But if you have the source code, it's actually
going to be much easier. So it can also be used by developers as well. Yes?
>>: So I was curious about the need to execute multiple times. So if there
are dependencies, you said you were going to maybe change some parameters?
>> Mona Attariyan: So let's say you have a -- let's say you have a problem where you need to change two things to get it fixed. Depending on the structure of the application, we might give you both of them. It really depends on how the application checks for them. We might give you both of
them, like first and second. And then you fix the first one. We don't tell
you you have to change them at the same time. You fix the first one. It
doesn't go away. You run again and you see the second one coming up and you
fix the second one.
There are cases where we might miss the second one. And you see the first one,
you change that, you run again, and then you see the second one. So we may or may not show you all the configuration options. It really depends on how the application checks for this.
So for instance, if you change this and now you go to the alternate path, we
kind of, when we were exploring the alternate path, we maybe aborted early, we
didn't see the second one. Then we might miss it like that.
>>: So I wonder, if you're going to sort of allow multiple runs in your experiment and observe the outcomes, then are there other approaches so you can -- and it's sort of analogous to SAGE, but you don't need symbolic execution. I mean, you have your input file or your configuration file and you have a hypothesis that some byte is the cause, so you change that to some other value and then you run again and you observe, and you do things like code coverage or [indiscernible], things that are very cheap approximations to taint. But instead of tracking taint, what you say is, hm, most likely, if there's a change to this one byte and I look at the code coverage before and after, the differences in the code coverage are only -- if everything else is deterministic, right, then if that's the only change I've made, then if I see these differences in the code coverage, those are likely very related to the change.
>> Mona Attariyan: So I think that approach works very well for fuzz testing, because you change something and you see it. Here, we don't know what to change, necessarily. You have a hundred different options; which one are you going to change? That's the problem here.
What we are trying to give you is that we tell you, okay, these three, maybe, are the most important ones. So maybe you want to go use something like that, like change them a little bit and see how it goes. But it narrows down what you need to look at a whole lot. Simply by --
>>: So you're saying it's sort of complementary?
>> Mona Attariyan: Yeah, once you do that, then you might be -- so let's say you want to fix your problem. You say that, okay, this option is my problem. You want to automatically fix it. Now maybe you can go change it, you know, bit by bit and see where it goes. Yes?
>>: Follow-up question. So since you know the kind of the failure, have you
done kind of backwards slicing to see what options are in that code to see
whether your thing is now -- can narrow that much farther than the simple
backwards slicing?
>> Mona Attariyan: So we didn't do backwards slicing. The main reason is that backwards slicing for really long execution paths is not very successful. So here, we have cases where it usually happens that you read configuration at the very beginning. You run a long path. Sometimes you go through processes, and then you get to the error.
For Postfix, for example, there are five processes before you get to the error.
So the configuration is in this process. The error is in another process. Backwards slicing cannot really go that far and kind of has problems going back for really long executions.
So that's one of the reasons that we decided to not do program slicing in the
first place.
>>:
[indiscernible] very interesting [indiscernible] that's my question.
>> Mona Attariyan: Okay. So this is the other data set that James wanted to see. So yeah, we used a tool called ConfErr. It was developed at EPFL. And what it does is it randomly generates bugs in configuration files that look like human errors. Why do you laugh, Andrew?
>>: Do you need a tool for that?
>> Mona Attariyan: I developed a tool. It's actually pretty good for testing. If you want to see whether your application fails horribly or dies gracefully if a configuration value is wrong, that's the tool that you use. So it was very useful for us, because then we generated 60 bugs using it
and we didn't change any of our heuristics, as Stuart asked, and we ran ConfAid
again. And in 55 cases, ConfAid was able to rank the correct root cause first
or second.
So again, in 85% of the cases, it was actually ranked first. And in 7%, it was ranked second. And there were five cases where we didn't rank well, worse than second. So three of these cases -- yes, Ed?
>>: Go ahead and finish.
>> Mona Attariyan: In three cases in Postfix, the correct root cause was a
missing configuration option, and that's something that ConfAid doesn't
currently support. So you had to add a new configuration option to fix the
problem. So that was the three Postfix ones. For the Apache one, the correct configuration option was ranked ninth, and that was a direct result of the weighting heuristic. And the OpenSSH one didn't finish. We needed some more support for a system call, so it didn't complete. Yes?
>>: [inaudible].
>> Mona Attariyan: So quickly, I'm going to show you some performance results. For the real world bugs, the average time is one minute and 32 seconds for the troubleshooting to finish. Again, I want to emphasize that this runs on the replay execution and not online. And for the randomly generated ones, it took 23 seconds.
Going back to Ed's question, these turned out to be shallower bugs, you know.
For instance, you had, you know, a configuration option that only accepted one
to ten, and you gave it 12. Of course, you know, right away it failed.
There were ones that were kind of deeper, but usually they were kind of easier.
The real world bugs turned out to be much harder.
Okay. So a kind of final note on ConfAid. People usually ask, you know, why is ConfAid successful? And here is my thought on it. So usually, the
configuration problems that we see, usually once you find the root cause, it's
kind of obvious. Most of the time, there is one or two configuration options
that are causing the problem and, of course, we can have a case where you have
20 different configuration options causing a problem, but that's very rare.
There are actually studies that are published that actually support this. That
usually one or two configuration options are causing a problem.
So yes?
>>: Quickly, this seems like a big claim. I wonder if you actually talked to
administrators or something to ask [indiscernible] and then they could actually
go and fix it?
>> Mona Attariyan: So you're asking whether the answers are correct, or whether the one or two configuration option claim is?
>>: The output is useful.
>> Mona Attariyan: Oh, okay. So I think the way that we evaluated whether the output is useful or not is just by looking at whether it was the correct configuration option that was causing the problem. Whether you go and change it and then that would fix the problem is a different question, I think.
So we recreated the problems and we saw that, okay, this is telling me that
this is your root cause and this is the correct root cause that it's telling
me. We actually use it a couple times for our problems as well. So we found
it useful. We didn't ask any administrators to use it, though. The problem
mostly was they didn't want to run our deterministic record and replay in their
kernel. So we need to convince them to do that.
But I think it is useful. In all the cases that we tried, we found it to be able to narrow down the options a lot. We found that very useful.
>>: I think the output, I mean [indiscernible], like, say the output up here is very different to somebody who is building the tool versus somebody who is [indiscernible] versus somebody who is just editing and running that program.
>> Mona Attariyan:
Correct.
>>: So it would be kind of interesting to see if you take this output to admins and show them the symptoms and the output, see if they actually can fix it.
>> Mona Attariyan: We actually feel that our tool is probably most useful for people who may not have written the code, may not be familiar with the code and they're just using it. Because what it gives you is very general, like a very high level result. It's not going to tell you about variable X. It's going to tell you about this configuration option that you actually have access to and can change.
>>: There's a missing part in the argument where you haven't closed the loop.
>> Mona Attariyan: Whether it's useful in --
>>: Yeah, for admins --
>> Mona Attariyan: Any system, if you can give it to people and they come back and tell you that we used it and it was great, of course it's going to be awesome. We unfortunately didn't have the time to do that. Within our group, we used it and found it interesting. We are making it
available, actually, the source code and it would be interesting to see if
people find it useful as well.
>>: I have a comment. I think the user -- the programs users run on desktops are more complicated than the server programs, like OpenSSH. And would they have -- particularly on Windows, right. If I just open a program, there's a large number of [indiscernible] to be accessed and many files being opened, and they involve many DLLs.
>>: Do you think [indiscernible] is more complicated than [indiscernible]?
>>: She evaluated OpenSSH, the three --
>> Mona Attariyan: Have you seen Postfix? So I believe that there are many applications on the desktop --
>>: So let me reformulate my question. How large is the taint source?
>> Mona Attariyan: How large is the taint source? How big is the configuration file?
>>: Yes.
>> Mona Attariyan: On the order of hundreds of configuration tokens.
>>: Hundreds. I mean, look at a user program and look at how many registry keys they read.
>> Mona Attariyan: Sure. I don't think that is necessarily going to, you know, translate into bad results. Of course, if you have a much larger set, it might. It certainly affects performance. It would be interesting to see if it's going to result in, like, worse output too. I don't necessarily think that it translates directly into worse results and more false positives. I do believe that some of the server applications are pretty complex. Postfix is a nightmare. We also did PostgreSQL; obviously it's a database, and that also was pretty complex. So we didn't just try simple applications.
Yes?
>>: There's actually [indiscernible] outsources IT, hiring someone to fix your
computer over the web. Just thinking you might want to be able to look at this
in context and say you've got X. He's going to have to go through
[indiscernible] today. Maybe this would cut down the time per call.
>> Mona Attariyan: Sure, that would be interesting. All right. So just to finish my thought here, finding a root cause is basically like finding a needle in a haystack. There's a lot of work that you need to do. But once you find it, it's obvious. And the good news is that computers are actually good at finding needles in haystacks, and I think that's why ConfAid turned out to be so successful.
All right. So moving on quickly, I'm going to talk about X-ray. So far, I
talked about configuration problems that lead to incorrect output. And there is another big category of configuration problems that cause performance issues and don't necessarily cause incorrect outcomes. And X-ray deals with those kinds of misconfigurations. So what do you do when you have a performance problem? Usually, people use monitoring tools -- profilers, tracing, logging -- to see what's going on in the system.
The problem with all these tools is that they tell you what events are
happening in your system. What you really want to know is why those events are
happening in your system so now you need to manually infer why and that's the
part that needs a lot of expertise.
So wouldn't it be great if you could automatically infer why as well? And I mean, it would be even greater if you could have a ranked list of root causes. I see a smile. And that's exactly what X-ray tries to do.
So X-ray currently analyzes latency, CPU, disk and network. You can use X-ray
to analyze at the granularity of one single request. For instance, for
applications like servers that handle requests. Or you can analyze over a time
interval. And X-ray also gives you this powerful tool where you can analyze two or multiple different requests that you think should have similar performance, but don't.
So here are a bunch of questions that you can ask X-ray. For instance, you can say: I have a server; why is this request being handled so slowly? Or why is CPU usage high over this time interval? Or: I have these two different requests, I think they should be similar, why are they different?
So let's talk about the idea of X-ray. We call it performance summarization. In ConfAid, as I explained, we were basically interested in
finding out why a certain piece of code, for instance, an error, ran. This bad
red block of code, why did that run? In X-ray, the problem is we don't know
where this red block is, but nothing really prevents us from treating the
entire code like red blocks of code and determining why all events in the code
ran. That's exactly what we do.
So from a really high level, this is how X-ray works. We assign a cost, basically a performance cost, to different events of the execution. Those are instructions and system calls. And then we determine, using a ConfAid-like
analysis, why each of those ran. And then we associate this performance cost with the root causes that we just determined, and then we aggregate over the
entire execution and then we rank the results.
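From that description, the core loop might look like the following sketch. The event format and the even cost-splitting across root causes are illustrative assumptions, not X-ray's actual accounting.

```python
from collections import defaultdict

# Performance summarization sketch. Each event carries a cost
# (latency, CPU, disk or network) and the set of root causes the
# ConfAid-like "why did this run?" analysis attributed to it.
events = [
    {"cost_us": 10,  "causes": {"A"}},       # small block, due to A
    {"cost_us": 100, "causes": {"B"}},       # long syscall, due to B
    {"cost_us": 30,  "causes": {"A", "B"}},  # depends on both
]

totals = defaultdict(float)
for e in events:
    share = e["cost_us"] / len(e["causes"])  # assumed even split
    for cause in e["causes"]:
        totals[cause] += share

# Rank: B (115.0) outranks A (25.0) as a performance contributor.
for cause, cost in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(cause, cost)
```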
So I'm going to use one of the examples that I told you in the last slide. I'm
going to walk you through X-ray, tell you exactly how it works. So let's say
we have a server. It's handling requests and one of the requests is
particularly slow. We want to know why.
So first step, as I mentioned, X-ray analyzes execution. But here, we are
interested in the execution that's related to that single request. Not the
entire execution. And that is not always straightforward. Sometimes you have applications that use multiple processes to handle requests. You know,
the request comes in, it runs for a while in one process. It then goes to
another process and it continues. We are basically interested in all these blue pieces; all of these are relevant to our request.
That's exactly what X-ray does. As the request travels between processes, it
collects all these executions that are relevant and then once it's done, it
basically says okay, these are all of the execution pieces within all of these
process that I care about. So once we have that, then we do the cost
assignment.
As I mentioned, we assign a cost to the events, and the events are instructions
and system calls. Here, we want to see why a certain request is slow so we
want to look at latency. And the latency for system calls is basically the execution time of the system call, which we collect online as part of the recording. And for instructions, we approximate the execution time of each instruction and [indiscernible] to each instruction. Yes, Andrew?
>>: [indiscernible] takes a very long time because of a context switch to somebody else. How does this play into this?
>> Mona Attariyan: Good point. If you analyze that single request, it might
be misleading, because that single request wasn't running. It was just
sitting. What you want to do is look at a time interval, because that would
include other processes that were actually running at the same time.
So here is the point: we give you a bunch of different tools, and, you know,
you're running this on replay, so you can do it multiple times with different
types of analysis. You can analyze a request. You can analyze over a time
interval. You can do different things to figure out what's going on in the
system.
This is something that we kind of rely on you as the admin to figure out. To
basically use the tools the best you can.
Okay. And yes, the timings are all collected online, so the analysis is not
going to perturb the timings. And then once we have the cost, so, for
instance, let's say we have a small block of code, we assign maybe ten
microseconds to it, and then we have a long block of code, maybe it has a
really long system call, and it has a cost of a hundred microseconds.
Then we determine why each of those ran. For instance, very simple case, maybe
the first one ran because of configuration option A, and the second one ran
because of B. We assign the cost to the root causes, and then we aggregate
over the entire execution and we rank.
What does this mean? It tells us that X-ray thinks B is a bigger contributor
to the performance problem than A. So if you're the admin and you want to see
why it was slow, go look at B first, because that's causing a larger
performance cost for you, and then go look at A. So maybe you can't remove B,
but this tells you why it was happening.
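Plugging the numbers from this example into the sketch above (the block and
option names are made up for illustration):

    # short block charged to option A (10 us), long block to option B (100 us)
    events = [("short_block", 10.0), ("long_block", 100.0)]
    causes = {"short_block": "option_A", "long_block": "option_B"}
    print(summarize(events, causes.get))
    # [('option_B', 100.0), ('option_A', 10.0)]  ->  B ranks above A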
Okay. So as I mentioned, X-ray also gives you this powerful tool where you can
compare two different requests and see why the performance of these two
requests differs. We call it differential performance summarization. Here's
how it works. We have two requests. We extract the execution pieces of both
of them, the way that I mentioned. And then we compare them and find the
points where the executions diverge. We call them divergence points.
And then what we do is calculate the cost for each part of the execution, and
the difference is the cost of the divergence point. Then we basically do the
same thing: we find out why the divergence point happened. If it's, I don't
know, maybe an if conditional, one of them took the if part and one of them
took the else part, and that is the difference in cost. We assign the cost to
the root cause, and we do that for all the divergence points.
Finally, we give you the list. It tells you that A is the biggest contributor
to the divergence between the two requests. Not necessarily to the performance
of each one, but to the divergence between the two.
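A rough sketch of differential summarization, under a simplifying assumption:
the two executions are given as already-aligned lists of (branch, cost) steps,
so a divergence is just a step where the branches differ. X-ray's real
alignment of execution pieces is more involved than this.

    from collections import defaultdict

    def differential_summary(exec_a, exec_b, attribute_root_cause):
        # exec_a, exec_b: aligned lists of (branch_id, cost) steps
        totals = defaultdict(float)
        for (br_a, cost_a), (br_b, cost_b) in zip(exec_a, exec_b):
            if br_a != br_b:                  # a divergence point
                delta = abs(cost_a - cost_b)  # cost of the divergence
                cause = attribute_root_cause(br_a, br_b)
                totals[cause] += delta        # charge the difference
        # biggest contributors to the divergence between the two requests
        return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)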
So now, you might ask, okay, I have thousands of requests, how do I know which
two? It's a hard thing to do. So we decided to do that for multiple requests
as well, so you can tell us, okay, I have hundreds of requests. Tell me why
these are having different performance, what is causing the difference, and
what is the cost.
So what we do is that we kind of compare all of them. We find the shortest
path from the beginning to the end, and note that the shortest path is not
necessarily a single request; it's going to be a combination of some of the
requests. And then we find all of the divergences from the shortest path, we
determine the root causes, and we basically give you a visual kind of
explanation: these are the divergence points, these are the causes, yes.
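For the multi-request case, here is a sketch under a strong simplification:
every request is a list of per-step costs of the same length, so the "shortest
path" is just the per-step minimum over all requests, possibly mixing steps
from different requests, as in the talk.

    def shortest_path(requests):
        # per-step minimum cost; may combine steps of different requests
        return [min(costs) for costs in zip(*requests)]

    def divergences(requests):
        base = shortest_path(requests)
        # for each request, the steps where it costs more than the base path
        return [[(i, r[i] - base[i])
                 for i in range(len(base)) if r[i] > base[i]]
                for r in requests]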
>>: Divergence points, are you assuming that [indiscernible] in control flow?
Because you have B previously on the slide.
>> Mona Attariyan: Good point. Give me -- what do you have in mind?
>>: For example, here's a request that comes in that would apply to a file
named A, but it's on a LAN, and a file named B, but it's at Iron Mountain in
Utah.
>> Mona Attariyan: Yeah, yeah. So there are cases, especially when you get to
the system call part, where, because we don't follow that part, the input to
the system call can cause divergence outside of it. That we currently don't
follow, but we should; we have that as a future kind of direction.
But because we don't follow the kernel -- if we were following the kernel, we
would see that eventually as a divergence in control.
>>: There could be, like, you know, the network is congested. That's why it's
taking longer. Would you capture that?
>> Mona Attariyan: We are trying to look at the configuration reasons. Of
course, we can have, you know, my disk is slow because my disk is broken, or my
network is slow. So what we are kind of expecting here is that you as the
admin look at the different potential problems: my hardware not being correct,
my network being slow, or I'm having a configuration problem.
The good thing is that for finding out that, for instance, you know, my
hardware is broken or my network is congested for some reason, there are many
good tools that allow you to explore that and find out about it. We tried to
focus on the configuration side, where we thought there are not that many good
tools.
>>: Do you compare to common performance profiling tools, the kind that tells
you this function is [indiscernible] 20 percent of the time, and so on for
other functions? What's the advantage your tool can provide compared to that
kind of performance profiling tool?
>> Mona Attariyan: That's a great question. Our tool gives you a much higher
level idea of what you can do. So say I'm a user, I'm using an application,
and you tell me that this function is called a lot of times. I can't do
anything. If I'm not a developer, if I'm not looking at the source code,
telling me that this function specifically is running a lot is not helping me
toward the overall, you know, solution of the problem.
What we are trying to give you here is, okay, this option that you can go and
change is causing you trouble. See what I'm saying? So if you're the
developer, that might be a good thing, because then you can go to that option,
that function, and do something. But if you're just using it, giving you a low
level detail of what is going on is going to be useless to you as the user.
>>: I'll wait to see your evaluation.
>> Mona Attariyan: Sure. That goes right here. So this is actually a work in
progress; we're still doing some more evaluation, but these are the
preliminary results. We did Apache, Postfix and PostgreSQL. We found 14 test
cases of performance problems that people found online and, you know,
reported. We recreated them and we ran X-ray. In 12 cases, the first option
that X-ray returned was actually the biggest contributor to the performance
problem. In two cases, it was the third option that it returned.
Yes, Andrew?
>>: Give me an example of what a test case is?
>> Mona Attariyan: Sure. Okay. So let me give you an example, maybe the
PostgreSQL one. PostgreSQL, for instance, has a write-ahead log: as it does
transactions, it writes them into a log and then later commits. And then it
basically does snapshots, or checkpoints, of the log so that if you crash, you
can come back. So if your system is under a lot of load and you do a lot of
checkpoints, it's going to put even more load on your disk.
So the problem that that person described was, you know, my disk is under a
lot of load. And then people suggested, okay, go look at how frequently
you're doing your checkpoints. And the person came back and said, okay, maybe
I'm doing checkpoints too often, or something like that.
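For reference, checkpoint frequency in PostgreSQL of that era was governed by
options like these in postgresql.conf; the values shown are the documented
defaults of the time, not the ones from the reported problem:

    # postgresql.conf (illustrative; defaults, not the reported values)
    checkpoint_segments = 3     # WAL segments between automatic checkpoints
    checkpoint_timeout = 5min   # maximum time between automatic checkpoints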
>>: In the case of a thing like that, you would say here's one request that
went normally. Here's one request that went really slowly because it had to
make a checkpoint while it was doing the request?
>> Mona Attariyan: Okay. So let me first mention that of these 14 cases, for
some of them we did per-request analysis, for some of them we did time
interval analysis, and for some of them we did comparison. For the one that I
just described, we did the time interval analysis, where we said, okay, we
look from this minute to this minute, and then we saw that there's a lot of
disk usage, and then we see that the checkpoint interval option that is in the
PostgreSQL configuration file is causing a lot of that disk usage.
Now, for Postfix and Apache, we had cases where we looked at requests
specifically. For instance, for Apache, there was this request that was
specifically long, and then we figured out that it was doing, you know, extra
DNS lookups.
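The transcript doesn't name the Apache option, but one classic directive that
triggers a reverse DNS lookup for every request is HostnameLookups in
httpd.conf; it is shown here only as a plausible example of this failure mode:

    # httpd.conf (illustrative)
    HostnameLookups On    # forces a reverse DNS lookup per request, for logs
    # the default, HostnameLookups Off, avoids the per-request lookup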
>>: How does the user interact with this information?
>> Mona Attariyan: With the result, or how does it do the actual analysis?
>>: Like, if you run your tool over a time interval and something is
[indiscernible] disk access, what do I as the user give to the system, or what
does it give back to you?
>> Mona Attariyan: So you tell the system, I want to look at disk over this
time interval. And the system gives you --
>>: This is disk access.
>> Mona Attariyan: So as a user, you'd say, okay, my disk seems to have a lot
of load, my network seems to have a lot of load. So the thing is, you detect
a problem as a user first, and then you tell us what you want to look at. You
can, of course, say, okay, over this time interval, tell me about my disk,
tell me about my network, tell me about my latency. You can do that as well.
The good thing is you can run multiple times on the replay and you're all
fine.
But as the user, you need to first detect a problem and then try to diagnose
the problem. The point is that we don't tell you, okay, your disk is doing
that.
>>: You root caused it back to a configuration file?
>> Mona Attariyan: Yes. It might be that your disk is broken. Then, you
know, you're not really looking at a configuration problem. But if you are,
then it tells you.
Okay. So we have a few minutes. I'm going to talk about some of the future
directions that I would like to pursue. With software systems becoming more
and more complex, the problem of software reliability and troubleshooting
seems to just be getting more challenging. I believe that software
reliability is going to be one of the most important research topics in the
future, and I would very much like to pursue a couple of different directions
in this field.
More specifically, I like the problem of troubleshooting software that runs at
larger scale, and also troubleshooting software that runs on platforms with
limited hardware resources.
So, large scale analysis. Today, we have software that runs at scales larger
than ever. We have very complicated distributed systems, and troubleshooting
is especially difficult in these environments. Even before you get to
diagnosis and a solution, you need to detect, as I was explaining in answer to
Andrew's previous question, you need to detect that the problem exists. And
detecting abnormality is not very straightforward in these cases.
Usually, today, it's left to the admin, and the way they do it is that they
basically look at the logs and try to see if they find any abnormality. And
this is really difficult, because they keep the logs to a minimum. So the
question I'd like to answer is, is it possible to automatically find these
kinds of abnormalities in the system? And once you find them, maybe you can
collect more diagnostic information, and then you can do better
troubleshooting analysis in the future.
I would also like to look at troubleshooting for software that runs on
platforms with limited hardware. Mobile computing is bigger than ever; we
have smart appliances everywhere. These platforms run very complex
applications, but they are still very limited in terms of computational
resources and in terms of energy and battery life. So when we're designing
troubleshooting solutions for these kinds of environments, we should take into
consideration all the constraints that they have.
So, for instance, is it possible to maybe offload some of this troubleshooting
to the cloud in a safe, secure and efficient manner, so that we're not using
that much of the precious resources that we have on the platform?
And also, for desktop computers: we have done a good job of making our
applications more user friendly, but when it comes to troubleshooting, we
still have a long way to go. And for desktop computers, any impact that we
have on troubleshooting is going to be huge, simply because of the number of
people who are going to be affected by it.
I have a bunch of ideas about what we can do to make troubleshooting easier on
desktop computers, and I'm going to share two of them with you. The first one
is that configuration state is usually shared; for instance, you have the
Windows registry, things like that. And when you configure one application,
that means you might be breaking another application.
So is it possible to detect that perhaps this thing that you're doing is going
to break something else, and then let the user know, so they're aware of the
consequences of the actions that they're taking?
Another idea is that, you know, usually when you're configuring something new,
a new feature, you might need to change and modify multiple different things.
What people usually do is they do half of it, and then there's a problem at
the end. Is it possible to automatically figure out all the configuration
options that you need to change at the same time, and then tell the user so
they can configure the system correctly?
All right. So, conclusion. Problems, unfortunately, are inevitable in complex
software systems. I showed you that misconfigurations are the dominant cause
of problems in deployed systems these days, and I showed you that execution
analysis can greatly improve diagnosis of these kinds of problems.
I talked about ConfAid and X-ray. They both use dynamic information flow
analysis to do this, and I showed you that they can actually be pretty
successful. And that concludes my talk, and I'd be happy to take more
questions.
>>: I think we have time for one more question. All right.