>> Jon Howell: It's my pleasure to introduce Raja Sambasivan from CMU. He's a Greg Ganger student and he's at the Parallel Data Laboratory, and he's spent a lot of time building distributed systems, which leads him to the problem we're going to learn about today, which is diagnosing the problems that happen in them. Give him a moment to get the recording mic on.

>> Raja Sambasivan: All right. So my name is Raja. Today, I'm going to talk to you about a new technique for diagnosing performance problems in distributed systems.

All right. So what's going on, why give this talk? Well, as probably a lot of you know, diagnosing performance problems in large distributed systems is just really time consuming and just really difficult. And this is because the root cause of the problem can be any one of the many components, subcomponents, or functions of the system. So one of the things we clearly need are tools that will automatically localize the source of a new problem from these many components to just a few relevant ones.

Request flow comparison, the focus of this talk, is a technique I've developed for automatically localizing the source of performance changes. It relies on the insight that such changes often manifest as mutations, or changes, in the paths requests take through the system, that is, the components they visit, the functions they execute and so on, or in their timing. So exposing these differences and showing how they differ from previous behavior can localize the source of the problem and significantly guide developer effort.

All right. So to make this approach a bit more concrete, I want to describe my personal experiences debugging a feature addition to a distributed storage service called Ursa Minor. This storage service is comprised of many clients, many NFS servers, many storage nodes. Did I say metadata servers? Okay. Metadata servers too. But for the focus of this talk, I want to concentrate on a case with just one of each and two storage nodes. If there are any of you who are not familiar with distributed systems terminology or distributed storage terminology, the names of these components don't really matter, just that they represent different things in distributed systems.

All right. So in this system, to access a file's data, clients must first send a request to the NFS server. The NFS server must look up the location of this file's data and things like access permissions and so on by sending a sub-request to the metadata server. This metadata server access might itself incur a storage node access. And once the NFS server has the metadata, or the location of that file's data, it's free to access the storage nodes directly to obtain that information and then respond to the client.

So there are two key things to note here, right. First, all this work done within the components of the system, across the components and so on, is the work done to service this request, the work done to obtain the data the client requested. This is also what I mean when I say the path of the request through the system, or the request flow. It's this workflow that you see here for individual requests, right. Second thing to notice is that in this architecture, every single file access from the client requires a metadata server access, right, and hence a network round trip, right.
This isn't that big of a deal for really large file workloads that access a few really large files, because the cost of the single metadata server access is amortized over many storage node accesses, right. So it's not that big of a deal. But it's a much larger problem for workloads accessing a lot of really small files, because then the cost of this metadata access becomes much more prevalent. So for a small file, maybe there's only one storage node access and one metadata server access. So the cost of this metadata server access is much more prominent and, you know, much more visible in the end-to-end latency.

So being a new grad student at the time, I thought I would help out. So I thought I would add a feature to the system that would make things better for small file workloads. We added a very simple feature called server-driven metadata prefetching, right. And it's a very simple feature, and all that happens is when the NFS server has to access the metadata server to obtain the location of a file's data, the metadata server will return additional related locations, or metadata, for files that it thinks will be accessed soon in the future.

So in this way, hopefully you're limiting future accesses to the metadata server and improving the performance of the small file workloads. So with this feature addition, clients would access the metadata server -- sorry, the NFS server. The NFS server would already have the metadata, the location of the file's data, prefetched, and so the NFS server could access the storage nodes directly and then respond to the client.

All right. So you'd think this would be faster, right? I mean, there are fewer arrows. It really should be faster, right? But as it turned out, adding metadata prefetching killed the end-to-end performance of a lot of the nightly benchmarks we ran on the system. We had no clue why. What's going on?

And so I want to take a quick break here and I want to ask you guys about how you'd go about diagnosing this problem. What would you do? Maybe parse a couple of log files, add printouts here and there, maybe use GDB in some of these components, right? Take a second.

All right. I want to quickly go over how I initially tried to solve this problem. First, I worked really, really hard trying to figure out what was going on. This guy here is drinking coffee. In my case, it would be Diet Cokes, right. Second, just give up. And third, convince myself I was never, ever going to graduate. And it still might not happen. Who knows.

On the other hand, request flow comparison, this technique I developed for diagnosing performance changes, would have immediately identified the mutation, or the change, that gave rise to this performance problem. It would show that certain requests observed after the feature addition, that is, after this performance change, accessed both the NFS server and the metadata server but incurred many repeated database lock accesses, several levels of abstraction below where the prefetching function was added. We show these extra database lock accesses accounted for a significant amount of time, here about three milliseconds. We also show the precursor of this mutation, that is, the anticipated flow of the same request before the performance change and before the feature addition. We show these requests also accessed the NFS server and the metadata server, but incurred many fewer of these database lock accesses.
So by comparing the two, and seeing the difference between them, perhaps I would have gained the intuition necessary to get to the root cause of this problem; that is, the cost savings generated by prefetching were outweighed by the extra cost of doing the prefetching itself. So here we see that request flow comparison helped us out by showing how this changed flow, which was much slower, differed from the original flow, and gives a starting point for our diagnosis efforts.

So in summary, our approach of comparing request flows identifies distribution changes in request flow timing and structure, and in doing so, localizes a new problem to just the components that are responsible. Now, request flow comparison can't be used to diagnose all sorts of problems, right. For example, it can't be used to tell why performance has always been slow in a system, but it does have many use cases. For example, you can use it to diagnose performance regressions. You can diagnose regressed performance with the -- go ahead.

>>: So if you had been perfectly omniscient, though [indiscernible] would you then have had a win?

>> Raja Sambasivan: So the question basically is why was this a performance problem? It was a problem because we were prefetching the wrong items, for example, right. Yes, that was one of the issues. We were perhaps about 75 percent to maybe 80 percent efficient in prefetching things, but there was another architectural issue, and that's that the metadata server didn't know what -- wasn't keeping track of what had already been prefetched, right, and so it would sometimes retrieve duplicate items. So I guess the take-away there is yes, there's an entire host of things after the problem localization that a developer has to keep track of and understand to get to root cause, right. But the tool here was giving me a starting point for those efforts.

So as I said, you can use request flow comparison to diagnose regressions. You can compare request flows generated from after the regression, or the slowdown, to the request flows generated before the slowdown to understand why performance is so much slower during the regressed period.

Another interesting use case of request flow comparison is to diagnose workload degradations within a single period. You can imagine extracting the last thousand requests seen for a workload and comparing them to the first thousand to see why the last thousand were so much slower.

A third interesting use case of request flow comparison is to eliminate the distributed system as a root cause. You can imagine comparing request flows between two periods and finding that the tool says that nothing within the system has changed. There are no changes in timing or in the path through the system, in components accessed, the functions executed and so on. And perhaps that would give you some intuition that the problem's not internal to the system, that it's perhaps caused by something external -- perhaps external processes, something in the environment and so on.

A final point here is that request flow comparison is a first step towards this larger goal of creating self-healing systems that are capable of diagnosing problems automatically without human intervention, and I'll get back to this a bit later at the end of this talk.

All right. So for the rest of this talk, I'm going to talk about an implementation of request flow comparison in a tool I built called Spectroscope.
I'm going to describe case studies of diagnosing real, previously undiagnosed problems in a distributed storage service called Ursa Minor, which is identical to the architecture I showed you earlier in slide 3 of this talk. We also used request flow comparison to diagnose problems in certain Google services, though that isn't the focus of this talk. Finally, I'll end with my road map for achieving this goal of automated diagnosis and self-healing systems.

So getting to the more fun parts of this talk. On this slide, I'm going to show the workflow of request flow comparison and how it's implemented in this tool I built called Spectroscope. And I'm going to use this as my outline slide for the majority of this talk.

So Spectroscope takes as input request flow graphs, right, from a non-problem period and a problem period. It bins similar requests from both periods into the same category and uses these categories as the fundamental unit for comparing request flows. Using statistical tests and various other heuristics, it identifies those categories that contain performance-affecting changes, which we call mutations. We consider two types, response time and structural, and I'll talk about these in more detail later in the talk. It takes these categories that contain mutations and it ranks them according to their contribution to the overall performance change so as to guide developer effort. And finally, it presents results to the user in a nice user interface.

So for the rest of this talk, I'm going to talk about the heuristics for each of these components of Spectroscope's workflow. In choosing these heuristics and algorithms, we tried to choose ones that were simple and ones that limited false positives, which we feel are probably the worst failure mode of an automated diagnosis tool like Spectroscope because they waste developer effort.

All right. So first let's talk about how we obtained these graphs. To obtain request flow graphs, we leverage end-to-end tracing techniques, which capture the control flow of individual requests within and among the components of the distributed system. End-to-end tracing has been described extensively in previous research, for example in Magpie, Stardust, and X-Trace, and it's even used in production Google data centers, via Google Dapper.

And for those unfamiliar with this, it works as follows. Trace points are either automatically or manually inserted in key areas of the distributed system's software. Now, they can usually be automatically inserted at component entry and exit points. For example, if there's a shared RPC library, you can add instrumentation there and now you have the entry and exit points of the components of the system instrumented automatically. And then once you have the component-level instrumentation, developers are responsible for adding instrumentation points within components as they see fit. And when I say instrumentation points and trace points here, just think printf. Just a simple log statement saying where you are in the code, what component you're in, and what behavior is being described.

So once you have these instrumentation points in the system, during runtime, the end-to-end tracing system keeps an ID with each request that's observed. And it associates this ID with logs of these trace points that are captured at the various components.
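To make the trace point idea concrete, here is a minimal sketch in Python. It is only illustrative: the names (RequestContext, trace_point, and so on) are my own assumptions, not the actual Ursa Minor, Stardust, or Dapper APIs, and a real system would carry the ID and sampling decision in RPC metadata rather than a local object.

    import time

    class RequestContext:
        """Propagated with a request across components, e.g., inside RPC metadata."""
        def __init__(self, request_id, sampled):
            self.request_id = request_id   # assigned once, at the system's entry point
            self.sampled = sampled         # sampling decision also made once, up front

    TRACE_LOG = []  # per-component log; records are later joined by request_id

    def trace_point(ctx, component, name):
        """Think printf: record which request this is, where we are, and when."""
        if not ctx.sampled:
            return  # unsampled requests pay almost nothing
        TRACE_LOG.append({
            "request_id": ctx.request_id,
            "component": component,
            "trace_point": name,
            "timestamp_us": time.monotonic_ns() // 1000,
        })

    # Example: the NFS server handling a sampled read request.
    ctx = RequestContext(request_id=42, sampled=True)
    trace_point(ctx, "nfs_server", "read_call")
    trace_point(ctx, "nfs_server", "cache_miss")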
And what happens is offline, or later during execution, these IDs can be used to tie together the trace points accessed by individual requests and create these graphs of the request flow within and among components of the system.

So end-to-end tracing can be implemented with very little overhead, less than 1 percent, as long as request-level sampling is used, in which a random decision is made at the entry point of the system whether or not to capture any trace points or any instrumentation for that particular request.

So this is an example of a request flow graph for a read operation in the [indiscernible] storage system. In this case, the data is actually stored on two different storage nodes. In a real system, it might be stored over many more, but I can't get everything to fit on one slide here.

>>: Is 1 percent [indiscernible] performance flows most of the time and then every once in a while --

>> Raja Sambasivan: Yeah, so the question is, is 1 percent sampling enough? The answer is it depends, right. So one thing that Spectroscope doesn't try to target is anomalies. It looks for distribution changes in timing and in the topology or structure of requests. So we're not going after really rare cases to start off with. Second, you are right, the actual sampling rate that is required is a trade-off, or a compromise, between the workload size and the amount of data you need to capture for statistical confidence. And that depends both on performance variance and also the incoming request rate. So you might have a really large workload that has very little performance variance -- in the best case, it's deterministic and you capture one sample -- or you have a really small workload with lots of performance variance and you capture around 1 percent. You can actually do the math to work out what you need to sample. So it's not, you know, just blind.

>>: So this random sampling, does each entry make its own sampling [indiscernible].

>> Raja Sambasivan: Yeah. So the question is, do individual trace points make their own decision or is it correlated globally? The answer is it's correlated globally. So it's a decision made at the very first trace point, at the entry point of the system, the top level of the system.

>>: [inaudible].

>> Raja Sambasivan: There's a special ID that says don't sample that's propagated with the request. The library call sees that. It just doesn't capture a trace record for it.

All right. So on this screen, you see an example request flow graph of a read operation. So let's say there's, as I said, a read call. It enters the system, and so the top-level component of Ursa Minor, the NFS server, logs this NFS read call trace point. Next, let's say this request travels to the cache component of the NFS server, and it misses, right. So here we see another trace point that says we're missing in the cache. We also see that the request takes about 100 microseconds to travel from the entry point of the system to the cache component.

So next, the request travels to the network layer of the NFS server so that two concurrent RPCs can be made to retrieve the data that the client requested. And you see that to get to this component, it took about ten microseconds. Next, we see two concurrent reads to the storage nodes on which this data is stored. In a real request flow graph, there would be additional nodes and edges showing work done on each of these components. But once again, I need to make stuff fit on the slide, so everything is abstracted down to a tiny little bubble.
I had debated bringing a much larger projector for a real graph, but I haven't done that yet.

So we see the work done on storage node one finishes first. It takes about 3.5 milliseconds here. So the graph would show that this storage node replied first. We see the reply at the NFS server and we also see the network time. In this case, we have a really slow network, so it took about 1.5 milliseconds for the data to be transferred from the storage node to the NFS server.

>>: So what are you seeing about [indiscernible].

>> Raja Sambasivan: That's always a good question. So we actually do not assume [indiscernible] in our implementation. So when I show these times here, what we're doing, actually, is we're creating a happens-before relationship between the two components, and then I'm capturing the end-to-end time on the calling component and basically just dividing that by two and placing it on each of the edges. So it isn't perfectly accurate, but it doesn't assume synchronized clocks.

Now, I was at Google for several months and I implemented all this stuff on top of their tracing system. What I found really amusing is they did not care at all. They didn't care. All their clocks had skew. They didn't care. If they saw a negative time on the edges, they would just reset it to zero and move on. So I found this a really interesting disparity with what they were doing in production and practice.

>>: They're not doing that because they're clowns or --

>> Raja Sambasivan: They just didn't care. It wasn't a big deal.

>>: Did they not care because they're clowns or did they not care because there's some deeper reason why it's okay to arbitrarily truncate to zero? I don't know. I mean, to me, that seems like a bit of a silly risk, because you actually do want to capture these anomalies with some notion of time.

>> Raja Sambasivan: I think that for them, their network tended to be really fast to start off with, and that did not tend to be the source of many problems they ran into. If that became an issue, I'm sure they would take care of it.

>>: [inaudible].

>> Raja Sambasivan: Yeah, pretty much. I mean, and you know, in the case where there was a really large slowdown, right, having synchronized clocks -- you want your clocks to be synchronized to within a small amount, right. But if your slowdown is like a second or something, it's still going to show up. So they just didn't care.

All right. So sometime later, the second storage node replies. Once again, the network transfer time is really slow. And I want to add that we had really slow networks at CMU, so for us, this is a big deal. And once the NFS server has all the data necessary to reply to the client, it logs one final instrumentation point saying it's done, and then this is the request flow graph that you'd see for this request.

So a couple of take-away points. First, nodes of these request flow graphs show trace points reached. They're basically printouts or log statements, just indicating what function the instrumentation point is captured in, which component you're in, and so on. And edges show latencies between successive trace points.
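To make that concrete, here is one minimal way such a request flow graph could be represented. This is my own illustrative encoding, not Spectroscope's actual data structure; the latencies are just the rough numbers from the read example above, and the second storage node's time isn't given in the talk, so that value is a placeholder.

    # Nodes are trace points reached (printf-like log statements); each edge carries
    # the latency, in microseconds, between two successive trace points.
    read_request_flow = {
        "edges": [
            # (from_trace_point,          to_trace_point,             latency_us)
            ("nfs_server.read_call",      "nfs_server.cache_miss",         100),
            ("nfs_server.cache_miss",     "nfs_server.rpc_send",            10),
            ("nfs_server.rpc_send",       "storage_node_1.read_done",     3500),
            ("nfs_server.rpc_send",       "storage_node_2.read_done",     4000),  # placeholder
            ("storage_node_1.read_done",  "nfs_server.read_reply",        1500),  # slow network
            ("storage_node_2.read_done",  "nfs_server.read_reply",        1500),
        ],
    }

    def edge_latencies(flow):
        """Map each (from, to) edge to its latency, the quantity the later
        per-edge statistical comparisons operate on."""
        return {(src, dst): lat for src, dst, lat in flow["edges"]}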
The answer is we usually instrument this at the RPC library so there's still a lower level of kernel level things. >>: We meaning [indiscernible] trace point in the library and that's as far as you get? >> Raja Sambasivan: Yeah, that's as close as we got. We're not going to [indiscernible] to get, right. I know when I was at Google, they were exploring options to actually instrument the kernel to capture things like this. >>: The network is slow because, you know, [indiscernible] big enough and [inaudible]. >> Raja Sambasivan: I mean, you can imagine also, you know, if you care enough, instrumenting those queues, right, and also capturing things like ping times and things like that to figure out where your [indiscernible] product should be, right. So the next step of Spectroscope's work flow is -- well, to summarize so far, I've shown how we obtained the request flow graphs from both a non-problem period before a performance change and a problem period after performance change. The next step of the work flow is to group similar performing requests from both periods into the same category. There's a lot of this identified mutations or performance affecting changes by comparing per category and per period distributions. In our case, we choose to define similar as requests with identical structures and topologies. That is requests for components execute the same functions, 11 have the same amount of parallelism, and so on. I'm not going to go into too much detail about why we chose this particular grouping, but there is a lot more detail in the RSDI paper about this stuff, and I can also field questions here. So once we've categorized these requests into groups, so requests on both categories are grouped into the same category, the next step with Spectroscope's work flow is to identify those categories that contain performance affecting changes or mutations. >>: Sorry, can you back up one slide, make sure I understand. The categorization step, you put every request into a bucket labeled with the topology [indiscernible] request. >> Raja Sambasivan: Pretty much, yes. >>: So how can you ever have a structural mutation if everything's in the same bucket? Mutations are -- is a difference between two things in the same bucket. >> Raja Sambasivan: So the question is how do you have structural changes if you're grouping requests from the same topology to the same category. I'll get to this in a bit more detail later. But just a quick high level answer is timing changes compare stuff in the same bucket, the same category. Structural changes compare stuff across buckets. >>: Oh, okay. >> Raja Sambasivan: Diving right into that, the first type of mutations we consider is response time mutations. And these are requests that are just timing changes. They have the same topology, the same structure in both periods. They visit the same functions. They access the same components and so on. It's just that they've gotten slower. So to help developers diagnose problems that manifest as this type of mutation, what we want to do is first we want to identify that they exist. And second, we want to localize the problem by showing specific interactions along these request flows that cause a timing change. That slowdown. Now, when comparing request flows that identify response time mutations and 12 localize them, we can't really expect response times to be identical between both periods. So there's always going to be some natural variance in timing between both periods. 
So one key thing Spectroscope has to do is to separate such natural variance from true distribution changes or mutations. And to do this, we use this [indiscernible] hypothesis test. So these tests compare two distributions to determine whether there's enough evidence to reject the hypothesis that both distributions are, in fact, the same and just vary because of random effects. There is enough evidence, usually about enough to guarantee a false positive rate of less than about five percent, then the distributions are claimed different and we'll identify them as containing mutations. So here is a picture of the response time distributions for a one category. And we see that the requests assigned to this category from the non-problem period shown in the black line are very different from the distribution of response times for requests assigned to this category for the problem period. >>: [indiscernible]. >> Raja Sambasivan: Yeah, the [indiscernible] response time, right. So you see that they have very different means and the variance around the means is very small. They don't overlap a lot. So it's very clear that these are different distributions. They represent different behaviors. So here, we'll be confident enough to identify these as containing mutations. The second category is very different, right. You see that the response time distribution of requests assigned to it from the non-problem period, shown with the black line, is very similar to the response time distribution for requests from the problem period. Their means are very similar. They overlap a lot. They're very, very large. Here, we won't be confident enough so we won't identify these as containing mutations. Now, a question you might have here is why not just use a raw threshold. Why not just say requests that are twice as slow in the problem period. Should it be identified as containing mutations. And the answer to that is that using statistical tests limits the false positive rate. You are guaranteed a false positive rate no more than five percent. So essentially, you're limiting 13 wasted developer effort by using these more robust tests. >>: What test guarantees that? >> Raja Sambasivan: So we're using a non-parametric hypothesis test. just using a [indiscernible] test. >>: Okay. So we're Doesn't that test assume that all of the data points are IID? >> Raja Sambasivan: Yeah, there are certain assumptions in there, right. So it does assume that the various request response times are, you know, are identically independently distributed, right. In practice, that's held up better than other tests, like, for example, we don't do things like assume that the distributions are normal, for example, or Gaussian. So there are some assumptions thrown in there. But in practice, it's worked out. Now, there are cases where that assumption doesn't hold, I do agree, right. But in practice, it's been okay. >>: So I mean, one, perhaps, important assumption you can make this on is that there aren't sources of [indiscernible] changes in the workload that cause all of you're a measurements to be different than your B measurements. If the network suddenly got chattier the second day after you made your [indiscernible], it's going to shift a lot of these curves. Is there any statistical technique you can use to detract that, or do you need to, or do you want to? >> Raja Sambasivan: Right. So I think there are multiple questions in there. The first one is what happens [indiscernible] high level workload changes, right. 
In that case, what this technique will show you is changes in the workload, or changes in timing and so on as a result of the workload change, right.

>>: What if you say, oh, look, in every category there was a mutation. What have you done? I mean, how will the system distinguish between changes that are because there was a mutation in an underlying system and changes due to the fact that there are shifts in the external factors?

>> Raja Sambasivan: Right. So an assumption of this workflow, and I'll get to this in a bit, is that the workloads are roughly similar between both periods. If they are very different, yeah, you would see effects due to workload changes. We do try to handle a bit of it, for example just extra load, without actually trying to identify it as extra load specifically. But in general, we do see some [indiscernible] in those periods.

All right. So here's an example of a response time mutation. There's a request that accesses the NFS server in Ursa Minor, reads data from the storage node, and then replies. It's just a very simple request. We see it's much slower in the problem period as compared to the non-problem period. In the problem period, this request took about one millisecond, whereas in the non-problem period, it only took about 110 microseconds. So in addition to identifying this as a response time mutation, or as a category containing response time mutations, Spectroscope will iterate through the edge latency distributions seen here, and it will identify the work done in the storage node as responsible for the timing change, by applying the same statistical tests on the edge latency distributions. So once again, for response time mutations, in addition to identifying them, the problem is localized by identifying the specific interactions responsible for the overall timing change.

So the second type of mutation Spectroscope considers is structural mutations. These are requests that have changed in topology or structure. In the problem period, they visit different components, they execute different functions, they have different parallelism than in the non-problem period. So to localize the root cause of these problems, what we want to do is identify what we call the precursors, or the anticipated path of the same request during the non-problem period. So before the performance change, what would this request have looked like structurally is what we want to find out.

And this is necessary for two reasons. First, it lets us identify the cost of a structural change. If you execute a different component, is that slower? How much of a performance change do you see as a result of that? That's something to identify. The second reason we want to identify the precursors is to identify how these two differ or diverge, because that's the point where the developers should start looking when they want to debug this problem. So to find a starting point for the diagnosis efforts, we want to identify where these request flows, precursor and mutation, start to diverge.

So an example of a structural mutation precursor pair is from the metadata prefetching problem I showed you earlier in the talk, where a structural mutation is a request that accesses the metadata server and incurs many repeated database lock accesses, whereas in the flow before that change, right, the precursor also accesses the metadata server but incurs many fewer of those database lock accesses.
Another example of a structural mutation precursor pair might be requests that now must get their data from far-off data centers. Somewhere, I don't know -- Vancouver is not that far, right? So somewhere in Asia, where before they were getting the data from somewhere in Vancouver, or maybe Redmond, right. So once again, Spectroscope has to identify both the mutation and the precursor of the mutation, the anticipated path of that request before the performance change.

To identify categories that contain structural mutations, we assume similar but not identical workloads are executed in both periods. And so what this means is that categories that contain more problem period requests than non-problem period requests must contain structural mutations, because those extra requests must have come from somewhere. They must have mutated from something else. Similarly, categories that contain fewer problem period requests than non-problem period requests must have donated precursors, because they've decreased in frequency, so those requests have gone somewhere else. In this case, we actually use a threshold instead to differentiate natural variance from categories containing mutations. So you might say a category that contains 50 more problem period requests than non-problem period requests contains structural mutations, and a category that contains, say, 50 fewer contains donated precursors.

>>: If you think about your example, I mean, it sounds like you've actually instrumented [indiscernible]. But if you just look at the network message layer, then what you would, in fact, see is that the structural mutation, or the cases where you cache it on the front end, they went faster. And so what you'd find is that the problem categories are the ones that didn't [indiscernible].

>> Raja Sambasivan: So can you clarify that question again?

>>: So if you had instrumented [indiscernible] and you didn't see this database lock over and over and over again, and you compare the precursor set -- before you did your optimization, which looked like that complicated path, I mean, even if you just skipped [indiscernible]. And then you looked at the piece after your optimization, there is a structural mutation. Because in the previous case you always went up to the metadata server and back. In the new one, you most of the time don't go up to the metadata server and back. So the structural mutation, what you're chopping off, is not --

>> Raja Sambasivan: Right.

>>: And the places where they vary, where the change has been introduced, your performance is better. The structural mutation is positive. Which misses the whole point of [indiscernible] optimization. And the real problem is that you --

>> Raja Sambasivan: We were talking about that specific example. So what you would see is you would see that, yeah, there are extra cache hits, so you can see that's faster, right. You would also see that when you did go to the metadata server, the time you spent at the metadata server would be much larger. So you would see both.

>>: You made the statement that here what you're going to do is you're going to find certain mutations and this is the source of the problem. And, in fact, the structural mutation at the network message [indiscernible] was not the source of the problem. In fact, it was a benefit. Does that make sense?

>> Raja Sambasivan: Yes, you would see certain things have gotten faster. It's a structural change with a positive instead of negative performance impact, right.
But we would also show you that there were certain other requests that have gotten slower, because when they do go to the metadata server, they take more time.

>>: In the first one, you would just ignore it.

>> Raja Sambasivan: In fact, it wouldn't show up in the ranked list. The second one is the thing that made it slower. And I think you bring up a bigger point. Depending on your instrumentation, timing changes turn into structural changes, right?

>>: [indiscernible] deeply enough if you had instrumented all the way [indiscernible] because it's deterministic in that layer?

>> Raja Sambasivan: Yeah, and so it is -- you can also imagine extending this. Say that you had a really large system. You couldn't afford to instrument everything outright. You could imagine having timing changes --

>>: [inaudible].

>> Raja Sambasivan: But just within a large system, right, you might want to start out with component-level instrumentation, like we started out, right. And then --

>>: That's exactly why I stick to top level [indiscernible].

>> Raja Sambasivan: And you would expose additional information within components as you saw timing changes. All right, so --

>>: Before, about ten minutes ago, you sort of said thresholds are bad, you said thresholds are bad because [indiscernible]. And here, you're saying we'll just use thresholds.

>> Raja Sambasivan: I really wanted to use a statistical test here too. I wanted to build a model of what types of structures we saw for various workloads and use that, and I think you could definitely do that and there's a straightforward way to do that. But I looked at a lot of this while I was at Google, and they told me their request structures change often enough, because they're a bunch of uncoordinated teams changing things at different levels -- I think the Bigtable guys might push an upgrade randomly -- that they didn't think it was worth the effort to do that, because they didn't think the models would last long enough. So that's why I didn't do that here. So is there another question?

All right, so once we've identified the [indiscernible] categories that contain structural mutations and the categories that donated precursors, we still have this mapping problem, right? We want to identify which precursors are likely to have donated how many requests to which structural mutation categories.

So we do this using a combination of three heuristics. We use a weighted assignment scheme that uses the three features shown on this graph. First, the overall type of the request, read versus write. So we assume that reads can't turn into writes, for example. That's a very basic assumption, right? Second, we look at the count of requests in each category. So a precursor cannot donate more requests than it has decreased by between both periods. Here NP shows non-problem, P shows problem. So this precursor can only donate about, what, 850 requests or so, and this mutation can only accept up to whatever 917 minus 480 is, right. And third, structural similarity. We assume that precursors are likely to turn into requests that look more like them, that the changes will be localized, essentially. So we use these three features to create a weighted assignment of how many requests a structural mutation category can accept from a given precursor.

Here's an example of a structural mutation again. Sorry.

>>: You said you make a weighted assignment. Does that mean that you personally had to assign some arbitrary weights?
Is there any meaningful way to compare those, or do you just kind of have to [indiscernible]?

>> Raja Sambasivan: You have to take a bunch of features that -- so the question is how did we pick these heuristics, basically, right?

>>: Just basically heuristic in terms of -- is there a sensitivity analysis there?

>> Raja Sambasivan: So there isn't really -- I mean, there isn't a perfect way to do this. It isn't possible to figure out exactly what turned into what else, right.

>>: Is there even a way to know whether the assignments that have been made are fairly insensitive to change or if you're --

>> Raja Sambasivan: Right, right. The only answer I can give you there is it's worked in practice across all the problems we looked at. And the assignment is simple; it's a linear thing. So there are the three features, they're weighted equally, and if your precursor and mutation look very similar structurally, then, you know, the weight you give that particular relationship is higher.

>>: They can't be weighted equally because they're different units.

>> Raja Sambasivan: They're normalized equally. So I guess, sorry, yeah, you're right. I mean, the only thing that determines how many requests are contributed is the structural similarity, I guess. So sorry, I managed to open up a -- so here, we had three features, right. The number of requests in each category just determines the upper limit of how many requests can be contributed. Second, the type of the request -- we assume that reads can't turn into writes. And third is structural similarity, right. So the first two are basically limiting cases and the third thing determines how many requests you contribute.

So here's an example of a structural mutation and its precursor. Once again, it's the metadata prefetching problem that I showed you earlier in the talk, where the mutations are requests that accessed the metadata server and incurred many of these repeated database lock accesses, whereas the precursors also accessed the metadata server but incurred many fewer of these database lock accesses. And once again, for structural mutations, the root cause is localized by identifying how the mutation and the precursor differ from each other, because this is what gives developers a starting point for the diagnosis efforts.

So once we've identified the total set of categories that contain response time mutations and structural mutations, we still need to rank them by their contribution to the overall performance change. And ranking is useful for two reasons. First, there may be more than one problem in the system, and in that case, this ranking is useful to help focus the developer effort on the problems that affect performance the most. The second reason is because one problem can actually create many different mutations. So consider the case of a bug in a cache that causes more requests to miss in that cache in the problem period, right. So great, one mutation: requests miss in the cache versus hitting that cache previously.
So for response time mutations, there's a number of requests in that mutated category times the change in average response time between the problem and non-problem periods. For structural mutations, it's the number of problems affected, the number of extra requests in the structural mutation category times the change in response time between the structural mutation category and its precursor category. >>: -- So the rank is not at all based on statistical significance that you got >> Raja Sambasivan: Statistical significance is used as a filter at the very first step, so if ->>: So if something is under five percent, it's in, if it's over five percent, it's out? >>: Other than that, it's just straight? >> Raja Sambasivan: Yeah, so it's a top level filter, right. I can talk a little bit about that later in this talk. >>: Work flow. And [indiscernible]. >> Raja Sambasivan: Yeah, it might be, actually. We can either talk about this offline or later in the talk, but I have some slides on visualizing these results and this comes out there. All right. So far, I've described how we've obtained the non-problem period graphs and problem period graphs. How they're categorized together into these groups, how these categories are analyzed to see if they contain mutations, either structural or response time, and how they're ranked together, based on their contribution to the overall performance change. 21 The next step of Spectroscope's work flow is to present these results to the user in hopefully a pretty user interface. Hopefully. So I believe visualization is really important for automatic problem localization tools, like Spectroscope. These tools don't get to root cause of a problem automatically. Rather, they only localize the problem and they leave that last logical leap of getting to the root cause to the developer. So it's really important for them to present the results well. Here's a screen shot, a real screen shot of one of three potential user interfaces we're developing for Spectroscope. We just started working on it. But I wanted to show you guys a real picture. The UI shows a ranked list of the different mutations. The numbers are just IDs of the different mutations. But the key thing is they're ranked according to their effect on the performance change. Users can click any of them. Here, we selected the first highest ranked one, and the user can also select whether he wants to see the mutation or the precursor. In this case, we selected the mutation. And the graph shown is the request flow for the mutation. User can also decide to see the precursor instead. In this case, the graph will change. With this interface, we also have this nifty little start button that will quickly animate between both precursor mutation. So it will tease out the structural changes. In fact, I'm so excited about this work that I have another slide on it. So let me start this thing here. So what you see here is the mutation for the metadata prefetching problem that I've been using as an example throughout this talk. So the tool will show you the mutation and then users can scroll through it, and they might see hey, look, there are a lot of these accesses here. You see the green boxes, MDS lock acquired, MDS lock released, right. And seems to keep going for a while, because we have these locks instrumented. And in the end, there's some other activity, RPC to another component shown in yellow. We come back to the metadata server, see all the metadata locks again. 
So at this point, you might want to scroll back up to the top and say, oh, look, did this happen before, in the precursor, before this performance change? So what we'll do in this case is maybe center ourselves on one of these locks, we start the animation, and then we see that these locks just seem to disappear. And what's interesting is that the request seems to get shorter. You see the other instrumentation points move up. So you might want to confirm the hypothesis by zooming out of it, which I think will happen in a second. See, yeah, look, the graph does get much shorter without those locks. Check it out. And so maybe this convinces you this should be where to start your diagnosis efforts for this particular problem.

All right. So that was the workflow for Spectroscope. Next, I want to talk about some case studies of using Spectroscope on real problems. And this is the really fun part of this talk. So we've performed eight case studies so far of using Spectroscope to diagnose real and synthetic problems in both Ursa Minor and Google.

Most of our experiences using Spectroscope are qualitative. They're real experiences of developers debugging real, previously unsolved problems using the tool. But we also wanted to get a quantitative sense of Spectroscope's utility. So for the Ursa Minor case studies, we compute two metrics over the ranked list of results that it generates. First, the percentage of the top ten highest-ranked results that are relevant, or useful, for helping developers debug the problem. And second, the percentage of all results that are useful for helping developers debug the problem at hand.

The first metric is clearly the most important. Developers are going to investigate the highest-ranked items first, so it's important that those be useful. It's kind of like searching on Google. But the second gets at that overall idea of the robustness of the output returned by the tool.

And I just want to be fully [indiscernible] about the word relevant here. So when I say relevant, for problems we hadn't diagnosed before, what we do is we have a developer, or we ourselves, go in and use Spectroscope to localize the problem to just a few components, a few interactions, and use that to get to the root cause. And once we identify the root cause, we go back and say, oh, look, here are all the results that Spectroscope generated. Did they lead us in the right direction? Did they identify the right changes that were on the track towards the root cause?
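For concreteness, the two metrics amount to something like the following sketch, assuming a non-empty ranked list and that each ranked category has simply been labeled relevant or not after the root cause was found; the function name and labeling scheme are illustrative, not Spectroscope's.

    def relevance_metrics(ranked_category_ids, relevant_ids):
        """ranked_category_ids: the tool's output, highest rank first.
        relevant_ids: categories judged useful for reaching the root cause."""
        top_ten = ranked_category_ids[:10]
        pct_top_ten = 100.0 * sum(c in relevant_ids for c in top_ten) / len(top_ten)
        pct_all = 100.0 * sum(c in relevant_ids for c in ranked_category_ids) / len(ranked_category_ids)
        return pct_top_ten, pct_all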
Once again, it doesn't really matter if you guys know the names, what these names mean, just that they're different problems. And this last problem, the spikes problem here has a not applicable in front of it because it was an example of using Spectroscope to show that nothing essentially changed and telling developers that they should really focus their debugging efforts elsewhere, in the environment, for example. >>: [inaudible]. >> Raja Sambasivan: For the spikes one? So it was an interesting case study. So what happened in this problem is that we noticed that every two weeks or so, one of our bench marks that we ran, we run these nightly benchmarks in Ursa Minor. Every two weeks, one [indiscernible] run much slower. And only every Sunday, every other Sunday, right. And so we used Spectroscope to compare request flows between a slower run of that benchmark and a normal run. And it generated nothing. Like there were no timing changes internal to the system. There were no structural changes, so the axis we were considering for changes showed no difference. And so we thought okay, maybe let's take Spectroscope at its face value here and say make it isn't the system itself. Maybe there's something else going on. And we eventually correlated to scrubbing activity from the net filer that the client was running on. So what was going on was that CPU activity was causing requests to be issued more slowly from the client and the arrival time 24 requests was different. All right. So as I said, our experience with using Spectroscope is qualitative, they're examples of real people, real developers using Spectroscope to diagnose real problems. So I want to give you guys an example of a work flow of how Spectroscope fits into the daily routines of developers. And so I've actually already described the prefetching problem here and that's the case study I've been using throughout this talk. I'd also like to talk about this configuration problem. I want to point out before we move on that all of our Ursa Minor case studies used a five component configuration of Ursa Minor with one client, one NSF server, one metadata server and two storage nodes. One storage node was used for storing data and the other one was used, was dedicated to metadata server for storing metadata. All right. So at some point a developer of Ursa Minor committed a pretty large piece of code that caused a huge slowdown in a variety of the ten benchmarks we saw in the system. So the question is what was going on here? We used Spectroscope to compare request flows from after that are he gression so before the regression and we found that Spectroscope identified 128 total categories of mutations. It turns out most of these categories showed the same structural change. Where in the non-problem period, the precursor, in the precursor request flows, the metadata server was accessing its own dedicated storage node to read and write metadata. In the problem period, the metadata server was instead using the data storage node for most of its accesses and this was increasing the load in the system slightly, and so performance slowed down. So this is great. We've used Spectroscope and localized the bug to an unexpected interaction between the metadata server and the data storage node. But you know what? It's still not clear why this problem is occurring. This isn't the root cause, right? Why is the metadata server all of a sudden using the wrong storage node. 
And so we had this hypothesis that the mutations, the change in the storage node access was perhaps caused by change in low level parameters. Perhaps the client had changed some configuration, was sending some function called parameters that simply said used a different storage node, perhaps changes in the RPC parameters or in function call parameters somewhere that were telling us to use a different storage node. 25 And so to investigate this hypothesis, we used a feature of Spectroscope that I haven't described yet. I'll quickly go over now. And what this feature lets you do is lets you pick a structural mutation category as a precursor and identify the low level parameters, a function called parameters climb parameters and so on that best differentiate the two. In order to do this, we use a very classic machine learning technique, regression trees. So when we picked the highest ranked structural mutation category, and its precursor category, and we ran it through this feature of Spectroscope, we found that the Spectroscope immediately identified changes in the fault tolerance scheme, the encoding scheme for data, and things like that. It turns out that these parameters are only set in one file. It's this really obscure configuration file that has never changed, except for the developer changed it just for the purpose of testing and had forgotten to change it back. The interesting thing about this is this configuration file does not live in the virtual control system. So no one even knew it had changed. So this was a root cause to a very obscure configuration file in the system. There are seven more case studies in the NSDI paper we did. If you guys are interested, I encourage you to have a look at it. It's really great, because it reads like a mystery novel. So if you guys like mystery novels, this is really the paper to read. But before I conclude, I want to talk a bit about my future research agenda. All right. So to motivate my future research, I want to place the work I've done so far in the context of modern data centers and clouds. So we all know these environments are growing rapidly in scale. They contain many complex distributed services that depend on other complex distributed services to run. So, for example, at Google, a distributed application or dependence configuration, a table store such as a big table such as distributed file systems such as GFS, the scheduling software decides where to place different applications, the authentication mechanism, the network topology and so on. So if you guys thought problem diagnosis was hard so far, say based on this talk, it's only getting harder. 26 And so a lot of experts believe that because it's increasing difficulty, we need -- there's this need for creating self-healing systems that are capable of fixing a lot of problems automatically without human intervention. And this also forms the basis of my long-term research goals. So here's my road map to creating these self-healing services. This is at least three steps, right? The first step is to continue to create techniques that help developers diagnose and understand complex distributed services. For example, by automatically localizing source of a new problem to just a few relevant components, right. Developers possess a large amount of knowledge and insight, which they synthesize to diagnose complex problems. So hoping to replace immediately just isn't a feasible approach. The second step is I like to call it this yang to the first step's yin, I guess. 
In that what we all seem to do is need to learn how to build more predictable systems. The first step says learn how to debug and understand complex systems. The second step says hey, look, there might be simple things we can do that can help us build systems that are easier to analyze and easier to understand. A simple example is many automated problem localization tools like Spectroscope expect variants to be bounded or limited in certain axis to work. The system's predictable in this respect. And so a tool that helps developers find unwanted sources of performance variants in the systems and potentially move them if they think it's appropriate would be something that would help these automatic localization tools work so much better. It's just one simple idea there. And the third step is use the techniques that work best from creating these tools to help developers and use them as a substrate on which to build further automation. For example, automation that takes automatic corrective action, which is the next big step towards self-healing systems. A large portion of my graduate career so far is focused on this first step. Request flow comparison, my thesis work, is the obvious one. I also studied what knobs and data I need to expose by distributed system in order to enable better tools and more automation in the context of building distributed storage system called Ursa Minor. For example, I helped build the end-to-end tracing system that was used in Ursa Minor for a lot of these case studies. 27 I helped create parallel case, which is a tool for helping identify dependencies between nodes in parallel applications. And then I also helped create a tool for predicting the performance of workloads that moved across storage devices, which is our work in symmetrics a few years ago. But there's still a lot more to be done here. I think one key area is tools for automatically localizing problems in what I call adaptive cloud environments. So these environments essentially schedule workloads so it's increased utilization of the data center as much as possible. So to satisfy global constraints, they may move running workload from one set of machines to another, which are running a complete different set of other applications than the set you were running with before. If the amount of resources available in the system is high, because there are not many other things running, they may give you the lot of resources. They may give you a lot of CPU. They may give you a lot of memory, only to steal back those resources the minute some other application needs them. So all these things are great. They increase utilization in the data center a lot, but they do so at the cost of predictability and variance. And there's a lot of existing localization tools out there simply not as useful in these environments. And so I think that for these environments, a new class of tools will be necessary, and these tools will have to have intimate knowledge of the data center's scheduler's actions. They'll need to know when the scheduler decides to move an application from one set of machines to another so as to note expected performance change and so as to know how much of a performance change to expect. For similar reasons, they'll have to have knowledge of the data center topology and the network. And so I'm interested in building these tools in your future. I'm not going to talk too much about the second step here. Instead, I want to spend a quick slide on the third step. 
So to get us to the point of taking automatic corrective actions, I think a number of sub-steps are necessary. First, we need to catalog the actions developers take when fixing performance problems: moving workloads to different machines, reverting code, adding resources, and so on. Next, we need to find good machine learning techniques to help map between problem symptoms and potential actions. Third, and I think this is the most important step, we need to create models that help us estimate the cost and benefit of taking these various actions. For example, would you pick an action that is guaranteed to fix a problem, say in Amazon EC2, but that will result in guaranteed downtime, or would you pick an action that may or may not fix the problem, say one with a 30 percent chance of fixing it, but that will result in degraded performance for, let's say, three to six hours? How do you choose between these options? I think various models, economic scenarios, and so on will be necessary to help us weigh these potential actions. And once we have some initial work here, we will really be on the path towards creating tools that take these actions.

So I'm done with this talk. To quickly summarize, I described request flow comparison, my thesis research. I showed its effectiveness by showing how it was used to help diagnose real, previously undiagnosed problems in two systems, Ursa Minor and certain Google services. I'm also happy to report that one startup company I've talked to has actually implemented request flow comparison and is considering using it in a product, which I thought was really cool. And finally, I described my plan for achieving the goal of self-healing systems, of which request flow comparison is the first step. I believe future research on this path will require strong collaborations between systems people like myself, machine learning experts, statistics experts, and visualization folks, and I look forward to working with them towards this larger goal. Thank you.

>>: So would you talk about your decision to use statistical significance?

>> Raja Sambasivan: Oh, yes, I forgot about that. So yes, you're right, the initial version of Spectroscope ignores timing changes that aren't statistically significant. That sounded like a good idea when we started, and I still think it's a reasonable approach. One of the things we ran into came out of the visualizations: we actually ran a user study with the different visualizations to compare them, and what we found is that statistical significance is weird. You can have something with a large timing change but really high variance that isn't statistically significant, and you can have something with a really small timing change but really low variance that is statistically significant. So users would see some of these things that weren't automatically marked by the tool as statistically significant, and because they weren't experts in statistics, they would start wondering why the tool was showing this change but not that change. And so we started out saying false positives are really important, we want to avoid them.
Turns out false negatives are really important too, because they convince people that perhaps their mode of thinking is not in line with the tool's, and that maybe they're doing something wrong or maybe the tool is doing something wrong. So I think moving forward, we need to modify the test we use a bit to account for these issues.

>>: Modify it in what way?

>> Raja Sambasivan: Well, you can imagine weighting statistical significance so that if, for example, you do see a timing change that is really large but has high variance, maybe you identify it anyway. So maybe you have a continuous spectrum of normalized values that says, statistically this isn't significant, but accounting for the magnitude of the change, I want it identified anyway. Conversely, you might not want to identify really small timing changes even if they are statistically significant. So I think some kind of normalization factor that accounts for the magnitude would be useful [see the illustrative sketch below].

>>: Should that happen on a parametric test?

>> Raja Sambasivan: Why not?

>>: For instance, if I'm using a [indiscernible] to compare two sets, all it cares about is the fraction of one set that's greater than the other set. So you take the two, A is greater than B, and so on and so forth. So it's not a matter of the variance as much as whether one set tends to be greater than the other set. Seems to be --

>> Raja Sambasivan: Yeah, that could be true, but then the magnitude of the change would still matter. I might be interpreting your question incorrectly, but it seems like you could still have lots of timing changes that are very small, that you perhaps wouldn't care about, and still have one set be greater than the other; it's just that those changes are small enough not to really matter.

>>: A non-parametric test is one where you're not assuming any distribution; variance is generally something you see in a test that assumes a normal distribution.

>> Raja Sambasivan: I don't think I agree with that, but we should take this offline, I guess.

>>: So what fraction of the real diagnoses that you guys played with required the structural mutations? Was one of those tools [indiscernible] or were they both critical?

>> Raja Sambasivan: That's a good question. So the question is whether there were more structural mutations or response-time mutations. In Ursa Minor, a system still in development, we saw a lot of structural changes, and I think that's because things are changing often: people are putting in different algorithms, different cache replacement policies, and so on. At Google, we saw more timing changes, but their tracing [indiscernible] wasn't as detailed as the one we had in Ursa Minor, so it might be that structural changes were just masked. So it's hard to say. One thing we found: in this talk I presented structural mutations and response-time mutations as distinct entities, but in reality you usually see structural mutations and timing changes together.

>>: In the same category?

>> Raja Sambasivan: In the same category, right.

>>: [inaudible].

>> Raja Sambasivan: Yeah, that's right.
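[Illustrative sketch, not part of the talk: one way to realize the magnitude-aware modification discussed in the exchange above is to flag a timing change when it is both statistically detectable and non-trivial, or when the median shift is so large that it should surface even if high variance makes it statistically insignificant. The choice of a two-sample Kolmogorov-Smirnov test, the thresholds, and the sample latencies are assumptions for illustration, not Spectroscope's actual implementation.]

    # Sketch: combine a non-parametric significance test with a magnitude check.
    from statistics import median
    from scipy.stats import ks_2samp

    def flag_timing_change(before_ms, after_ms, alpha=0.05,
                           min_shift_ms=0.5, large_shift_ms=5.0):
        """Flag a change that is significant and non-trivial, or simply huge."""
        _, p_value = ks_2samp(before_ms, after_ms)       # two-sample K-S test
        shift = abs(median(after_ms) - median(before_ms))
        if shift >= large_shift_ms:                      # huge shift: flag even if noisy
            return True
        return p_value < alpha and shift >= min_shift_ms

    # Hypothetical edge latencies (milliseconds) before and after a change.
    before = [1.0, 1.1, 0.9, 1.2, 1.0, 1.1, 0.95, 1.05]
    after = [3.9, 4.2, 4.0, 4.1, 3.8, 4.3, 4.0, 4.05]
    print(flag_timing_change(before, after))             # True: large, consistent shift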
>> Raja Sambasivan: Related to that last point, one approach I've been thinking about for instrumenting larger systems: we instrumented HDFS using end-to-end tracing, and we found that because of the granularity of the writes that are issued, which are usually 64 or 128 megabyte chunks that are then broken down into 64 kilobyte packets and pipelined to three different replicas, we get really large [indiscernible] from them. The raw instrumentation, if you want to capture low-level activity, captures every 64 kilobyte packet, so we end up with graphs that natively have 5,000 or 6,000 nodes for every single write. So among the approaches we're considering is a zooming approach, where we explicitly exploit this notion that timing changes turn into structural changes. You start out --

>>: You turn the knob at [indiscernible] time rather than at collection time?

>> Raja Sambasivan: Or the tool automatically does that. It starts out with a much lower granularity graph that is relatively small, finds changes in specific areas, and zooms into just those areas to expose more detail.

>>: Exposes structural changes.

>> Raja Sambasivan: Yeah, that's exactly right.

>>: [indiscernible] basic performance issue can be very complicated. In some of the cases, [indiscernible] depending upon workload and [indiscernible].

>> Raja Sambasivan: So you're saying the path depends on [indiscernible].

>>: Yeah, [indiscernible] performance issues and really fine [indiscernible] performance issue because of the workload characteristics. [inaudible] cases we see is [indiscernible]. Can you comment on how your tool is going to help [indiscernible]?

>> Raja Sambasivan: Let me clarify the question a bit. Are you talking about dependencies between requests, or are you talking about large write --

>>: I'm talking about maybe in the [indiscernible] systems. Depending on what kind of load, the system's performance may be different.

>> Raja Sambasivan: Right.

>>: And there may be a slight change of the load that, from the [indiscernible] point of view, looks similar, but actually [indiscernible].

>> Raja Sambasivan: Right. So let me answer the first part of the question, about differing load. One of the things we looked at adding is a notion of simple queueing theory. You can imagine saying that the expected response time of different types of requests should scale linearly or sub-linearly with load up to a certain point, at which point it is bound to increase rapidly. So we looked at incorporating that model and not identifying something as a mutation if the load is greater and the performance has only increased by a relatively small amount that is accounted for by the queueing model [see the illustrative sketch below]. So that's one way of accounting for small load changes. I'm not sure I got the second half of the question, or maybe I even --

>>: It may be a real basic problem. [indiscernible] if a system is serving [indiscernible], if you're sending mostly large reads, when you're sending small plus large reads, big performance issues for large reads [indiscernible].

>> Raja Sambasivan: So you're talking about dependencies between things, right. Resource contention is another example of this. Certain slowdowns might not be caused by the system software; they might be caused by another request, from another client, trying to access the same resource. The question is what do you do there. This tool currently will not identify that as the root cause. It will show you that something has slowed down, but it won't say it's because of some other request.
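[Illustrative sketch, not part of the talk: a minimal version of the simple queueing-theory check mentioned above, using an M/M/1-style model to decide whether an observed slowdown is explained by the higher load. The model choice, rates, and tolerance are assumptions for illustration, not the tool's actual design.]

    # Sketch: is a latency increase explained by increased load, per an M/M/1 model?

    def expected_latency_ms(arrival_rate, service_rate):
        """Mean M/M/1 response time, with rates in requests per second."""
        if arrival_rate >= service_rate:
            return float("inf")                          # saturated: latency blows up
        return 1000.0 / (service_rate - arrival_rate)

    def explained_by_load(new_rate, observed_ms, service_rate, tolerance=1.5):
        """True if the observed latency is within a tolerance factor of the model's
        prediction at the new load, i.e., not worth flagging as a mutation."""
        return observed_ms <= tolerance * expected_latency_ms(new_rate, service_rate)

    # Hypothetical numbers: load rises to 80 req/s against a server that can handle
    # 100 req/s; the model predicts roughly 50 ms, so 60 ms is unsurprising, while
    # 400 ms would still be flagged as a mutation.
    print(explained_by_load(80, observed_ms=60, service_rate=100))   # True
    print(explained_by_load(80, observed_ms=400, service_rate=100))  # False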
>> Raja Sambasivan: That's actually one of the key pieces of future work we're looking at: how do you deal with things like resource contention? Could you extend the tool so that it says, here's a slowdown, here's a structural change, and the reason it occurred is that another request, from another client, is trying to use the same resources and is interfering badly with this request?

>>: Your answer is only about [indiscernible] resource. There are other issues beyond resources. [indiscernible] has proven itself is complicated. Depending on [indiscernible] workload put in, and also depending on the physical characteristics of the underlying device, there may be issues. And maybe the issue only [indiscernible] specific combinations.

>> Raja Sambasivan: Yeah.

>>: And that combination actually is [indiscernible].

>> Raja Sambasivan: An example of this is configuration. Yeah, I think what you really want there is a tool that explores the different configuration options. This tool will not identify that automatically for you. It won't say that these specific combinations of things, if you ran them, would be bad; you'd have to explicitly run them and then look and say, oh, they're bad. That's an interesting space of research that this tool doesn't cover.

>> Jon Howell: One more question.

>> Raja Sambasivan: Sure.

>>: I'm trying to figure out the purpose of this statistical significance testing rather than, say, looking at the change in where the fraction of my total time has gone. I'm thinking about a situation in which you've got a caller that takes a million units of time to do something and a callee it calls that takes one unit. And we realize we can make a change where the callee goes from one unit of time to ten, but we save a hundred units of time on the caller, so it's a net win by a factor of ten. But when you look at the callee, it's gone from one unit to ten units, a big, statistically significant change, whereas the caller goes from a million to a million minus a hundred, which is lost in the noise. So the statistical test will say we've just made things a lot worse in the callee, and the improvement in the caller will be lost in the noise. And suppose something else that ships along with this improvement screws something else up somewhere, so the system as a whole is now performing worse. Won't this send me to exactly the wrong place?

>> Raja Sambasivan: Well, no, because first of all, we're running the original [indiscernible] test. I'm not sure I'm answering this question correctly, but my answer would be that we're running the original statistical test on the end-to-end timing, and that end-to-end time is the sum of all the other times. So we've identified that request flow as being a mutation, as having a change, because of the end-to-end timing change. Then we're looking at each of the individual components, each of the individual edges. We're running the same test there, and we rank the edges or interactions identified as statistically significant by their timing change. So you say this thing changed by a thousand units, here are the ten interactions within that request flow that changed statistically, and these are the ones that affect the end-to-end timing the most. So you do have this drill-down process, I guess.
So you're initially looking at the entire coherent picture, and then you're looking at individual pieces of it and seeing how they contribute to the overall picture. So there's a correlation between all the times, I guess. Maybe I'm not answering the question.

>>: I do not see where you're running the test.

>> Raja Sambasivan: So the tests are run at multiple levels. They're run on the end-to-end time, which incorporates all the other times, and they're also run on all the pieces that build that time up. So you can --

>>: If the end-to-end time of this request was bad because something was messed up elsewhere, then you go in and look at it.

>> Raja Sambasivan: In that case this tool will not identify any of the edges. If there are cross-dependencies like that, for example if resource contention causes this, what might happen is that you identify the request flow as having a significant timing change, but none of its specific interactions are identified as contributing to it. That might be one use case. More likely, there will be a bunch of edges that have changed as a result of the contention. At a high level, what you're saying is that you might see the end-to-end timing change, but none of the interactions within it seem to have changed, because something else is causing the problem. That's the high-level statement you're making, right? Is that the case you're considering?

>>: We'll take it offline.

>> Raja Sambasivan: Okay.

>> Jon Howell: Thanks.