>> Jon Howell: It's my pleasure to introduce Raja Sambasivan from CMU. He's a
Greg Ganger student and he's at the parallel data laboratory, and he's spent a
lot of time building distributed systems, which leads him to the problem we're
going to learn about today, which is diagnosing the problems that happen in
them. Give him a moment to get the recording mic on.
>> Raja Sambasivan: All right. So my name is Raja. Today, I'm going to talk
to you about a new technique for diagnosing performance problems in distributed
systems.
All right. So what's going on, why give this talk? Well, as probably a lot of
you know, diagnosing performance problems in large distributed systems is just
really time consuming and just really difficult. And this is because the root
cause of the problem can be any one of the many components, sub components, or
functions of the system.
So one of the things we clearly need are tools that will automatically localize
the source of a new problem from these many components to just a few relevant
ones. Request flow comparison, the focus of this talk, is a technique I've developed for automatically localizing the source of performance changes. It relies on the insight that such changes often manifest as mutations, or changes in the path requests take through the system -- that is, the components they visit, the functions they execute and so on -- or in their timing.
So exposing these differences and showing how they differ from previous
behavior can localize the source of the problem and significantly guide
developer effort.
All right. So to make this approach a bit more concrete, I want to describe my personal experiences debugging a feature addition to a distributed storage service called Ursa Minor. This storage service is comprised of many clients, many NFS servers, many storage nodes. Did I say metadata servers? Okay. Metadata servers too. But for the focus of this talk, I want to concentrate on a configuration that has just one of each and two storage nodes. If
there are any of you who are not familiar with distributed systems terminology
or distributed storage terminology, the names of these components don't really
matter, just that they represent different things in distributed systems.
All right.
So in this system, to access a file's data, clients must first send a request to the NFS server. The NFS server must look up the location of this file's data and things like access permissions and so on by sending a sub-request to the metadata server. This metadata server access might itself incur a storage node access.
And once the NFS server has the metadata, or the location of that file's data, it's free to access the storage nodes directly to obtain that information and
then respond to the client.
So there are two key things to note here, right. First, all this work done, you know, within the components of the system, across the components and so on, is the work done to service this request. The work done to obtain the data the client requested. This is also what I mean when I say the path of the request through the system, or the request flow. It's this flow that you see here for individual requests, right.
Second thing to notice is that in this architecture, every single file access
from the client requires a metadata server access, right, and hence a network
round trip, right.
This isn't that big of a deal for large file workloads that are accessing
a few really large files, because the cost of the single metadata server access
is amortized over many storage node accesses, right. So it's not that big of a
deal.
But it's a much larger problem for workloads accessing a lot of really small files, because then the cost of this metadata access becomes much more prevalent. So for a small file, maybe there's only one storage node access and one metadata server access. So the cost of this metadata server access is much more prominent and, you know, it's much more visible in the end-to-end latency.
So being a new grad student at the time, having just started, I thought I would help out. So I thought I would add a feature to the system that would make things better for small file workloads. We added a very simple feature called metadata -- server-driven metadata prefetching, right. And it's a very simple feature, and all that happens is when the NFS server has to access the metadata server to obtain the location of a file's data, the metadata server will return additional related locations, or metadata, for files that it thinks will be accessed soon in the future.
So in this way, hopefully you're limiting future accesses to the metadata server and improving the performance of the small file workloads. So with this feature addition, clients would access the metadata server -- sorry, the NFS server. The NFS server would already have the metadata, the location of the file's data, prefetched, and so the NFS server could access the storage nodes directly and then respond to the client.
All right. So you'd think this would be faster, right? I mean, there are fewer arrows. It really should be faster, right? But as it turned out, adding metadata prefetching killed the end-to-end performance of a lot of the nightly benchmarks we ran on the system. We had no clue why. What was going on?
And so I want to take a quick break here and I want to ask you guys about how
you'd go about diagnosing this problem. What would you do? Maybe parse a couple of log files, add printouts here and there, maybe use GDB in some of these
components, right? Take a second.
All right. I want to quickly go over how I initially tried to solve this problem. First, I worked really, really hard trying to figure out what was going on. This guy here is drinking coffee. In my case, it would be Diet Cokes, right.
Second, just give up. And third, convince myself I was never, ever going to
graduate. And it still might not happen. Who knows.
On the other hand, request flow comparison, this technique I developed for diagnosing performance changes, would have immediately identified the mutation or the change that gave rise to this performance problem. It would show that certain requests observed after the feature addition; that is, after this performance change, accessed both the NFS server and the metadata server but incurred many repeated database lock accesses, several levels of abstraction below where the prefetching function was added.
It would show that these extra database lock accesses accounted for a significant amount of time, here about three milliseconds. It would also show the precursor of this mutation; that is, the anticipated flow of the same request before the performance change and before the feature addition. It would show that these requests also accessed the NFS server and the metadata server, but incurred many fewer of these database lock accesses.
So by comparing the two and seeing the difference between them, perhaps I would have gained the intuition necessary to get to the root cause of this problem; that is, the cost savings generated by prefetching were outweighed by the extra cost of doing the prefetching itself.
So here we see that request flow comparison helped us out by showing how this changed flow, which was much slower, differed from the original flow, and it gives us a starting point for our diagnosis efforts.
So in summary, our approach of comparing request flows identifies distribution
changes in request flow timing and structure, and in doing so, localizes a new problem to just the components that are responsible.
Now, request flow comparison can't be used to diagnose all sorts of problems,
right. For example, it can't be used to tell why performance has always been
slow in a system, but it does have many use cases. For example, you can use it
to diagnose performance regressions. You can compare the regressed performance with the -- go ahead.
>>: So if you had been perfectly omniscient, though [indiscernible] would you
then have had a win?
>> Raja Sambasivan: So the question basically is why was this a performance problem? It was a problem because we were prefetching the wrong items, for example, right. Yes, that was one of the issues. We were perhaps about 75 percent to maybe 80 percent efficient in prefetching things, but there was another architectural issue, and that's that the metadata server didn't know what -- wasn't keeping track of what had already been prefetched, right, and so it would sometimes retrieve duplicate items.
So I guess the take-away there is yes, there's an entire host of things after the problem localization that a developer has to keep track of and understand to get to root cause, right. But the tool here was giving me a starting point for those efforts.
So as I said, you could use request flow comparison to diagnose regressions. You can take request flows generated after the regression, or the slowdown, and compare them to the request flows generated before the slowdown to understand why performance is so much slower during the regressed period.
Another interesting use case of request flow comparison is to diagnose workload degradations within a single period. You can imagine extracting the last thousand requests seen for a workload and comparing them to the first thousand to see why the last thousand were so much slower.
A third interesting use case of request flow comparison is to eliminate the distributed system as a root cause. You can imagine comparing request flows between two periods and finding that the tool says that nothing within the system has changed. There are no changes in timing or in the paths through the system, in the components accessed, the functions executed and so on. And perhaps that would give you some intuition that the problem's not internal to the system; it's perhaps caused by something external. Perhaps external processes, something in the environment and so on.
A final point here is that request flow comparison is a first step towards this
larger goal of creating self-healing systems that are capable of diagnosing
problems automatically without human intervention and I'll get back to this a
bit later at the end of this talk.
All right. So the rest of this talk, I'm going to talk about an implementation
of request flow comparison and a tool I built called Spectroscope. I'm going
to describe case studies of diagnosing real, previously undiagnosed problems in
a distributed storage service called Ursa Minor, which is identical to the
architecture I showed you earlier in slide 3 of this talk.
We also used request flow comparison to diagnose problems in certain Google
services, though that isn't the focus of this talk. Finally, I'll end with my road map for achieving this goal of automated diagnosis and self-healing
systems.
So getting to the more fun parts of this talk. In this slide, I'm going to
show the work flow of request flow comparison and how it's implemented in this
tool I built called Spectroscope. And I'm going to use this as my outline
slide for the majority of this talk.
So Spectroscope takes as input request flow graphs, right, from a
non-problem period and a problem period. It bins similar requests from both
periods into the same category and uses these categories as a fundamental unit
for comparing request flows.
Using statistical tests and various other heuristics, it identifies those categories that contain performance-affecting changes, which we call mutations. We consider two types, response time and structural, and I'll talk about these in more detail later in the talk.
It takes these categories that contain mutations and it ranks them according to
their contribution to the overall performance change so as to guide developer
effort. And finally, it presents results to the user in a nice user interface.
So for the rest of this talk, I'm going to talk about the heuristics for each of these components of Spectroscope's work flow. In choosing these heuristics and algorithms, we tried to choose ones that were simple and ones that limited false positives, which we feel are probably the worst failure mode of an automated diagnosis tool like Spectroscope, because they waste developer effort.
All right. So first let's talk about how we obtained these graphs. To obtain
request flow graphs, we leverage end-to-end tracing techniques which capture
the control flow of individual requests within and among the components of the
distributed system. End-to-end tracing has been described extensively in
previous research. For example, in Magpie, Stardust, X-Trace, and it's even used in production Google data centers, via Google Dapper.
And for those unfamiliar with this, it works as follows. Trace points are
either automatically or manually inserted in key areas of distributed system
software. Now, they can be automatically inserted at component entry and exit
points, usually. For example, if there's a shared RPC library, you can add instrumentation there, and now you have the entry and exit points of the components of your system instrumented automatically.
And then once you have the component level instrumentation, developers are
responsible for adding instrumentation points within components as they see
fit. And when I say instrumentation points and trace points here, just think printf. Just a simple log statement saying where you are in the code, what component you're in, and what behavior is being described.
So once you have these instrumentation points in the system, during runtime, the end-to-end tracing system keeps an ID with each request that's observed. And it associates this ID with logs of the trace points that are captured at the various components. And what happens is offline, or later during execution, these IDs can be used to tie together the trace points accessed by individual requests and create these graphs of the request flow within and among the components of the system.
So end-to-end tracing can be implemented with very little overhead, less than 1
percent, as long as request level sampling is used, in which a random decision
is made at the entry point of the system whether or not to capture any trace
points or any instrumentation for that particular request.
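As a rough illustration of that mechanism (my own Python sketch, not the actual Stardust or Dapper API; names like TraceContext and trace_point are hypothetical), the sampling decision is made once at the entry point and carried with the request, and every trace point is just a tagged log record:

    import random
    import time
    import uuid

    SAMPLE_RATE = 0.01  # capture roughly 1 percent of requests

    class TraceContext:
        """Per-request context propagated with every call and RPC."""
        def __init__(self):
            self.request_id = uuid.uuid4().hex
            # The sampling decision is made once, at the entry point of the
            # system, and carried with the request; components never re-decide.
            self.sampled = random.random() < SAMPLE_RATE

    TRACE_LOG = []  # in a real system, a per-component trace buffer

    def trace_point(ctx, component, name):
        # Log a trace record ("think printf") only if this request is sampled.
        if ctx.sampled:
            TRACE_LOG.append({
                "request_id": ctx.request_id,
                "component": component,
                "trace_point": name,
                "timestamp": time.time(),
            })

    # Example: the NFS server logs its entry point and a cache miss.
    ctx = TraceContext()
    trace_point(ctx, "nfs_server", "READ_CALL")
    trace_point(ctx, "nfs_server.cache", "CACHE_MISS")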
So this is an example of a request flow graph for a read operation in the
[indiscernible] storage system. In this case, the data is actually stored in
two different storage nodes. In a real system, it might be stored over many
more but I can't get everything to fit in one slide here.
>>: Is 1 percent [indiscernible] performance flows most of the time and then every once in a while --
>> Raja Sambasivan: Yeah, so the question is, is one percent sampling enough? The answer is it depends, right. So one thing that Spectroscope doesn't try and target is anomalies. It looks for distribution changes in the timing and topology or structure of requests. So we're not going after really rare cases to start off with.
Second, you are right. The actual sampling rate that is required is a trade-off, or a compromise, between the workload size and the amount of data you need to capture for statistical confidence. And that depends both on performance variance and also the incoming request rate. So maybe you have a really large workload that has very little performance variance, and so you can -- in the best case, it's deterministic and you capture one sample.
Or you have a really small workload with lots of performance variance and you
capture around 1 percent. You can actually do the math to work out what you
need to sample. So it's not, you know, just blind.
>>: So this random sampling, does each entry make its own sampling
[indiscernible].
>> Raja Sambasivan: Yeah. So the question is do individual trace points make
their own decision or is it correlated globally. The answer is it's correlated
globally. So it's a decision made at the very first trace point, at the entry point of the system, the top level of the system.
>>: [inaudible].
>> Raja Sambasivan: There's a special ID that you don't sample that's
propagated with the request. The library call sees that. It just doesn't
capture a trace record for it.
All right. So on this screen, you see an example request flow graph of a read operation. So let's say there's, as I said, a read call. It enters the system, and so the top level component of Ursa Minor, the NFS server, logs this NFS read call trace point. Next, let's say this request travels to the cache component of the NFS server, and it misses, right. So here we see another trace point that says we missed in the cache.
We also see that the request takes about 100 microseconds to travel from the entry point of the system to the cache component. So next, the request travels to the network layer of the NFS server so that two concurrent RPCs can be made to retrieve the data that the client requested. And you see that to get to this component, it took about ten microseconds. Next, we see two concurrent reads to the storage nodes on which this data is stored.
In a real request flow graph there would be additional nodes and edges showing
work done on each of these components. But once again, I need to make stuff fit on the slide, so everything is abstracted down to a tiny little bubble. I had debated bringing a much larger projector for a real graph, but I haven't done that yet.
So we see the work done on storage node one finishes first. It takes about 3.5
milliseconds here. So the graph would show that this storage node replied
first. We see the reply at the NFS server and we also see the network time. In this case, we have a really slow network, so it took about 1.5 milliseconds for the data to be transferred from the storage node to the NFS server.
>>: So what are you seeing about [indiscernible]?
>> Raja Sambasivan: That's always a good question. So we actually do not assume [indiscernible] in our implementation. So when I show these times here, what we're doing, actually, is we're creating a happens-before relationship between the two components, and then I'm capturing the end-to-end time on the calling component and basically just dividing that by two and placing it on each of the edges. So it isn't perfectly accurate, but it doesn't assume synchronized clocks.
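A tiny sketch of one plausible reading of that divide-by-two heuristic (my own illustration, not the actual implementation): take the round trip measured on the caller's clock, subtract the time the callee reports spending on the work, and split the remainder evenly across the two network edges, so no cross-machine clock comparison is ever needed.

    def edge_latencies(send_ts, reply_ts, callee_service_time):
        # Estimate the two network-edge latencies for one RPC using only the
        # caller's clock plus the callee's self-reported service time; no
        # clock synchronization between machines is assumed.
        round_trip = reply_ts - send_ts
        network_time = max(round_trip - callee_service_time, 0.0)
        half = network_time / 2.0  # split evenly; approximate but skew-free
        return half, half  # (request edge, reply edge)

    # Caller saw a 5.2 ms round trip; callee reports 3.5 ms of work,
    # so each network edge is charged roughly 0.85 ms.
    print(edge_latencies(0.0, 0.0052, 0.0035))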
Now, I was at Google for several months and I implemented all this stuff on top
of their tracing system. What I found really amusing is they did not care at
all. They didn't care. All their clocks had skew. They didn't care. They
saw a negative time on the edges, they would just reset it to zero and move on.
So I found this really interesting disparity between what they were doing in
production and practice.
>>: They're not doing that because they're clowns or --
>> Raja Sambasivan: It wasn't a big deal. They just didn't care.
>>: Did they not care because they're clowns, or did they not care because there's some deeper reason why it's okay to arbitrarily truncate to zero? I don't know. I mean, to me, that seems like a bit of a silly risk because you actually do want to capture these anomalies with some notion of time.
>> Raja Sambasivan: I think that for them, their network tended to be really
fast to start off with, and that did not tend to be the source of any problems,
many problems they ran into. If that became an issue, I'm sure they would take
care of it.
>>: [inaudible].
>> Raja Sambasivan: Yeah, pretty much. I mean, and you know, in the case where there was a really large slowdown, right, with synchronized clocks, you know, you'd want your clocks to be synchronized to within a small amount, right. But if your slowdown is like a second or something, it's still going to show up. So they just didn't care.
All right. So sometime later, the second storage node replies. Once again, the network transfer time is really slow. And I want to add that we had really slow networks at CMU, so for us, this is a big deal. And once the NFS server has all the data necessary to reply to the client, it logs one final instrumentation point, saying it's done, and then this is the request flow graph that you'd see for this request.
So a couple of take-away points. First, nodes of these request flow graphs show trace points reached. They are basically printouts or log statements, just indicating what function the instrumentation point is captured in, which component you're in and so on. And edges show latencies between successive trace points.
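The offline stitching step might then look roughly like this (my own sketch, reusing the hypothetical record fields from the earlier sampling sketch; real request-flow graphs also capture forks and joins for concurrency, which this toy version ignores):

    from collections import defaultdict

    def build_request_flow_graphs(trace_log):
        # Turn a flat list of trace records into one graph per request.
        # Nodes are trace points; edge weights are latencies (in seconds)
        # between successive trace points of the same request.
        by_request = defaultdict(list)
        for record in trace_log:
            by_request[record["request_id"]].append(record)

        graphs = {}
        for request_id, records in by_request.items():
            records.sort(key=lambda r: r["timestamp"])
            nodes = [(r["component"], r["trace_point"]) for r in records]
            edges = []
            for prev, curr in zip(records, records[1:]):
                latency = curr["timestamp"] - prev["timestamp"]
                edges.append(((prev["component"], prev["trace_point"]),
                              (curr["component"], curr["trace_point"]),
                              latency))
            graphs[request_id] = {"nodes": nodes, "edges": edges}
        return graphs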
>>: When the request crosses machines, does the library capture the very last moment that the request occurs on, or are you incorporating [indiscernible] in that network timing, various queues in [indiscernible]?
>> Raja Sambasivan: That's a good question. So the basic question is how close are we to the exit point of the component at the -- yeah. The answer is we usually instrument this at the RPC library, so there are still lower level, kernel level things below that.
>>: We meaning [indiscernible] trace point in the library and that's as far as
you get?
>> Raja Sambasivan: Yeah, that's as close as we got. We're not going to
[indiscernible] to get, right. I know when I was at Google, they were
exploring options to actually instrument the kernel to capture things like
this.
>>: The network is slow because, you know, [indiscernible] big enough and
[inaudible].
>> Raja Sambasivan: I mean, you can imagine also, you know, if you care
enough, instrumenting those queues, right, and also capturing things like ping
times and things like that to figure out where your [indiscernible] product
should be, right.
So the next step of Spectroscope's work flow is -- well, to summarize so far,
I've shown how we obtained the request flow graphs from both a non-problem
period before a performance change and a problem period after the performance
change.
The next step of the work flow is to group similar performing requests from both periods into the same category. These categories are then used to identify mutations, or performance-affecting changes, by comparing per-category and per-period distributions.
In our case, we choose to define similar as requests with identical structures and topologies. That is, requests whose components execute the same functions, have the same amount of parallelism, and so on.
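One simple way such a grouping could be implemented (my own sketch, not necessarily Spectroscope's actual method) is to serialize each graph's structure, with all timings dropped, into a canonical key and bin on that key; note this toy version also ignores parallelism, which the real grouping accounts for:

    from collections import defaultdict

    def structure_key(graph):
        # Canonical string for a request's structure: which trace points it
        # hit and in what order, with all timing information dropped.
        return "->".join("%s:%s" % node for node in graph["nodes"])

    def categorize(graphs):
        # Bin requests with identical structure into the same category.
        categories = defaultdict(list)
        for request_id, graph in graphs.items():
            categories[structure_key(graph)].append(graph)
        return categories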
I'm not going to go into too much detail about why we chose this particular grouping, but there is a lot more detail in the NSDI paper about this stuff, and I can also field questions here.
So once we've categorized these requests into groups, so that similar requests from both periods are grouped into the same category, the next step of Spectroscope's work flow is to identify those categories that contain performance-affecting changes, or mutations.
>>: Sorry, can you back up one slide, make sure I understand. The
categorization step, you put every request into a bucket labeled with the
topology [indiscernible] request.
>> Raja Sambasivan: Pretty much, yes.
>>: So how can you ever have a structural mutation if everything's in the same
bucket? Mutations are -- is a difference between two things in the same
bucket.
>> Raja Sambasivan: So the question is how do you have structural changes if you're grouping requests with the same topology into the same category. I'll get
to this in a bit more detail later. But just a quick high level answer is
timing changes compare stuff in the same bucket, the same category. Structural
changes compare stuff across buckets.
>>: Oh, okay.
>> Raja Sambasivan: Diving right into that, the first type of mutations we
consider is response time mutations. And these are requests that are just
timing changes. They have the same topology, the same structure in both
periods. They visit the same functions. They access the same components and
so on. It's just that they've gotten slower. So to help developers diagnose
problems that manifest as this type of mutation, what we want to do is first we
want to identify that they exist. And second, we want to localize the problem
by showing specific interactions along these request flows that cause a timing
change. That slowdown.
Now, when comparing request flows to identify response time mutations and localize them, we can't really expect response times to be identical between both periods. There's always going to be some natural variance in timing between both periods. So one key thing Spectroscope has to do is to separate such natural variance from true distribution changes or mutations. And to do this, we use this [indiscernible] hypothesis test.
So these tests compare two distributions to determine whether there's enough evidence to reject the hypothesis that both distributions are, in fact, the same and just vary because of random effects. If there is enough evidence, usually enough to guarantee a false positive rate of less than about five percent, then the distributions are claimed different and we'll identify them as containing mutations.
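As an illustration of what that check might look like, here is a sketch using SciPy's two-sample Kolmogorov-Smirnov test as a stand-in for the non-parametric hypothesis test described above (the specific test is my assumption for the example):

    from scipy.stats import ks_2samp

    FALSE_POSITIVE_RATE = 0.05  # the roughly five percent bound mentioned above

    def contains_response_time_mutation(np_times, p_times):
        # Compare the non-problem and problem period response-time samples
        # for one category.  Flag a mutation only when the null hypothesis
        # (same distribution) can be rejected at the chosen false-positive rate.
        statistic, p_value = ks_2samp(np_times, p_times)
        return p_value < FALSE_POSITIVE_RATE

    # Example: problem-period requests are roughly ten times slower.
    non_problem = [0.10, 0.11, 0.12, 0.10, 0.13, 0.11, 0.12, 0.10]
    problem = [1.00, 1.10, 0.95, 1.20, 1.05, 0.98, 1.15, 1.02]
    print(contains_response_time_mutation(non_problem, problem))  # True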
So here is a picture of the response time distributions for one category.
And we see that the requests assigned to this category from the non-problem
period shown in the black line are very different from the distribution of
response times for requests assigned to this category for the problem period.
>>: [indiscernible].
>> Raja Sambasivan: Yeah, the [indiscernible] response time, right. So you
see that they have very different means and the variance around the means is
very small. They don't overlap a lot. So it's very clear that these are
different distributions. They represent different behaviors. So here, we'll
be confident enough to identify these as containing mutations.
The second category is very different, right. You see that the response time
distribution of requests assigned to it from the non-problem period, shown with
the black line, is very similar to the response time distribution for requests
from the problem period. Their means are very similar. They overlap a lot. The variances are very, very large.
Here, we won't be confident enough so we won't identify these as containing
mutations.
Now, a question you might have here is why not just use a raw threshold. Why not just say requests that are twice as slow in the problem period should be identified as containing mutations? And the answer to that is that using statistical tests limits the false positive rate. You are guaranteed a false positive rate of no more than five percent. So essentially, you're limiting
wasted developer effort by using these more robust tests.
>>: What test guarantees that?
>> Raja Sambasivan: So we're using a non-parametric hypothesis test. So we're just using a [indiscernible] test.
>>: Okay. Doesn't that test assume that all of the data points are IID?
>> Raja Sambasivan: Yeah, there are certain assumptions in there, right. So it does assume that the various request response times are, you know, independently and identically distributed, right. In practice, that's held up
better than other tests, like, for example, we don't do things like assume that
the distributions are normal, for example, or Gaussian. So there are some
assumptions thrown in there. But in practice, it's worked out.
Now, there are cases where that assumption doesn't hold, I do agree, right.
But in practice, it's been okay.
>>: So I mean, one perhaps important assumption you make here is that there aren't sources of [indiscernible] changes in the workload that cause all of your A measurements to be different than your B measurements. If the network suddenly got chattier the second day after you made your [indiscernible], it's going to shift a lot of these curves. Is there any statistical technique you can use to detect that, or do you need to, or do you want to?
>> Raja Sambasivan: Right. So I think there are multiple questions in there.
The first one is what happens [indiscernible] high level workload changes,
right. In that case, what this technique will show you is changes in the
workload or changes in timing and so on as a result of the workload change,
right.
>>: What if you say, oh, look, in every category there was a mutation. What
have you done? I mean, how will the system distinguish between changes that
are because there was a mutation in an underlying system and changes due to the
fact that there are shifts in the external factors.
>> Raja Sambasivan: Right. So an assumption of this work flow, and I'll get to this in a bit, is that the workloads are roughly similar between both periods. If they are very different, yeah, you would see effects due to workload changes. We do try and handle a bit of it by treating it as just extra load, without actually trying to identify it as extra load specifically. But in general, we do see some [indiscernible] in those periods.
All right. So here's an example of a response time mutation. There's a request that accesses the NFS server in Ursa Minor, reads data from the storage node and then replies. It's just a very simple request. We see it's much slower in the problem period as compared to the non-problem period. In the problem period, this request took about one millisecond, whereas in the non-problem period, it only took about 110 microseconds. So in addition to identifying this as a response time mutation, or as a category containing response time mutations, Spectroscope will iterate through the edge latency distributions seen here and will identify this, the work done in the storage node, as responsible for the timing change by applying the same statistical tests on the edge latency distributions.
So once again, for response time mutations, in addition to identifying them, the problem is localized by identifying the specific interactions responsible for the overall timing change.
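A sketch of how that edge-level localization could look, reusing the same kind of two-sample test per edge (again my own illustration, with hypothetical data structures):

    from scipy.stats import ks_2samp

    def localize_response_time_mutation(np_edges, p_edges, alpha=0.05):
        # np_edges and p_edges map an edge name to a list of latency samples
        # from the non-problem and problem periods.  Return the edges whose
        # latency distributions changed, i.e. the interactions most likely
        # responsible for the category's overall slowdown.
        changed = []
        for edge, np_samples in np_edges.items():
            p_samples = p_edges.get(edge)
            if not p_samples:
                continue  # edge not observed in the problem period
            _, p_value = ks_2samp(np_samples, p_samples)
            if p_value < alpha:
                changed.append(edge)
        return changed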
So the second type of mutation Spectroscope considers is structural mutations. These are requests that change in topology or structure. In the problem period, they visit different components, they execute different functions, they have different parallelism than in the non-problem period.
So to localize the root cause of these problems, what we want to do is identify what we call the precursors, or the anticipated path of the same request during the non-problem period. So before the performance change, what would this request have looked like structurally is what we want to find out.
And this is necessary for two reasons. First, it lets us identify the cost of a structural change. If you execute a different component, is that slower? How much of a performance change do you see as a result of that? That's something to identify. The second reason we want to identify the precursors is to see how these two differ or diverge, because that's the point where the developers should start looking when they want to debug this problem.
So to find a starting point for the diagnosis efforts, we want to identify where these request flows, precursor and mutation, start to diverge. So an example of a structural mutation precursor pair is from the metadata prefetching problem I showed you earlier in the talk, where the structural mutation is a request that accesses the metadata server and incurs many repeated database lock accesses, whereas its precursor, the flow before that change, right, is a request that accesses the metadata server but incurs many fewer of those database lock accesses.
Another example of a structural mutation and precursor might be requests that now must get their data from far off data centers. Somewhere, I don't know -- Vancouver is not that far, right? So somewhere in Asia, where before they were getting the data from somewhere in Vancouver, or maybe Redmond, right.
So once again, Spectroscope has to identify both the mutation and the precursor of the mutation; that is, the anticipated path of that request before the performance change.
To identify categories that contain structural mutations, we assume similar but not identical workloads were executed in both periods. And so what this means is that categories that contain more problem period requests than non-problem period requests must contain structural mutations, because those extra requests must have come from somewhere. They must have mutated from something else.
Similarly, categories that contain fewer problem period requests than non-problem period requests must have donated precursors, because they've decreased in frequency, so those requests have gone somewhere else. In this case, we actually use a threshold to differentiate natural variance from categories containing mutations.
So you might say a category that contains 50 more problem period requests contains structural mutations, and a category that contains, say, 50 fewer problem period requests than non-problem period requests contains precursors.
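A sketch of that counting rule, with the threshold of 50 requests from the example treated as a tunable parameter (my own simplification of the heuristic just described):

    def classify_structural_categories(categories, threshold=50):
        # categories maps a structure key to (non_problem_count, problem_count).
        # Categories that gained many problem-period requests contain structural
        # mutations; categories that lost many must have donated precursors.
        mutations, precursors = [], []
        for key, (np_count, p_count) in categories.items():
            if p_count - np_count > threshold:
                mutations.append(key)
            elif np_count - p_count > threshold:
                precursors.append(key)
        return mutations, precursors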
>>: If you think about your example, I mean, it sounds like you've actually
instrumented [indiscernible]. But if you just look at the network message
layer, then what you would, in fact, see is that the structural mutation or the
cases where you cache it on the front end, they went faster. And so what you
find is that the problem categories are the ones that didn't [indiscernible].
>> Raja Sambasivan: So can you clarify that question again?
>>: So if you had instrumented [indiscernible] and you didn't see this database lock over and over and over again, and you compare the precursor set of -- before you did your optimization set, which looked like that complicated path, I mean, even if you just skipped [indiscernible]. And then you looked at the piece after your optimization, there is a structural mutation. Because in the previous case you always went up to the metadata server and back. In the new one, you most of the time don't go up to the metadata server and back. So the structural mutation that you're chopping off is not --
>> Raja Sambasivan: Right.
>>: And the places where they vary, where the change has been introduced, your
performance is better. The structural mutation is positive. Which presses the
whole point of [indiscernible] optimization. And the real problem is that you --
>> Raja Sambasivan: We were talking about that specific example. So what you would see is that, yeah, there are extra cache hits, so you can
see that's faster, right. You would also see that when you did go to the
metadata server, the time you spent at the metadata server would be much
larger. So you would see both.
>>: You made the statement that here what you're going to do is you're going to find structural mutations and say this is the source of the problem. And, in fact,
the structural mutation at the network message [indiscernible] was not the
source of the problem. In fact, it was a benefit. Does that make sense?
>> Raja Sambasivan: Yes, you would see certain things have gotten faster. It's a structural change with a positive instead of negative performance impact, right. But we would also show you that there were certain other requests that have gotten slower, because when they do go to the metadata server, they take more time.
>>: In the first one, you would just ignore it.
>> Raja Sambasivan: In fact, it wouldn't show up in the ranked list. The second one is the thing that made it slower. And I think you bring up a bigger point. Depending on your instrumentation, timing changes turn into structural changes, right?
>>: [indiscernible] deeply enough if you had instrumented all the way [indiscernible] because it's deterministic in that layer?
>> Raja Sambasivan: Yeah, and so it is -- you can also imagine extending this. So say that you had a really large system. You couldn't afford to instrument everything outright at the start. You could imagine having timing changes --
>>: [inaudible].
>> Raja Sambasivan: But just within a large system, right, you might want to
start out with component level instrumentation like you started out, right.
And then --
>>: That's exactly what I stick to top level [indiscernible].
>> Raja Sambasivan: And you would expose additional information within
components as you saw timing changes.
All right, so --
>>: Before, about ten minutes ago, you sort of said thresholds are bad because [indiscernible]. And here, you're saying we'll just use thresholds.
>> Raja Sambasivan: I really wanted to use a statistical test here too. I
wanted to build a model of what types of structures we saw for various
workloads and use that, and I think you could definitely do that and there's a
straightforward way to do that. But I looked at a lot of this while I was at Google, and they told me their request structures change often enough because they're a bunch of uncoordinated teams changing things at different levels. I think the Bigtable guys might push an upgrade randomly.
They didn't think it was worth the effort to do that because they
didn't think the models would last long enough. So that's why I didn't do that
here. So is there another question?
All right, so once we've identified the [indiscernible] categories that contain structural mutations and the categories that donated precursors, we still have this mapping problem, right? We want to identify which precursors are likely to have donated how many requests to which structural mutation categories.
So we do this using a combination of three heuristics; we use a weighted assignment scheme that uses the three features shown on this graph. First, the overall type of the request, read versus write. So we assume that reads can't turn into writes, for example. That's a very basic assumption, right?
Second, we look at the count of requests in each category. So a precursor cannot donate more requests than it has decreased by between the two periods. Here NP shows non-problem, P shows problem. So this precursor can only donate about, what, 850 requests or so.
And this mutation can only accept up to whatever 917 minus 480 is there, right.
And third, structural similarity. We assume that precursors are likely to turn
into requests that look more like them, that the changes will be localized,
essentially. So we use these three features to create a weighted assignment of
how many requests the structural mutation category can accept from a given
precursor.
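Here is a rough sketch of how those three heuristics might be combined (my own simplification, not Spectroscope's actual scheme): request type and per-category counts act as hard limits, while structural similarity, approximated here with a plain string similarity over the category keys, determines each eligible precursor's share of the mutation's extra requests.

    import difflib

    def assign_precursors(mutation, precursors):
        # mutation: dict with 'key', 'type', and 'extra' (problem-period
        # requests gained).  precursors: list of dicts with 'key', 'type',
        # and 'donatable' (requests lost).  Returns {precursor key: requests}.
        # Heuristics 1 and 2: same request type, and something left to donate.
        eligible = [p for p in precursors
                    if p["type"] == mutation["type"] and p["donatable"] > 0]
        if not eligible:
            return {}

        # Heuristic 3: weight by structural similarity of the category keys.
        weights = {p["key"]: difflib.SequenceMatcher(
                       None, mutation["key"], p["key"]).ratio()
                   for p in eligible}
        total = sum(weights.values()) or 1.0

        assignment = {}
        for p in eligible:
            share = mutation["extra"] * weights[p["key"]] / total
            # A precursor cannot donate more requests than it actually lost.
            assignment[p["key"]] = min(round(share), p["donatable"])
        return assignment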
Here's an example of a structural mutation again.
Sorry.
>>: You said you make a weighted assignment. Does that mean that you
personally had to assign some arbitrary weights? Is there any meaningful way
to compare those, or you just kind of have to [indiscernible].
>> Raja Sambasivan: You have to take a bunch of features that -- so the
question is how did we pick these heuristics, basically, right?
>>: Just basically heuristic in terms of -- is there a sensitivity analysis
there?
>> Raja Sambasivan: So there isn't really, I mean, there isn't a perfect way
to do this. It isn't possible to figure out what likely turned into what else,
right.
>>: Is there even a way to know what the -- the assignments that have been
made are fairly insensitive to change or if you're --
>> Raja Sambasivan: Right, right. The only answer I can give you there is
it's worked in practice across all the problems we looked at. And the assignment is simple; it's a linear thing. So there are the three features, they're weighted equally, and if your precursor and mutation look very similar structurally, then, you know, the weight you give that particular relationship is higher.
>>: They can't be weighted equally because they're different units.
>> Raja Sambasivan: They're normalized equally. So I guess, sorry, yeah,
you're right. I mean, the only thing that determines how many requests are
contributed is the structural similarity, I guess. So sorry, I managed to open
up a -- so here, we had three features, right. The number of requests in each
category just determines the upper limit of how many requests can be
contributed.
Second, the type of the request, we assume that reads can't turn into writes, and
the third is structural similarity, right. So the first two are basically
limiting cases and the third thing determines how many requests you contribute.
So here's an example of a structural mutation and precursor pair, once again from the metadata prefetching problem that I showed you earlier in the talk, where the mutations were requests that accessed the metadata server and incurred many of these repeated database lock accesses, whereas the precursors were requests that accessed the metadata server and incurred many fewer of these database accesses.
And once again, for structural mutations, the root cause is localized by
identifying how the mutation and how the precursor differ from each other,
because this is what gives developers a starting point for the diagnosis
efforts.
So once we've identified the total set of categories that contain response time mutations and structural mutations, we still need to rank them by their contribution to the overall performance change. And ranking is useful for two reasons.
First, there may be more than one problem in the system. And in that case,
this ranking is useful to help focus developer effort on the problems that affect performance the most.
The second reason is because one problem can actually create many different mutations. So consider the case of a bug in a cache that causes more requests to miss in that cache in the problem period, right. So great, one mutation: requests that miss in the cache versus hitting that cache previously. But those same requests are going to go and affect the next level of resource in the system, perhaps the next level of cache, causing more misses in that cache. So you have a second structural mutation. So you see that there's a whole host of structural mutations introduced by one problem, and so to help developers navigate through this entire host of mutations, ranking is useful.
The ranking scheme is very simple. Intuitively, it's just the number of requests affected times the change in response time. So for response time mutations, it's the number of requests in the mutated category times the change in average response time between the problem and non-problem periods. For structural mutations, it's the number of requests affected, that is, the number of extra requests in the structural mutation category, times the change in response time between the structural mutation category and its precursor category.
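In code, that ranking heuristic is essentially one product per category; a minimal sketch:

    def rank_mutations(mutations):
        # Each mutation dict carries 'num_affected' (requests in the mutated
        # category, or the extra requests for a structural mutation) and
        # 'delta_response_time' (change versus the non-problem period or the
        # precursor category).  Rank by the product, largest contribution first.
        for m in mutations:
            m["rank_score"] = m["num_affected"] * m["delta_response_time"]
        return sorted(mutations, key=lambda m: m["rank_score"], reverse=True)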
>>: So the rank is not at all based on statistical significance that you got --
>> Raja Sambasivan: Statistical significance is used as a filter at the very first step, so if --
>>: So if something is under five percent, it's in, if it's over five percent, it's out?
>>: Other than that, it's just straight?
>> Raja Sambasivan: Yeah, so it's a top level filter, right. And I can talk a little bit about that later in this talk.
>>: Work flow. [indiscernible].
>> Raja Sambasivan: Yeah, it might be, actually. We can either talk about
this offline or later in the talk, but I have some slides on visualizing these
results and this comes out there.
All right. So far, I've described how we've obtained the non-problem period
graphs and problem period graphs. How they're categorized together into these
groups, how these categories are analyzed to see if they contain mutations,
either structural or response time, and how they're ranked together, based on
their contribution to the overall performance change.
The next step of Spectroscope's work flow is to present these results to the
user in hopefully a pretty user interface. Hopefully. So I believe
visualization is really important for automatic problem localization tools,
like Spectroscope. These tools don't get to root cause of a problem
automatically. Rather, they only localize the problem and they leave that last
logical leap of getting to the root cause to the developer.
So it's really important for them to present the results well. Here's a screen
shot, a real screen shot of one of three potential user interfaces we're
developing for Spectroscope. We just started working on it. But I wanted to
show you guys a real picture. The UI shows a ranked list of the different
mutations. The numbers are just IDs of the different mutations. But the key
thing is they're ranked according to their effect on the performance change.
Users can click any of them. Here, we selected the highest ranked one,
and the user can also select whether he wants to see the mutation or the
precursor. In this case, we selected the mutation. And the graph shown is the
request flow for the mutation.
The user can also decide to see the precursor instead. In this case, the graph will change. With this interface, we also have this nifty little start button that will quickly animate between the precursor and the mutation, so it will tease out the structural changes. In fact, I'm so excited about this work that I have another slide on it.
So let me start this thing here. So what you see here is the mutation for the
metadata prefetching problem that I've been using as an example throughout this
talk.
So the tool will show you the mutation and then users can scroll through it,
and they might see hey, look, there are a lot of these accesses here. You see
the green boxes, MDS lock acquired, MDS lock released, right. And it seems to keep going for a while, because we have these locks instrumented. And in the
end, there's some other activity, RPC to another component shown in yellow. We
come back to the metadata server, see all the metadata locks again.
So at this point, you might want to scroll back up to the top and say oh, look,
did this happen before, in the precursor, before this performance change? So
what we'll do in this case is maybe center ourselves on one of these locks, we
start the animation, and then we see that these locks just seem to disappear.
And what's interesting is that the request seems to get shorter. You see the other instrumentation points move up. So you might want to confirm the hypothesis by zooming out of it, which I think will happen in a second. See, yeah, look, the graph does get much shorter without those locks. Check it out. And so maybe this convinces you this should be where you start your diagnosis efforts for this particular problem.
All right. So that was the work flow for Spectroscope. Next, I want to talk about some case studies of using Spectroscope on real problems. And this is the really fun part of this talk. So we've performed eight case studies so far of using Spectroscope to diagnose real and synthetic problems in both Ursa Minor and Google.
Most of our experiences using Spectroscope are qualitative. They're
experiences of real -- they're real experiences of developers debugging real
previously unsolved problems using the tool. But we also wanted to get a
quantitative sense of Spectroscope's utility. So for the Ursa Minor case
studies, we compute two metrics over the ranked list of results that it
generates. First, the percentage of all the results it generates that are
relevant or useful for debugging the problem. Second, the percentage -- the
first is the top ten results. The percentage of the highest ranked ten results
that are useful for helping developers debug the problem. And second, the
percentage of all results that are useful for helping developers debug the
problem in hand.
The first metric is clearly the most important. Developers are going to investigate the highest ranked items first, so it's important that those be useful. It's kind of like searching on Google. But the second gives that overall idea of the robustness of the output returned by the tool.
And I just want to be full [indiscernible] about the word relevant here. So
when I say relevant for problems we haven't diagnosed before, what we do is we
have a developer or we ourselves would go in and use Spectroscope to localize
the problem to just a few components, a few interactions and use that to get to
root cause. And once we identify the root cause, we go back and say oh, look,
here are all the results that Spectroscope generated. Did they lead us in the right direction? Did they identify the right changes that were on the track towards the root cause?
So here are the quantitative results for the six problems we diagnosed in Ursa
Minor. The X axis shows the different problems. And the yellow bar shows a
percentage of the top ten results that are relevant. And the blue bar shows a
percentage of all results that were relevant.
And you see here, in most cases, our results were pretty good. For the configuration problem, you see 100% of the top ten results returned were relevant for helping debug the problem, and 96% of all results were relevant, shown by the blue bar here.
In the worst case, I believe only 46% or 50% of the top ten results were relevant, and only about 40% of all results were relevant in debugging the problem. But even in that case, the top ranked result was something that was useful in helping developers get to root cause.
Once again, it doesn't really matter if you guys know the names, what these
names mean, just that they're different problems. And this last problem, the
spikes problem here has a not applicable in front of it because it was an
example of using Spectroscope to show that nothing essentially changed and
telling developers that they should really focus their debugging efforts
elsewhere, in the environment, for example.
>>: [inaudible].
>> Raja Sambasivan: For the spikes one? So it was an interesting case study. So what happened in this problem is that we noticed that every two weeks or so, one of our benchmarks that we ran, we run these nightly benchmarks in Ursa Minor, every two weeks, one [indiscernible] would run much slower. And only every Sunday, every other Sunday, right.
And so we used Spectroscope to compare request flows between a slower run of
that benchmark and a normal run. And it generated nothing. Like there were no
timing changes internal to the system. There were no structural changes, so the axes we were considering for changes showed no difference.
And so we thought okay, maybe let's take Spectroscope at its face value here and say maybe it isn't the system itself. Maybe there's something else going on. And we eventually correlated it to scrubbing activity on the filer that the client was running on. So what was going on was that CPU activity was causing requests to be issued more slowly from the client, and the arrival time of requests was different.
All right. So as I said, our experiences with using Spectroscope are qualitative; they're examples of real people, real developers using
Spectroscope to diagnose real problems. So I want to give you guys an example
of a work flow of how Spectroscope fits into the daily routines of developers.
And so I've actually already described the prefetching problem here and that's
the case study I've been using throughout this talk.
I'd also like to talk about this configuration problem. I want to point out
before we move on that all of our Ursa Minor case studies used a five component
configuration of Ursa Minor with one client, one NFS server, one metadata server and two storage nodes. One storage node was used for storing data and the other one was dedicated to the metadata server for storing metadata.
All right. So at some point a developer of Ursa Minor committed a pretty large piece of code that caused a huge slowdown in a variety of the benchmarks we ran on the system. So the question is what was going on here? We used Spectroscope to compare request flows from after the regression to before the regression, and we found that Spectroscope identified 128 total categories of mutations.
It turns out most of these categories showed the same structural change, where in the non-problem period, in the precursor request flows, the metadata server was accessing its own dedicated storage node to read and write
metadata. In the problem period, the metadata server was instead using the
data storage node for most of its accesses and this was increasing the load in
the system slightly, and so performance slowed down.
So this is great. We've used Spectroscope and localized the bug to an
unexpected interaction between the metadata server and the data storage node.
But you know what? It's still not clear why this problem is occurring. This
isn't the root cause, right? Why is the metadata server all of a sudden using
the wrong storage node? And so we had this hypothesis that the mutations, the change in the storage node accessed, were perhaps caused by a change in low level parameters. Perhaps the client had changed some configuration and was sending some function call parameters that simply said use a different storage node; perhaps there were changes in the RPC parameters or in function call parameters somewhere that were telling us to use a different storage node.
And so to investigate this hypothesis, we used a feature of Spectroscope that I haven't described yet; I'll quickly go over it now. What this feature lets you do is pick a structural mutation category and its precursor and identify the low level parameters -- function call parameters, client parameters and so on -- that best differentiate the two.
In order to do this, we use a very classic machine learning technique,
regression trees.
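As an illustration (my own sketch, in which a scikit-learn classification tree stands in for the regression trees just mentioned), you could fit a shallow tree to separate precursor requests from mutation requests and report which low level parameters it splits on:

    from sklearn.tree import DecisionTreeClassifier

    def differentiating_parameters(precursor_params, mutation_params, names):
        # precursor_params and mutation_params are lists of low-level parameter
        # vectors, one per request, with columns in the same order as `names`.
        # Fit a shallow tree that separates the two sets and report the
        # parameters it chose to split on.
        X = precursor_params + mutation_params
        y = [0] * len(precursor_params) + [1] * len(mutation_params)
        tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
        used = set(tree.tree_.feature) - {-2}  # -2 marks leaf nodes
        return [names[i] for i in sorted(used)]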
So when we picked the highest ranked structural mutation category and its precursor category and ran them through this feature of Spectroscope, we found that Spectroscope immediately identified changes in the fault tolerance scheme, the encoding scheme for data, and things like that.
It turns out that these parameters are only set in one file. It's this really obscure configuration file that had never changed, except that the developer had changed it just for the purpose of testing and had forgotten to change it back. The interesting thing about this is that this configuration file does not live in the version control system, so no one even knew it had changed.
So the root cause was traced back to a very obscure configuration file in the system.
There are seven more case studies in the NSDI paper we did. If you guys are
interested, I encourage you to have a look at it. It's really great, because
it reads like a mystery novel. So if you guys like mystery novels, this is
really the paper to read.
But before I conclude, I want to talk a bit about my future research agenda.
All right. So to motivate my future research, I want to place the work I've
done so far in the context of modern data centers and clouds. So we all know
these environments are growing rapidly in scale. They contain many complex
distributed services that depend on other complex distributed services to run.
So, for example, at Google, a distributed application depends on configuration, a table store such as Bigtable, a distributed file system such as GFS, the scheduling software that decides where to place different applications, the authentication mechanism, the network topology and so on.
So if you guys thought problem diagnosis was hard so far, say based on this
talk, it's only getting harder.
And so a lot of experts believe that because of this increasing difficulty, we need -- there's this need for creating self-healing systems that are capable of fixing a lot of problems automatically without human intervention.
And this also forms the basis of my long-term research goals. So here's my
road map to creating these self-healing services. There are at least three steps, right? The first step is to continue to create techniques that help
developers diagnose and understand complex distributed services.
For example, by automatically localizing the source of a new problem to just a few relevant components, right. Developers possess a large amount of knowledge and insight, which they synthesize to diagnose complex problems. So hoping to replace them immediately just isn't a feasible approach.
The second step is, I like to call it, the yang to the first step's yin, I guess, in that what we all seem to need to do is learn how to build more predictable systems. The first step says learn how to debug and understand
complex systems. The second step says hey, look, there might be simple things
we can do that can help us build systems that are easier to analyze and easier
to understand.
A simple example is that many automated problem localization tools like Spectroscope expect variance to be bounded or limited in certain axes to work. The system's predictable in this respect. And so a tool that helps developers find unwanted sources of performance variance in their systems and potentially remove them if they think it's appropriate would be something that would help these automatic localization tools work so much better. That's just one simple idea there.
And the third step is to take the techniques that work best from creating these tools to help developers and use them as a substrate on which to build further
automation. For example, automation that takes automatic corrective action,
which is the next big step towards self-healing systems.
A large portion of my graduate career so far has focused on this first step. Request flow comparison, my thesis work, is the obvious one. I also studied what knobs and data need to be exposed by a distributed system in order to enable better tools and more automation, in the context of building a distributed storage system called Ursa Minor. For example, I helped build the end-to-end tracing system that was used in Ursa Minor for a lot of these case studies.
I helped create parallel case, which is a tool for helping identify
dependencies between nodes in parallel applications. And then I also helped
create a tool for predicting the performance of workloads that move across
storage devices, which was our work at SIGMETRICS a few years ago.
But there's still a lot more to be done here. I think one key area is tools
for automatically localizing problems in what I call adaptive cloud
environments. These environments essentially schedule workloads so as to
increase utilization of the data center as much as possible. To satisfy
global constraints, they may move a running workload from one set of machines
to another, which may be running a completely different set of applications
than the set it was running alongside before.
If the amount of resources available in the system is high, because there are
not many other things running, they may give you a lot of resources. They
may give you a lot of CPU. They may give you a lot of memory, only to steal
back those resources the minute some other application needs them.
So all these things are great. They increase utilization in the data center a
lot, but they do so at the cost of predictability and variance. And a lot of
existing localization tools out there are simply not as useful in these
environments.
And so I think that for these environments, a new class of tools will be
necessary, and these tools will have to have intimate knowledge of the data
center scheduler's actions. They'll need to know when the scheduler decides
to move an application from one set of machines to another, so as to know that
a performance change is expected and how much of a performance change to
expect. For similar reasons, they'll have to have knowledge of the data
center topology and the network. And so I'm interested in building these
tools in the future.
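To make this concrete, here is a minimal, entirely hypothetical sketch of how
such a tool might consult scheduler events before flagging a change; the event
format, slack factor, and time window are my own assumptions, not any existing
scheduler's interface.

    from dataclasses import dataclass

    @dataclass
    class MigrationEvent:
        workload: str
        time: float               # seconds since epoch
        expected_slowdown: float  # multiplicative slowdown the scheduler predicts

    def explained_by_migration(workload, change_time, observed_slowdown,
                               events, window_s=300, slack=1.2):
        """Return True if a recent migration already accounts for the slowdown."""
        for ev in events:
            if ev.workload == workload and abs(ev.time - change_time) <= window_s:
                return observed_slowdown <= ev.expected_slowdown * slack
        return False  # no nearby migration: the change is worth flagging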
I'm not going to talk too much about the second step here. Instead, I want to
spend a quick slide on the third step. So to get us to where systems are
taking automatic corrective actions, I think a bunch of sub-steps or courses of
action are necessary. First, we need to catalog actions that developers take
when diagnosing performance problems. Some actions might include moving
workloads to different machines, reverting code changes, adding resources, and
so on.
Next, we need to find good machine learning techniques to help map between
these problem symptoms and potential actions.
And third, and I think this is the most important step, we need to create
models that will help us estimate the cost and benefit of taking these various
actions. For example, would you pick an action that is guaranteed to fix a
problem, say in Amazon EC2, right, but will result in guaranteed downtime? Or
would you pick an action that may or may not fix the problem, say it's 30
percent likely to fix the problem, but will result in degraded performance
for, let's say, six hours or three hours or so? How do you choose between
these different options? I think various models, economic scenarios, and so
on will be necessary to help us gauge these different potential actions. And
I think once we have some initial work here, we will really be on the path
towards creating tools that take these actions.
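As a rough sketch of the kind of cost/benefit reasoning described here, the
following compares two hypothetical actions under a simple expected-cost
model; the cost units and probabilities are illustrative assumptions, not
measurements.

    def expected_cost(p_fix, cost_if_fixed, cost_if_not_fixed):
        """Expected cost of an action, given its probability of fixing the problem."""
        return p_fix * cost_if_fixed + (1 - p_fix) * cost_if_not_fixed

    # Action A: guaranteed fix, but one hour of downtime (say 60 cost units).
    # Action B: 30% chance of a fix; six hours of degraded performance (say 30
    # units) either way, plus the lingering problem (say 100 units) if it fails.
    cost_a = expected_cost(1.0, 60, 60)          # 60.0
    cost_b = expected_cost(0.3, 30, 30 + 100)    # 0.3*30 + 0.7*130 = 100.0
    print(cost_a, cost_b)  # under these assumptions, action A is the cheaper choice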
So I'm done with this talk. To quickly summarize, I described request flow
comparison, my thesis research. I showed the effectiveness of request flow
comparison by showing how this is used to help diagnose real, previously
undiagnosed problems in two systems, Ursa Minor and certain Google services.
I'm also happy to report that one startup company I've talked to has actually
implemented request flow comparison and is considering using it in a product,
which I thought was really cool. And finally, I described my plan for
achieving the goal of self-healing systems of which request flow comparison is
the first step. I believe future research in this path will require strong
collaborations between systems people like myself, machine learning experts,
statistics experts, and visualization folks. And so I look forward to working
with them towards this larger goal. Thank you.
>>: So would you talk about your decision to use the statistical significance
--
>> Raja Sambasivan: Oh, yes, I forgot about that. So yes, you're right, the
initial Spectroscope ignores any of these timing changes that aren't
statistically significant, right. And it turns out that sounded like a
good idea when we started, and I still think it's a reasonable approach.
One of the things we ran into involved the visualizations. We actually ran a
user study with the different visualizations to compare them.
And what we found is that statistical significance is weird. You could have
something that has a large timing change but really high variance, and that
isn't statistically significant. You could have something that has a really
small timing change but really low variance, and it is statistically
significant.
So what would happen is that the users would see some of these things that
weren't identified as statistically significant and so weren't automatically
marked by the tool. And because they weren't experts in statistics, they would
start wondering why the tool is showing this change but not that change.
And so we started out saying, like, false positives are really important, we
want to avoid them. It turns out false negatives are really important too,
because they convince people that perhaps their mode of thinking is not in line
with the tool's and maybe they're doing something wrong or maybe the tool is
doing something wrong.
So I think moving forward, we need to modify the test we use a bit to account
for these issues.
>>:
Modify in what way?
>> Raja Sambasivan: Well, you can imagine weighting our statistical
significance test so that if, for example, you do see a timing change that is
really large but has high variance, maybe you identify that anyway. So maybe
you have a continuous spectrum of normalized values that says, yeah,
statistically this isn't significant, but accounting for the magnitude of the
change, I want it identified anyway.
Conversely, you might not identify really small timing changes even if they
are statistically significant. So I think some kind of normalization factor
that accounts for the magnitude would be useful.
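A minimal sketch of this idea, assuming a non-parametric two-sample test
(Mann-Whitney U from SciPy here, not necessarily the test Spectroscope
actually uses) combined with an explicit magnitude threshold; the threshold
values are illustrative only.

    import numpy as np
    from scipy.stats import mannwhitneyu

    def flag_edge(before, after, alpha=0.05, min_delta_us=100.0):
        """before/after: latency samples (microseconds) for one edge in two periods."""
        p = mannwhitneyu(before, after, alternative="two-sided").pvalue
        delta = abs(np.median(after) - np.median(before))
        # Flag if the distributions differ significantly OR the median shift is
        # large, so big-but-noisy changes are not silently dropped.
        return {"flag": p < alpha or delta >= min_delta_us,
                "p_value": p, "median_delta_us": delta}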
>>:
Should that happen with a non-parametric test?
>> Raja Sambasivan: Why not.
>>: For instance, if I'm using a [indiscernible] to compare two sets, all it
cares about is the fraction of one set that's greater than the other set. So
you take two, A is greater than B, and so on and so forth. So it's not so much
a matter of the variance as whether one set tends to be greater than the other
set. Seems to be --
>> Raja Sambasivan: Yeah, so that could be true. But then the magnitude of
that change would still matter. I might be interpreting your question
incorrectly, but it seems like you could still have lots of timing changes that
are very small, that perhaps you wouldn't care about, and still one set would
be greater than the other set. It's just that those changes were small enough
not to really matter.
>>: A non-parametric test is one where you're not assuming any distribution.
Variance is generally something that you see in a test that assumes a normal
distribution.
>> Raja Sambasivan: I don't think I agree with that. But we should take this
offline, I guess.
>>: So what fraction of the real diagnoses that you guys played with required
the structural mutations? Was one of those [indiscernible] or were they both
critical?
>> Raja Sambasivan: That's a good question. So the question is, were there
things with more structural mutations or response time mutations. In Ursa
Minor, a system still in development, we saw a lot of structural changes and I
think that's because things are changing often. People are putting in
different algorithms, different cache replacement policies and so on.
At Google, we saw more timing changes, but also their tracing [indiscernible]
wasn't as detailed as the one we had in Ursa Minor, so it might be that it was
just masked. So it's hard to say. One thing we found is that, in this talk, I
talked about structural mutations and response time mutations as being
distinct entities.
In reality, you usually have structural mutations and timing changes together.
>>: In the same category?
>> Raja Sambasivan: In the same category, right.
>>: [inaudible].
>> Raja Sambasivan: Yeah, that's right. In fact, one approach I've been
thinking about for instrumenting larger systems: we instrumented HDFS using
end-to-end tracing, and we found that because of the granularity of the writes
that are issued, which are usually like 64 or 128 megabyte chunks, which are
then broken down into like 64 kilobyte packets and pipelined to three
different replicas, we get really large [indiscernible] from them, because the
raw instrumentation, if you want to capture low-level activities, captures
every 64 kilobyte packet. So we end up with graphs that natively have like
5,000 or 6,000 nodes for every single write.
So among the approaches we're considering is this zooming approach. We
actually explicitly exploit this notion that timing changes turn into
structural changes. You start out --
>>: You turn the knob at [indiscernible] time rather than at collection time?
>> Raja Sambasivan: Or the tool automatically does that. It starts out with a
much lower granularity graph that is relatively small, finds changes in
specific areas, and zooms in on just those areas to expose detail.
>>: Exposes structural changes.
>> Raja Sambasivan: Yeah, that's exactly right.
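A minimal sketch of that zooming idea, under data structures of my own
invention (the Node type and its fields are hypothetical): detail is expanded
only where the coarse-grained timing has changed, so most of the graph stays
small.

    from dataclasses import dataclass, field
    from statistics import median
    from typing import List

    @dataclass
    class Node:
        name: str
        latencies_before: List[float]  # per-request latency samples (ms), earlier period
        latencies_after: List[float]   # same, later period
        children: List["Node"] = field(default_factory=list)  # finer-grained detail

    def zoom(node: Node, threshold_ms: float = 1.0, depth: int = 0) -> None:
        """Report and expand only parts of the graph whose timing changed noticeably."""
        delta = median(node.latencies_after) - median(node.latencies_before)
        if abs(delta) < threshold_ms:
            return  # nothing interesting here; keep the coarse view
        print("  " * depth + f"{node.name}: median latency changed by {delta:+.1f} ms")
        for child in node.children:
            zoom(child, threshold_ms, depth + 1)  # a timing change may become structural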
>>: [indiscernible] basic performance issue can be very complicated. In some
of the cases, [indiscernible] depending upon workload and [indiscernible].
>> Raja Sambasivan: So you're saying the path depends on [indiscernible].
>>: Yeah, [indiscernible] performance issues and really fine [indiscernible]
performance issue because of the workload characteristics [inaudible] cases we
see is [indiscernible]. Can you comment on how your tool is going to help
[indiscernible]?
>> Raja Sambasivan: Let me clarify the question a bit. Are you talking about
dependencies between requests, or are you talking about large writes --
>>: I'm talking about, maybe in the [indiscernible] systems, depending on what
kind of load, the system's performance may be different.
>> Raja Sambasivan: Right.
>>: And there may be a slight change of the load that, from the
[indiscernible] point of view, looks similar, but actually [indiscernible].
>> Raja Sambasivan: Right. So yeah. Let me answer the first part of the
question, which is about differing load. One of the things we looked at adding
to these things is a notion of simple queueing theory. So you can imagine
saying that the expected response time of different types of requests should
scale linearly or sub-linearly with load up to a certain point, at which it's
bound to increase rapidly.
So we looked at incorporating that model and not identifying something as a
mutation if the load is greater and the response time has only increased by a
relatively small amount that's accounted for by the queueing model. So
that's one way of accounting for small load changes.
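As a minimal sketch of that kind of check, assuming a simple M/M/1 queueing
model (my illustration, not necessarily the exact model considered): an
observed slowdown is not flagged if the higher offered load alone explains it.

    def mm1_response_time(service_time_s: float, arrival_rate_rps: float) -> float:
        """Expected response time R = S / (1 - rho), where rho = lambda * S."""
        rho = arrival_rate_rps * service_time_s
        if rho >= 1.0:
            return float("inf")  # saturated: latency grows without bound
        return service_time_s / (1.0 - rho)

    def explained_by_load(observed_s, service_time_s, new_rate_rps, slack=1.2):
        """True if the observed latency is within slack of what the model predicts."""
        return observed_s <= slack * mm1_response_time(service_time_s, new_rate_rps)

    # Example: 5 ms service time, load rises to 150 requests/sec, so the model
    # predicts ~20 ms. A 20 ms observation is explained; a 60 ms one is not.
    print(explained_by_load(0.020, 0.005, 150))  # True
    print(explained_by_load(0.060, 0.005, 150))  # False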
I'm not sure if I got the second half of that question, or maybe I even --
>>: It may be a real basic problem. [indiscernible] if a system is sending
[indiscernible], if you're sending mostly large reads, when you're sending
small plus large reads, there are big performance issues for the large reads
[indiscernible].
>> Raja Sambasivan: So yeah, so you're talking about dependencies between
things, right. Resource contention is another example of this, right. So
certain slowdowns might not be caused by the system software. They might be
caused by another request from another client that is trying to access that
same resource. The question is what do you do there.
This tool currently will not identify that as a root cause. It will show you
that something has slowed down. It won't say it's because of some other
request. We have looked at -- that's actually one of the key pieces of future
work we're looking at: how do you deal with things like resource contention?
Could you extend the tool so that it says here's a slowdown, here's a
structural change, and the reason it occurred is because there are other
requests; this other client is trying to use the same resources or is
interfering badly with this request.
>>: Your answer is only [indiscernible] resource. There are other issues
beyond resources. [indiscernible] has proven itself to be complicated.
Depending on the [indiscernible] workload put in, and also depending on the
physical characteristics of the underlying device, there may be issues. And
maybe the issue only [indiscernible] specific combinations.
>> Raja Sambasivan: Yeah.
>>: And that combination actually is [indiscernible].
>> Raja Sambasivan: An example of this is configuration. I think what you
really want there is a tool that explores the different configuration options.
This tool will not identify that automatically for you. It won't say that
these specific combinations of things, if you ran them, would be bad. You'd
have to explicitly run them and then look and say, oh, they're bad. That's an
interesting space of research there that this tool doesn't cover.
>> Jon Howell:
One more question.
>> Raja Sambasivan:
Sure.
>>: I'm trying to figure out the purpose of this statistical significance
testing rather than, say, looking at the change of where the fraction of my
total time has gone. And I'm thinking about the situation in which you've got
a caller that maybe takes a million units of time to do something and something
that it calls that takes one unit. And if we realize that we can make a change
where the callee is now going to go from one unit of time to ten, but we're
going to save 100 units of time on the caller, so it's a net win by a factor of
ten. But when you look at the callee, it's gone from one unit to ten units, a
big, statistically different change. The caller is going to go from a million
to a million minus a hundred, so that's going to be lost in the noise. So the
statistical test will say we've just made things a lot worse in the callee, and
the caller's improvement will be lost in the noise. And something else comes
along with this improvement that screws something else up somewhere, so that it
looks like the system is now performing worse.
Won't this send me to exactly the wrong place?
>> Raja Sambasivan: Well, no, because first of all, we're running the original
[indiscernible] test. I'm not sure I'm answering this question correctly, but
my answer would be that we're running the original statistical test on the
end-to-end timing, right. So the end-to-end time is a sum of all the other
times. So we've identified that request flow as being a mutation or having a
change because of the end-to-end timing change.
And then we're looking at each of the individual components, each of the
individual edges. We're running the same test, and we rank the edges or
interactions that are statistically significant by their timing change. So you
say this thing changed by a thousand units, and there are these ten
interactions within that request flow that changed statistically, and these are
the ones that affect that end-to-end timing the most.
And so you do have this drill-down process, I guess. So you're initially
looking at the entire coherent picture and then you're looking at individual
pieces of it and seeing how they contribute to the overall picture. So there's
a correlation between all the times, I guess. Maybe I'm not answering the
question.
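As a minimal sketch of the drill-down just described, with hypothetical data
shapes (a dict of edge name to before/after latency samples) and a
non-parametric test standing in for whichever test is actually used:

    from statistics import median
    from scipy.stats import mannwhitneyu

    def significant(before, after, alpha=0.05):
        """Two-sample test on latency samples; the choice of test is an assumption."""
        return mannwhitneyu(before, after, alternative="two-sided").pvalue < alpha

    def drill_down(e2e_before, e2e_after, edges):
        """edges: dict mapping edge name -> (before_samples, after_samples)."""
        if not significant(e2e_before, e2e_after):
            return []  # the request flow's overall timing did not change
        ranked = [(name, median(after) - median(before))
                  for name, (before, after) in edges.items()
                  if significant(before, after)]
        # Edges that contribute most to the end-to-end change come first.
        return sorted(ranked, key=lambda pair: abs(pair[1]), reverse=True)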
>>:
I do not see where you're running the test.
>> Raja Sambasivan: So the tests are run at multiple levels. They're run on
the end-to-end time, which incorporates all the other times, and they're also
run on all the pieces that build that up. So you can --
>>: If the end-to-end time of this request were bad because there's something
messed up elsewhere, then you go in and look at it?
>> Raja Sambasivan: This tool will not identify any of the edges. If there
are cross dependencies like that, like, for example, resource contention might
cause this, what might happen is you identify the request flow as having a
significant timing change, but none of its specific interactions might be
identified as contributing to it. That might be one use case. More likely,
there will be a bunch of edges that have changed as a result of contention.
At a high level, what you're saying is you might see the end-to-end timing
change, but none of the interactions within it seem to have changed because
something else is causing the problem. That's the high-level statement you're
making, right? Is that the case you're considering?
>>: We'll take it offline.
>> Raja Sambasivan: Okay.
>> Jon Howell: Thanks.