>> Jay Lorch: Thank you, everyone. Thank you for coming. We're
excited to be joined today by Ari Rabkin, who is going to be
interviewing for no less than three positions here: researcher
in distributed systems, cloud computing storage, and RiSE. Ari
got his Ph.D. in 2012 from UC Berkeley and has since then been
flourishing as a postdoc at Princeton, where he's gotten an NSDI
paper, a HotOS paper, and a best paper award at OOPSLA. To
give you some background, he's a systems researcher, but with a
lot of background you don't often see: background in static
analysis and other programming languages techniques, and in
sociological methods. He's driven by the desire for real world
impact as evidenced by the widespread adoption of Chukwa, his
tool for gathering and analyzing logs from large distributed
systems.
So please join me in welcoming Ari.
>> Ariel Rabkin: Hi, folks.
>>: What does Chukwa mean?
>> Ariel Rabkin: The Yahoos picked the name. They assured me
that it was, in Indian myth, the turtle that supported the
elephant that supported the world, which seemed like a good name
for a Hadoop spin-off. I cannot vouch for the anthropology of
this. I just repeat what I'm told.
Let me give you a sort of high level view of what I do and why I
do it. So my goal as a researcher is really about making
distributed systems easier to use. That we have very complicated
software, and every day we are finding newer and cleverer ways to
build complicated systems. We still have users. I would like to
bridge this gap.
Let me give you a sort of high level view of, therefore, what I
think systems research in particular is. We have problems. We
have many problems. We solve those problems by building software
systems. We have users. Our users interact with our software
systems. This gives them new problems. I have worked sort of
throughout this space. I've done some work on analyzing the
things that make systems break, that is, failure analysis. I've built
various software systems to meet various needs. Jay mentioned
Chukwa. I will tell you about JetStream, which is appearing here
again in Seattle in a month. I've done a lot of work on
configuration debugging, which is about fixing the gap between
the system and the user, and I've looked a bit at the language
adoption problem, which is really about how our users then adopt
technologies that change the set of problems they have. And I've
done a little work on system education, let us say. How to teach
people about new systems tools.
Here today, I'm really talking about JetStream and I'm talking
about the configuration debugging work. It will be about two
thirds, one third. Let me now give some technical context.
Once upon a time, when people talked about big data, the thing
that they had in mind was web services. And web services can
have a lot of data. It can be petabytes of data. Typically, at
some centralized site or small group of sites. These days, when
people talk about big data, really this means automatic data
collection. That we have smart sensors in our houses. We have
very smart sensors in our pockets. All of our systems produce
log data. Every cyber physical system, be it a highway or a
power grid, produces log data. I was alarmed to discover that
the latest generation of soda fountains from Coca Cola are
transmitting a continuous stream of data back to Atlanta about
what are people drinking and how much.
So everywhere you look and some places you didn't think to look,
there are sensors that are producing quite substantial volumes of
data. It could be exabytes. And a thing to note about this data
is that it is dispersed: when it's created, it's at every
telephone, at every smart meter, at every soda fountain
throughout the world. And the software with which we process it
is, therefore, also changing. There was a time where when people
thought about big data, the people, at least the ones who weren't
at Microsoft, thought about Hadoop. The mascot is an elephant.
It's large, it's slow, it's powerful. It will stomp on your
data, but it won't be really very nimble and it won't be very
flexible.
These days, we have a very much richer software stack that you
have, if you are in the open source world, a very wide variety of
choices of storage layer. This is going to be really about the
open source view of the world. I understand that large companies
have often a smaller and more cohesive stack but even in a large
company, you will interact with outsiders. Those outsiders have
many choices of storage system. I could have drawn this going
off to the wall. Once you have your data stored, you have your
choice of how to process it. There's a wide variety of execution
tools. Again, I could have drawn this out a long way. These
tools can be low level. People build higher level languages and
tool sets on top of them for processing. You could, again,
imagine there's more. And even above this, there's a
management layer where you have sort of coordination tools,
things like ZooKeeper, right, which is a distributed lock manager
and coordination service. So we have this immense set of
software tools that we now have available to us to process our
data. And these things must somehow be stitched together, right.
For every pair of these, you have an interface problem where you
have to make sure that these systems talk. And there's one other
thing I want to draw your attention to, which is that we have a
different user population than we are used to. There was a time
when people's notion of who was using our software systems and
who was processing big data was technical experts who were
qualified to write high performance programs. That is not the
case any more. These days, the hot new thing is data science.
People say it's, you know, the sexiest job of the 21st century.
These people are not programmers, really. These people are
somewhat of a programmer, somewhat of a statistician, somewhat of
a domain expert. They're not people who want to spend their time
thinking about caching and about data locality. They're people
who have a query and they want to run their query and move on
with their lives. So this is a different population than we are
used to building systems for. The original MapReduce, Azure,
what have you was targeted at expert system developers. That's
not the audience of the future.
Let me now sort of put these trends I've outlined together. The
scale of data is going up. The technical sophistication is going
down. Resource management, that's going to be a bigger problem
as a result: as the humans are less technically focused,
the system has to do more of that. And likewise, as the systems
are more complex, configuring it, that's going to have to be a
bigger challenge, right, that users are going to need help
stitching all of this software together. It gets more
complicated as they get less interested in it.
I'm going to start by talking about resource management. As I
mentioned, there was a time when people's data lived in a
centralized datacenter, and it would feed in a little bit as
users interacted with it, but mostly the data lived in one place.
That is not the world of today. That approach does not scale.
Therefore, we're going to be in a world where there's sort of
dispersed data throughout the world. I drew four. You could
imagine 400 or 4,000. The data will be, I think, progressively
dispersed, since our ability to generate it is growing far faster
than our ability to move it.
And this brings me, therefore, to JetStream, which is a system that
I and my colleagues at Princeton have been building for analytics
in this space. Our goal is analytics; that is to say, not
transaction processing. The assumption is you have data coming
in. You have queries, some of which you've had for a while, some
of which are new. You'd like to query this data. And the thesis
I put to you is that we're going to need new abstractions in this
space. That we need not only a system but a set of abstractions
that let us and let users reason about what to do.
And our goal was to design a system that would be flexible enough
to cope with a bunch of domains. You should be able to use it
for your logs. You should be able to use it for your digital
video streams if you have cameras pointing at highways. Your
sensor data, what have you. One system that has the right
abstractions for all of these.
And just as a motivating example, let's talk about content
distribution networks. You have many sites. Users are making
requests. For each request, you might save, perhaps, a kilobyte
of data about what they got and how quickly they got it and
statistics about the transaction and the request. And you might
have a simple question like how popular are my websites. And so
in a naive world, what you would do is just back haul all that
data, you would just copy it and then analyze it in one place
with your favorite analysis tool.
There's a problem, and the problem is that you actually don't
have enough bandwidth for that. You should imagine that, in the
real world, your needs are going
to be sort of diurnal as the request load ramps up and ramps
down. The amount of bandwidth you have will be, let's pretend,
constant over time. In fact, it might be worse. It might be
that the amount of bandwidth you have is inversely related to how
much you need to back haul your analysis data.
And this invites a question of what happens here? And you have a
sort of buyer's remorse problem where you bought more bandwidth
than you are really using. Bandwidth is expensive. You pay for
it typically not per byte, but in terms of the high percentiles
of your need or else in terms of flat monthly fee, and the
consequence is if you aren't using it, you wasted your money
paying for it. There's another problem, which is that sometimes
your system might produce more analysis data than you can copy
right then and there, and you have a sort of, let's call it,
analyst's remorse problem where there's data you wished you had
and you don't have it. And now, this is a problem. And I want
to talk a little bit more about that problem. What actually
happens in that case? You have some bandwidth. You have some
need. What's going to happen here? And in particular, I want
you to think about latency. And I want you to all spend a moment
and think about what this graph will look like. It's going to
look like that. That while there's enough bandwidth, the system
will be okay. As there stops being enough bandwidth, the queue
will build up. As you have a large gap between the bandwidth you
have and the bandwidth you need, your queue size will grow
without bound and then your system will fall over or else it will
have to have some ad hoc mechanism in place for coping with that.
Our goal is to fix that in JetStream so that the system will use
the bandwidth that it has, no more, no less. We will use it
efficiently and, therefore, you will be sort of better off in
terms of both your analysis and your cost. We will adapt to
shortages: if there isn't enough bandwidth, the system will send
less. And then it can sort of go back and fill in the gaps
later. And we need new abstractions to do this.
Let me say a little bit about the system architecture. It's a
data flow system, sort of similar to other streaming query
processors. The model is that the user, in their program, sort
of specifies a query graph as a network of operators. This is
then handed to a planning library, which will optimize it and
figure out where the things go. It's handed to a coordinator.
The coordinator then hands it across to a data plane. At the
data plane, you have potentially multiple sites. You could have
this point of presence, that point of presence, the home office
with the big data center, and in each of those locations you may
have worker nodes. You may have stream sources. The coordinator
figures out where everything goes.
Let me now pop up and give you the sort of soft view of what
these query graphs really look like. You might imagine in this
example of the content distribution network that you have logs
that are written by your legacy system. There's some operator
that reads that file. That hands it to some other operator that
parses out the lines. This then goes into local storage, where
it will sit. Every ten seconds, you query your local storage at
both this site and at another site and the results go forward to
some central datacenter where they are queried and the results go
forward.
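To make the shape of such a query graph concrete, here is a small
sketch in Python. The graph builder and the operator names are
illustrative stand-ins, not JetStream's actual client API:

    # Illustrative sketch of a CDN-monitoring query graph (not the real JetStream API).
    class QueryGraph:
        def __init__(self):
            self.edges = []              # (source operator, destination operator) pairs
        def connect(self, src, dst):
            self.edges.append((src, dst))
            return dst

    # Hypothetical operator constructors; each returns an opaque operator description.
    def file_reader(path):     return ("read", path)
    def line_parser():         return ("parse",)
    def local_cube(name):      return ("cube", name)       # per-site storage
    def periodic_query(secs):  return ("subscribe", secs)  # emits a rollup every few seconds
    def central_cube(name):    return ("cube", name)       # at the central datacenter

    g = QueryGraph()
    for site in ["pop-east", "pop-west"]:
        reader = file_reader("/var/log/cdn/access.log")
        parsed = g.connect(reader, line_parser())
        stored = g.connect(parsed, local_cube(site + "-requests"))
        rollup = g.connect(stored, periodic_query(secs=10))
        g.connect(rollup, central_cube("global-requests"))
    # The coordinator would then place each operator at the appropriate site.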
And because we have this local storage, the system is able to
adapt and in the event that you didn't have enough bandwidth,
data is still there locally and in this distributed way so you
can go back and pick it up. But how is it really going to be
stored? We have choices.
There are a bunch of properties that we want our data
storage to have. We would like it to be updatable so that you
have streaming data coming in and then you can have one cohesive
representation of it, which is hopefully smaller than the full
stream, so it needs to be the case that you can update it in
place. Needs to be the case that you can merge it. That if you
have data stored here and data stored there, there should be some
way to represent the merged data naturally. And it ought to be
reducible. It ought to be the case that if I have too much data,
I can produce a sort of compact representation of it. These are
the sort of things we want.
And it turns out that these are not actually such standard
properties, right. That if you have raw byte strings, like a key
value pair, you don't have any semantics. The system doesn't
know how to merge it and doesn't know how to update it. It turns
out database tables also have the same problem. That actually, a
table doesn't tell you what to do with it. If you hand me a
table and a tuple, I am confused since there isn't a unique right
answer for how to merge those. And there certainly isn't a right
answer if you have a database table for how you produce a smaller
database table.
Happily, there is a representation that does the thing we want.
It's called the data cube. It's the thing that the OLAP analysts
came up with in, I believe, the '90s and it turns out that it has
all the properties we want. Let me tell you a little bit more.
It has a high level API in the same way that sort of SQL does.
In fact, it was developed by database people. Unlike database
tables, it doesn't give you arbitrary joins and therefore it does
give you sort of predictable performance that, by sacrificing the
ability to sort of do arbitrary ad hoc joins in this
representation, we can talk in a much more useful way about its
performance. The data is still there. You can still write these
sort of arbitrarily complicated queries. They don't go through
this abstraction. So they're sort of potentially a side way in
if you have to do something more complicated.
But for our purpose, you should just think about this cube
interface.
>>: Can you repeat your argument why it is, if I have a tuple on
the table, I can't merge or aggregate that?
>> Ariel Rabkin: It's not that you can't do it. It's that the
table alone doesn't tell you how. That what you wind up doing is
you write a SQL statement that says take this tuple and on update
add or on update max or on update this or that or the other.
That the way SQL is set up, the thing that you do when you have
new data that you want to add must be specified with respect to
that data addition. That it doesn't come with a table. It's a
sort of separate part of your
>>: It's not generic? I mean, you can certainly write SQL to do
it.
>> Ariel Rabkin: You can for a table write SQL for how to do
this, but it's not part of the schema. And there's nothing in
the spec or the sort of understanding people have of databases
that enforces any sort of consistency here. There was another
question I wanted to
>>: I'm not sure if this is the same question or it's a
different, related question, which is: is there a generic way to,
in databases, you specify a query as a standard layer of the
query that can work between your high level query and this level
of manipulation of the indices. Is there a way to get back
to something like arbitrary joins in the data cube representation,
where you write a query and it transforms in some generic way
that query down to the [indiscernible] that have to happen
[indiscernible].
>> Ariel Rabkin: Yes.
>>: So, I mean, arbitrary joins aren't completely ruled out?
>> Ariel Rabkin: They're not completely ruled out, but the point
is that they're not part of the abstraction.
>>: Okay.
>> Ariel Rabkin: And in general, as with most
abstractions and systems, there's some underlying abstraction
which is more flexible than the thing you gave. So, just
for our purposes, we have ruled them out. We can give them back
to you later.
>>: Okay.
>> Ariel Rabkin: Take it that way. I want to just tell you what
it is that I gave you, for those of you who are not database
people. The model of a data cube is that it's a
multidimensional array with some set of aggregates indexed by some
dimensions. To make that more concrete, here's where you are.
Our example, you have one dimension, which is the URL. You have
another dimension, which is the time. And the thing that is
different from this being a database table is that we also have
an aggregation function that tells us how to merge two cells.
This is not something that you have in vanilla SQL. There's
nothing in the nature of a database or the nature of relational
algebra that tells you what do I do if I have two different
relations and I want to get one relation out. A data cube has
that. It has this aggregation function.
And once you have that, you can do a lot. You can roll up your
data. You can take some set of cells and squish it down and give
you, for instance, all of the accesses to that URL at all times,
or you could ask about the sort of total number of requests at a
particular time.
And we use this one function for updates, for rollups, for
merging, for degrading. And this is the key thing that we need
is some semantic about how to manipulate the data. And it's just
one function so it's sort of the minimalist possible operational
semantics, let's say. And once we have that, we can do a lot.
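To make the cube abstraction concrete, here is a toy version in
Python: cells indexed by dimension values, with one aggregation
function used for updates, merges, and rollups alike. This is a
sketch of the abstraction, not JetStream's implementation:

    # Toy data cube: cells indexed by dimensions, combined by one aggregation function.
    class DataCube:
        def __init__(self, dimensions, aggregate):
            self.dimensions = dimensions   # e.g. ("url", "minute")
            self.aggregate = aggregate     # how to merge two cell values
            self.cells = {}                # dimension tuple -> aggregate value

        def update(self, key, value):
            # Insert-or-merge: the same function handles new data and existing cells.
            self.cells[key] = self.aggregate(self.cells[key], value) if key in self.cells else value

        def merge(self, other):
            # Merging two cubes (say, from two sites) is just a sequence of updates.
            for key, value in other.cells.items():
                self.update(key, value)

        def rollup(self, keep):
            # Coarsen: project each key onto a subset of the dimensions and re-aggregate.
            out = DataCube([self.dimensions[i] for i in keep], self.aggregate)
            for key, value in self.cells.items():
                out.update(tuple(key[i] for i in keep), value)
            return out

    counts = DataCube(("url", "minute"), aggregate=lambda a, b: a + b)
    counts.update(("/index.html", 17), 1)
    counts.update(("/index.html", 17), 1)
    by_url = counts.rollup(keep=[0])   # total requests per URL, across all times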
In particular, we're now going to modify the data dynamically
based on feedback control that we can look at the network and we
can look at the data and because we now have a way of updating
our data, we can produce a smaller version of it for copy. And
the feedback control will tell us when to degrade, and there will
be a user defined policy about how to degrade. And then because
the data is there, you could do later queries to pull back the
bits that you didn't get the first time through.
There's more than one way to degrade your data. And that is why
we need a policy. That you could imagine, for instance, that you
had data every minute and you wanted data every five minutes. We
will call this dimension coarsening. That is, you have many
samples and now you're going to have fewer samples. And it
doesn't have to be time. It could be that you had data at every
URL, and you'd like to have data at every domain or at every sort
of prefix of the URL.
A different thing you could do is drop low ranked values. That
you had some curve of these are the very popular URLs. These are
the less popular URLs, and you just drop the tail. That's a
different transformation. You could do either, you could do
both. And, in fact, there's a lot of things you could do. Here
are five we came up with. This is not an all encompassing list.
These are sort of five viable data degradations.
There's coarsening, there's dropping values. It turns out there
are sort of global protocols that let you do this in a consistent
way where you drop things not based on
Yeah?
>>: Question go back to the first box. For dimensions, in your
abstractions, are the dimensions fixed like a SQL table, or a
dimension can be added?
>> Ariel Rabkin: They are fixed for the cube. The cube schema
is very much like a database schema in that changing schema is a
heavyweight operation.
>>: Just for clarification, going back to the first question, I
didn't quite understand. When you showed the tables, the
tables have a schema. You can write a SQL query to do, say,
aggregation. So what is the exact difference?
>> Ariel Rabkin: The difference is that with a cube, the
dimensions and the aggregates are different, whereas in SQL,
there is no such distinction. You can define data cubes in terms
of SQL and, in fact, that is both how we implement it and how
they have been historically defined. But we are going to take
that abstraction and put it in the middle of the system and we're
going to sort of use it only through this more restrictive
interface.
>>: Can you give me an example?
>> Ariel Rabkin: Of?
>>: For, you know, if you did something with SQL, you know,
here's how you would do it. And if you did something
>> Ariel Rabkin: Yeah, so the way that we implement the
aggregation function is we write a complicated SQL statement,
right, and, sort of based on the cube schema, we are able to
produce these. But you needed to ask the user really what did
you mean and what kind of aggregate is this. Is this a maximum?
Is this an average? Is this a median? Right. You have choices
there which are not visible actually at the SQL layer. And so we
are sort of bolting a thing on top that specifies these
semantics. And if you need a longer explanation, I want to do it
later.
I want to first just tell you that, once you have your
cube, you have many choices for what to do, right. That if you had, for
instance, a histogram in your cube, you could have chosen to down
sample this histogram or you could have chosen to keep fewer
histograms. And there are trade offs. In particular, the trade
off I want you to notice is that most of the time, you have to
choose between a fixed bandwidth savings, that there's a
transformation that predictably will give you half as much data,
and then there are transformations that have a fixed accuracy
cost. And in general, you have to choose. And that it's not the
case that there's an all purpose transformation that has
predictable consequences, both for the size of your data and the
accuracy of your data. And that fact will be important in a
minute. So jumping back to the system, the first thing you might
say is, well, I have a feedback controller and I will just pick
an operator and put it in my data flow graph and that will
specify how to transform the data. And the operator is then
attached to some controller. You specify the policy by fixing an
operator, and then you have a sensor that says you're sending
four times too much data, back off that much, and then the
operator reads the sensor.
Let me give you an example of where you don't always have a
predictable bandwidth savings. Let us suppose that you have
decided to aggregate your data over time, that you had data every
five seconds, and you want to coarsen it to every minute. The
amount of data that comes out of this is not predictable in
advance. That if you did this for domains, you'd get a large
savings. This is, by the way, data from the CoralCDN. And if
you do this for URLs, you get no savings. And the reason is that
approximately every Coral URL is unique. People have these
queries. The queries have a large query string. Those don't
repeat. The domains do repeat. And so over time, you get
savings as you start to coarsen this and have data for this
domain only every minute or every hour. But it's a totally new
set of URLs. So there's no savings. And you might plausibly not
know in advance which case you are in, right. This is, of
course, a continuum. You don't know which case you're in in
advance. This depends on your users, your users change over
time. And so a natural thought is I will have a composite
policy. That I will have two different operators that will read
the sensor, and then I can do a little of each. What happens
here? I will pause for a moment; contemplate what happens if
you try and do this in the naive way. What you get, in a precise
sense, is chaos. That if you have two different actuators driven
off the same sensor, you don't have a stable feedback loop. This
can't work. There's a more subtle problem, which is that
actually operator placement is not free. That there's
constraints on placement due to the sort of underlying data flow.
That, for instance, the way we do coarsening is that there's a
thing that's querying a cube and it does a query rolling up to
the minute level or the second level or what have you. And that
query has to be next to the cube because of the sort of nature of
the data flow. So it's not the case that you could put your
operator where you like. So you can't use that as a cue to what
to do.
So the natural thought is let's have a controller. And then the
controller will be able to tell the operators what to do and will
specify the priority. And this is nice. By the way, the thing I
should say is that this is on every network connection.
This is not global to the system. It is not even global to the
machine. Every time there's a network connection, there's a
little controller with a policy that says first apply this, then
apply that. And the controller reads the sensor and tunes the
operators to do the right thing.
And so there's a potential to put a policy on every one of these,
and that policy can say, first, apply this then apply that.
Apply this only up to a certain point. You can specify what the
degradation behavior should be. And this is good because now
we're no longer bound by the topology. Which operator you apply
first is unrelated to the structure of your data flow, and that's
good. There's a problem, and the problem is that we do need, at
the end of the day, to take the data that's coming from here and
the data that's coming from there and merge it together.
And that's not, in general, trivial. Let us suppose I have data
every five seconds and then at some other site I have data every
six seconds. What do you do? The thing that represents the
data, quote, exactly, the sort of the natural representation, is
to do it every 30 seconds. And that's really bad, because you
just dropped a large chunk of accuracy that you didn't need to
drop. So this is bad. We think that
the fix is, instead, to jump from every five to every ten. That
if you couldn't afford to send data every five seconds, you
should start sending it every ten seconds. And now it's the case
that this can be merged after the fact without an additional loss
of accuracy. That this gives you sort of better degradation
behavior and better semantics.
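As a small worked example of why that matters (the period ladder
at the end is made up for illustration):

    from math import gcd

    def merge_period(a, b):
        # The coarsest granularity at which two time series can be represented
        # together without further loss is the least common multiple of their periods.
        return a * b // gcd(a, b)

    print(merge_period(5, 6))    # 30: a large, unnecessary loss of resolution
    print(merge_period(5, 10))   # 10: degrading 5s -> 10s merges with no extra loss

    # So a degradation operator should offer levels from a divisor-compatible
    # ladder, e.g. 5s, 10s, 30s, 60s, rather than backing off by arbitrary amounts.
    LEVELS = [5, 10, 30, 60]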
The thing I want you to notice is you can't cleanly unify the
data at arbitrary degradations. This is not unique to time
series data. This is the case, for instance, if you have a
sketch of a distribution, right. There's these sophisticated
data structures called sketches that let you represent an
approximation of a whole statistical distribution for things like
finding quantiles. These come in fixed sizes. It's not the case
that you could make them ten percent smaller. You often have to
go down a factor of two. And so the system needs to know that.
And the degradation operators need to have fixed levels.
So now, this is therefore going to really shape our final
interface here. That the model is that you have an operator, the
operator goes to the controller and says, I have a bunch of
choices for how much data to send. I'm currently dropping half
the incoming data. It could be zero. It could be 75 percent.
What do you want it to be? And the controller, which has access
to the sensor, is able to pick the level that will use as much
bandwidth as is available and no more.
And so that's the interface. And the thing to notice is this set
of levels can be determined dynamically, that the operator is in
a position to look at the statistics of the data and estimate
what it will do. And this is quite a valuable point of
flexibility. And another thing that you can do as a result of
the semantic is that the operator can put the level changes at
semantically meaningful points, right. That you might want it to
be the case that for every minute or every hour or what have you,
you have some consistent accuracy or that, you know, you only
switch
for instance if you're doing something audio visual,
you might want to only change frame rate at some point that makes
sense with respect to the underlying code. And our interface is
flexible enough that we can do that, and that's important.
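Concretely, the interface looks something like the following
sketch. The class names, the level ladders, and the controller
loop here are all illustrative, not JetStream's actual code:

    class CoarsenTime:
        """Degrade by coarsening the time dimension; levels estimated from recent data."""
        def levels(self):
            # (estimated fraction of data sent, time granularity in seconds)
            return [(1.0, 5), (0.55, 10), (0.25, 30), (0.15, 60)]
        def set_level(self, level):
            self.granularity = level[1]

    class DropLowRank:
        """Degrade by dropping the tail of unpopular URLs; keep only the top N."""
        def levels(self):
            return [(1.0, None), (0.8, 1000), (0.5, 100)]
        def set_level(self, level):
            self.keep_top = level[1]

    def tune(operators, sensor_ratio):
        # sensor_ratio: data being produced / bandwidth available; 4.0 means 4x too much.
        budget = 1.0 / sensor_ratio          # fraction of the data we can afford to send
        for op in operators:                 # the policy is the priority order of operators
            choices = sorted(op.levels(), reverse=True)   # least aggressive level first
            pick = next((lv for lv in choices if lv[0] <= budget), choices[-1])
            op.set_level(pick)
            budget = min(1.0, budget / pick[0])           # slack left for the next operator

    coarsen, drop = CoarsenTime(), DropLowRank()
    tune([coarsen, drop], sensor_ratio=4.0)
    print(coarsen.granularity, drop.keep_top)   # 30 None: coarsening alone suffices here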
This really works, it turns out. I will now give you some
experimental evidence. We used 80 nodes on the VICCI test bed.
That's the sort of descendant of PlanetLab. It's the same
model. You get slices on a bunch of machines. Happily for us,
there's hardly any users and so we had relatively clean
experimental conditions. The data is back hauled to Princeton
and we will drop data if there isn't enough bandwidth. We ran
this twice, once with and once without our mechanism, so you can see
the difference. Without degradation, we ran the system for about
40 minutes, and then we turned on bandwidth shaping. We sort of
told the Linux kernel, only send this much data. And so
bandwidth usage drops due to the bandwidth shaping. When you
remove it, it, of course, goes back up as it drains its queue.
This is what happens to latency. And this is not good. I'm
showing you the median, the 95th percentile and the maximum. And
the thing to notice is all of them start growing rapidly as the
queues build up, and then when we turn off the bandwidth shaping,
the queues start to drain. The median node drains quickly and
only takes about ten minutes for it to recover. The 95th
percentile takes a good long while. It takes something like 45
minutes before the 95th percentile latency recovers. And the
last node takes an hour to recover, right. That node is starving
for bandwidth. That's bad. This is the sort of precise
experimental version of that figure I wanted you to imagine
earlier. This is really what it looks like. And this is no good
if you're trying to do streaming analytics, because that says if
there's a bandwidth glitch, your streaming is going to fall
behind. The adaptation fixes this problem. Here's the same
experiment. Here I'm turning on bandwidth shaping. I'm turning
it on twice to show you that there isn't a transient there.
This is what the median latency looks like. It went up to
15 minutes before. Now it goes up to eight seconds. That's a big
win. This is the 95th percentile. You'll notice that it's not
that much worse. You'll notice that the 95th percentile again
recovers quickly. Just to show you I don't have anything up my
sleeve, here is the maximum latency. The graph is a little ugly
because you get sort of aliasing
effects, because nodes don't always report, but even the maximum
latency only goes up to about 30 seconds. So life is okay,
right? That's okay for the worst node at the worst time.
Mostly, the nodes are quite good. It's only a few seconds.
There's no really great way to evaluate programmability, so let me
tell you what we did. We picked eight queries drawn from
operational experience with Coral. We coded them up. We
discovered that we could code all of them in somewhere between
five and 20 lines of code, except one of them was about a
hundred lines; that did some complicated two-round thing where
you would look at what was the
Yeah?
>>: Just to clarify on the previous slide, you are getting two
different results in that one of these gets you degraded data and
one of them doesn't; is that correct?
>> Ariel Rabkin:
Yes, that is the case.
>>: Okay. And does this one ever get you back to the original
data?
>> Ariel Rabkin: It doesn't do it automatically in our current
implementation. You can go back and query that data. That data is
all stored. We haven't thrown it away, but we aren't doing it in
a streaming way, right. The logic of this was that if you're a
user, often late is not actually an improvement on never. That
if you're making some decision now, the fact that you can
ultimately get back that data is not extremely exciting.
So it made sense to say that we don't back haul by default. That
if you didn't look at your logs immediately, we assume you might
never look at them and so we will leave them where they're
created.
>>: So in a way, we've shifted from a problem where people
don't know how to program systems that will use the bandwidth
appropriately, and now we're hoping that they know how to trust
the statistical validity of degraded data?
>> Ariel Rabkin: We can tell them how statistically accurate
their data is. Often, you can get a quite good error bound. In
particular, if you have a sketch or a histogram, those come with
really well defined error bounds. There's no difficulty in
telling people your data is accurate to within two percent.
Likewise if you're dropping low ranked values, it's quite easy to
know actually what the biggest effect that could be was.
So we can tell people this is the statistical validity of your
data. We have that.
>>: How sensitive is this experiment to the topology of the
[indiscernible].
>> Ariel Rabkin: We couldn't arbitrarily vary the topology,
so we don't have a great way to tell you about the
effect of the topology of the underlying network.
>>: But the topology does affect these results. In other
words, if you had a different topology [indiscernible], these
results could be a little different.
>> Ariel Rabkin: Presumably, yeah.
>>: What is the topology of VICCI?
>> Ariel Rabkin: The topology of VICCI is that the different
sites are attached to Internet2, and the question now
is what is the topology of Internet2, and the answer is I don't
really know. I don't believe we're saturating that network. The
bottleneck here, I believe, is the Gateway to Princeton.
>>: The gateway to Princeton, I see.
>> Ariel Rabkin: That this is about 500 megabits and Princeton
only has a gigabit link. So we're using a large chunk of what's
available there.
>>: So that node that was starving is a Princeton node that is
trying to get out through that
>> Ariel Rabkin: I assume it's a node not at Princeton that's
trying to connect to us and discovers that that gateway is
saturated.
>>: Okay.
>> Ariel Rabkin: All right. And with the degradation mechanism, the
other nodes can be backed off, which you can't easily do
otherwise. All right. Speaking of the difficulty of
configuration and error analysis, that is my next topic.
Software systems have many knobs and switches, right. Often
there's a thread pool or a number of threads or a memory bound.
There's some number that you have to put in by hand. And often,
then, there's a switch which is do you want this feature on or
off? Do you want this always, sometimes or never. What have
you. You have choices. And then there's external identifiers.
There's places where you need to refer to something out in the
world. You need an IP address, you need a file name. You need
to specify the network interface to bind to. You have these
options.
And I will tell you how we will debug them, particularly these
last two kinds that you have some option that's just totally
wrong. It's not merely inefficient, it's wrong. And as with
microphones, often you don't really know what's wrong. If you
start Hadoop out of the box without configuring it, you get this.
And this is not very helpful if you are a novice user. You have
a stack trace with a null pointer exception, what do you do? If
you're an expert, it's great because you can read the code and
understand what it means. But if you're a novice who doesn't
want to read the code, you're very confused.
>>: Can I ask you a scoping question? The thesis here is that
these configuration challenges are coming from the fact that the
user is having to tie together dozens of pieces of software that
fit together in different ways, is that
>> Ariel Rabkin: That is one of the things that makes this
especially hard. But even if you have one, quote, system, that
system has pieces underneath that don't always align perfectly.
>>: I guess what I'm asking is, historically, you start
out with systems that are designed such that only engineers
could love them, because they're made out of
[indiscernible] that you can rearrange. And at some point, we
understand what the killer apps for them are.
>> Ariel Rabkin: Yes.
>>: And then some company comes and builds a monolithic tower or
something called Office with Excel and everything and says here's
a nice package. And no, you're not going to want to see the null
pointer exception, but that's okay because we tested it. So, I
mean, is there a reason to believe that managing a stack of
bricks is actually going to be a long term phenomenon, or are we
going to learn what the sort of named motivating apps are and
then build single coherent packages that are well tested?
>> Ariel Rabkin:
That's a really good question.
>>: Feel free to defer it. I'm just trying to understand how it
fits into the
>> Ariel Rabkin: So one thing to say is your office runs on one
machine and your distributed execution engine does not. And so
just intrinsically, you have now more bricks if you are in a
distributed world, because you need to specify what the hardware
is and what the network is. And I don't think that really is
going to be monolithed away. I suppose that there are companies
that want to sell you an appliance and that's the way that we
sort of hide that. But unless you believe that the processing
appliances will take over the world, yeah, we're going to have a
lot of software pieces.
Also, the sort of economics of development are such that the size
of a coherent, fully engineered solution would be really
infeasibly big when you start to get to multi million line
software systems. There's very few companies that can deliver a
product the size of Office, and I would be surprised if anytime
soon the whole ecosystem of distributed processing gets that
consolidated. So the answer is it's not intrinsically impossible
to deliver shrink wrapped software in this environment, but it
isn't happening soon. And if you want a longer answer, we can
discuss more later.
So what do you do if you have this stack trace? So the natural
thought you might have is there must be some knob I can change.
Hadoop has knobs. It has many knobs. You're not out of knobs
yet. I stopped at M. So you have hundreds of options. And
this, by the way, is not a fanciful problem. I spent a summer at
Cloudera. I looked at their trouble ticket database, and the
thing you learn is misconfiguration is the biggest problem. When
you measure by support time or by number of cases, mostly it's
misconfiguration, or at least that's the plurality of the
problems. That is a bigger problem than bugs.
In this case, a misconfiguration is a thing that you
fixed by changing the configuration. A bug is a thing where we
went to the developers and said you need to patch this. So
mostly, the problem is in the configuration, or at least, sorry,
the plurality of the time, the problem is in the configuration.
So that's a sign that this is really where we should be devoting
our effort.
And because the software is sort of bolted together from bricks
and because different users have different use cases and because
no two MapReduce sites are quite alike, often the error messages
are rather unhelpful and often it's sort of hard to figure out
really what your problem is. That people routinely ask for help
and don't get it, which is why there's a support business.
What do we do about this? Well, one approach to automated
debugging is to collect a lot of data: you can take some program
and you can instrument it in six different ways. You could do
dynamic instrumentation, you could watch the system calls it
does, you could sort of profile its execution in various ways.
You could just collect a lot of data and use that to match it
against some either library of known problems or to figure out
where in the program it went awry. And this is great if the
program is running on a machine you control. But often, the
program is running on a machine you don't have full access to.
In that case, wouldn't it be nice if we could just search if you
could take the error message and put it in a search box and get
back a result.
And I will tell you how to do this, and this is great because it
requires no access to the site where the program is running. It
requires no modification of anything. There's nothing to
install. You just search.
>>: You still have to instrument?
>> Ariel Rabkin: No, there's no instrumentation happening and I
will tell you what we do instead.
So the goal here is to resolve the misconfiguration with only the
error message, whether or not we've seen it before. All we get
is an error message, right. So this is sort of the minimalist
thing you could have. If the user says help, it's broken, with
no details, no. But otherwise, this is probably the least we
could possibly assume, which is good. And because all that we
are going to rely on is error messages, we could do it all in
advance. That there's only a finite number of points in the
program and so for each point, we could build a table which is
what was the option that could have caused
an error there.
We're going to build a table, and we're going to do it with
static analysis all in advance. And at the end of the day, when
the user has an error, they can go to some diagnosis service,
maybe with the app, maybe on the web, and they can do a query and
they can get back a result. The assumption here is that the
developers are minimally friendly. They at least will give us
access to the compiled binary. We don't actually need
source, but you do need debug symbols to know line numbers. The
assumption is we have that and use some insight into the
structure of the program. We don't need to run it. We don't
need anything else. Yeah.
>>: Why not change the lines of code into helpful error messages
instead?
>> Ariel Rabkin: I give you two answers, the first of which is
you don't always know in advance what the helpful error message
would be. Since if the problem is that it threw an exception,
it's sort of hard to figure out all the possible exceptions that
would be caused by misconfiguration and hard to think about all
the possible configuration causes for errors. So we don't
actually, in general, know what those error messages should be.
The second thing to say is that you don't always have the ability
to make those changes. That often, it takes until fairly late in
the release cycle before you understand what can go wrong and
people do not want to put in a lot of new code to log things at
that point. So from an engineering point of view, that actually
is quite painful.
It took, I think, eight versions of Hadoop before they patched
that silly crash on startup, right, which is like the very most
obvious thing, because it's fresh out of the box. If you can't
patch that, I think that just says that you're not going to get
very far telling engineers to write better error messages. So
we're going to try and clean up after them.
And the [indiscernible] that is if you have some exception, you
can just look it up in this table. The table then matches, you
say, aha, line 200, I know what that is, and back comes your
response and life is good. The semantic here, by the way, is
possibly responsible. I make no claims that it's necessarily
one of these or that it's always included. That will
be, I'll show you, empirically good enough. This is the part
that's hard, the static analysis; the rest is a regex or a web
search. And this is work that appeared sort of in software
engineering venues a couple years ago.
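The runtime side of this is tiny, roughly the following; the
table contents, option names, and line numbers here are made up
for illustration, and only the lookup logic matters:

    import re

    # Precomputed offline by the static analysis: source location -> options whose
    # values can flow there. These particular entries are invented for illustration.
    POSSIBLE_CAUSES = {
        "NameNode.java:200": ["fs.default.name", "dfs.namenode.rpc-address"],
        "TaskTracker.java:415": ["mapred.local.dir"],
    }

    def diagnose(stack_trace):
        # Pull "File.java:123" frames out of the trace and union their candidates.
        frames = re.findall(r"\((\w+\.java:\d+)\)", stack_trace)
        options = []
        for frame in frames:
            for opt in POSSIBLE_CAUSES.get(frame, []):
                if opt not in options:
                    options.append(opt)
        return options

    trace = ("java.lang.NullPointerException\n"
             "    at org.apache.hadoop.hdfs.server.namenode.NameNode"
             ".initialize(NameNode.java:200)")
    print(diagnose(trace))   # ['fs.default.name', 'dfs.namenode.rpc-address']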
So let me now say a little bit about how this happens. So
there's this stack trace. What really is going on under the
hood, to give you the flavor of why configuration bugs creep in,
is there's some configuration option, which is what is the file
system I should be binding to. And this is null by default. And
then when you concatenate null with a port number, that's going
to fail. You can imagine how this happened, right? One
developer writes this, one developer writes that. They're
testing only in a sort of well managed cluster where that option
is always set. They didn't bother to look at what happens if the
user didn't put anything in the configuration file by default.
Oops.
How will we fix this? Well, we're going to do data flow on the
configuration options, that there's a point in the code where it
reads an option out of the config. This has a key value
interface, and then we can sort of propagate those labels through
statically. That we're going to do data flow, and we're going to
say aha, if there's a use of a value, then the output is also
labeled. There's a points to analysis in the background and so
we can sort of trace values through the heap. If you stuff
something into an object and you can read it out on the other
side, that will get sort of picked up by the analysis.
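As a cartoon of that data flow, here is a toy label propagation
over a small straight-line program. The statement encoding and
the option names are invented; the real analysis works on the
compiled program with a points-to analysis underneath:

    # Toy forward data flow: propagate "which options could have influenced this
    # value" labels. A conf.get introduces a label; every use propagates labels.
    program = [
        ("addr", "conf.get", ["fs.default.name"]),   # read of a configuration option
        ("port", "conf.get", ["dfs.port"]),
        ("uri",  "concat",   ["addr", "port"]),      # use: output inherits both labels
        ("sock", "connect",  ["uri"]),
    ]

    labels = {}   # variable -> set of options that may have influenced it
    for dest, op, sources in program:
        if op == "conf.get":
            labels[dest] = set(sources)              # base case: the option itself
        else:
            labels[dest] = set().union(*(labels.get(s, set()) for s in sources))

    print(labels["sock"])   # {'fs.default.name', 'dfs.port'}: candidate root causes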
It turns out that you have choices at this point. That there
isn't one standard way to do points to. There are many choices.
And you have to choose between a sort of liberal notion of data
flow, which you sort of include all the possible ways values
could flow, and a sort of strict notion of flow in which you
might miss some flows. And if you take the liberal approach,
your analysis will produce more false positives and if you take
the strict notion, it will produce more false negatives.
And so when you are designing this analysis, you must sort of
steer between the monster and the whirlpool. And just to put
some sort of concrete techniques down, if you want to be liberal,
you need to do inter procedural control flow. You say, well,
whether or not this method was called depends upon this option.
And you want to do a sound point to analysis that picks up all
the possible points to dependencies. Or you could be strict and
say, I will only do a limited control flow analysis, and I will
ignore some possible points.
>>: Seems like a crash on startup is sort of the best possible
case for this, because at that point, that code's only been
tainted by one or two options. But in the middle of the program,
this will
>> Ariel Rabkin: This is a static analysis.
>>: So even so, in the middle of the program, say deeper in the
program, as opposed to the middle, there are hundreds of options.
>> Ariel Rabkin: Yes, this is an unusually friendly case, which
is why I wanted to pick it to talk about, to sort of make the
analysis a little clearer to describe. I will show you
empirically that this works, don't worry. That was not going to
be my only example.
Let me now talk a little bit more about why system software is
hard and why sort of applying static analysis to this kind of
problem is not a straightforward exercise. Let us suppose you
have some program where main calls method A and there's no
visible caller of method C. If this were undergraduate
program analysis, you would say aha, I will
write an inductive rule for reachability. Main is reachable,
that's my base case, there's an inductive rule that says if A
calls B then, you know, B is reachable if A was reachable. And
so if there's no caller of C, C is not reachable.
This is not how system software works. That in the world of
complicated software systems, there are RPCs, right. Stuff is
invoked remotely and there's some reflective glue that makes that
happen. And you could try and do a sound precise reflection
analysis. This turns out to be, in general, intractable, and so
the fix that I did instead was to just label the classes that are
exposed remotely. And this is actually not hard. There's two of
them for a program the size of Hadoop, four of them or something
like that. It's a handful.
You find the network interface that the system exposes and you say, that
class is the network interface. And this requires a minimal
amount of insight about the system, but only a minimal amount.
This did not require deep surgery. Once you do this, you have to
adjust the points to analysis to compensate, right. That if
stuff is invoked remotely, you need to know upon what object it
was invoked, and this then requires that you adjust the
underlying points to analysis to cope with the fact that there
are sort of call chains that didn't come in through main.
And once you do this, it turns out you can actually get static
analysis to run on these complicated distributed systems that it
has not historically been run upon. And that's an achievement.
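Schematically, the fix amounts to seeding reachability with
extra roots; the class and method names below are hypothetical:

    # Reachability over a call graph, with RPC-exposed classes as extra entry points.
    CALL_GRAPH = {
        "main": ["A"],
        "A": ["B"],
        "B": [],
        "ClientProtocol.create": ["C"],   # invoked remotely via reflective RPC glue
        "C": [],
    }
    RPC_ENTRY_POINTS = ["ClientProtocol.create"]   # annotated by hand: a handful per system

    def reachable(roots):
        seen, stack = set(), list(roots)
        while stack:
            method = stack.pop()
            if method not in seen:
                seen.add(method)
                stack.extend(CALL_GRAPH.get(method, []))
        return seen

    print(reachable(["main"]))                      # misses C entirely
    print(reachable(["main"] + RPC_ENTRY_POINTS))   # now C is reachable too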
There's another problem which is that it doesn't scale. The
static analysis is exponential in the worst case. And in the
concrete cases I care about, it's really bad. Cassandra,
which at the time was 10,000 lines of code, so quite small, took
three and a half hours to run. It took a month on Hadoop. And
that's really bad, because there's a new version of Hadoop every
six weeks. So if your analysis takes a month, you're in trouble.
And, of course, there are larger systems than Hadoop.
So this is a problem. Happily, there is a fix and the fix is
that we can, by adroit choice of heuristics, get this down to
something manageable. And, in fact, for Hadoop, we can get it
down to about 20 minutes. That's a 2000x speedup. That's enough
to be very happy as a system researcher.
What is the problem, really? So it turns out that the problem
really is the libraries. That when you have a program, you think
you have a program, but actually you have this sort of giant
iceberg of underlying code. That there's a standard library
which is usually much larger than your program, and there's all
these third party libraries that you have linked against.
And the problem is that the analysis is spending its time there,
and it's spending its time there because you actually have a
genuinely different points-to graph in different programs that
use different subsets of the library. So you can't just analyze
it once, because what points to what depends on what you did.
Fortunately, there is a way for our purposes that we can ignore
all this code, that we can sort of cut on this dotted line and
just model the library instead of analyzing it. Let me now tell
you how. So it turns out libraries are special in a deep way. And
the deep way is this. In a static analysis, if you're analyzing
method A and it calls method B, the analysis has to go and
analyze B in this concrete call site and look at it with these
arguments, right. You have to trace this data flow through
method B.
Library code is special. It can't modify the user heap. There's
a sort of dual contract to protect you: the type system says
the library can't really access the fields of application
structures because it doesn't see their types and there's a
social convention which is that it would be terrible manners for
the library to try and evade this with reflection, right. That
in general, the library won't do that to you for good reasons.
And likewise, libraries don't have global state. Again, this is
enforced to some extent technically and to a large extent
socially that it would be a very bad library design if data flows
in here and flows out there, and there's no visible connection
between those sites.
And if we assume that these rules are actually followed, then
there's suddenly no need to analyze the library. That we're
guaranteed that any data flow through the library will be local,
that if a configuration option goes in, it will come out right
here. That either the return value or the receiver object will
be tainted but not some arbitrary other structure.
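Under those two assumptions, a call into library code can be
summarized instead of analyzed. Roughly, and again as a cartoon
rather than the real analysis:

    def transfer_library_call(labels, receiver, args, returns):
        """Summary for a call into library code: taint on the receiver or any
        argument flows only to the return value and back to the receiver,
        never into arbitrary application state or hidden global state."""
        incoming = set()
        for v in [receiver] + list(args):
            incoming |= labels.get(v, set())
        if receiver is not None:
            labels[receiver] = labels.get(receiver, set()) | incoming
        if returns is not None:
            labels[returns] = incoming
        return labels

    # Example: path = new File(conf.get("dfs.data.dir")).getAbsolutePath()
    labels = {"dirOpt": {"dfs.data.dir"}}
    transfer_library_call(labels, receiver="fileObj", args=["dirOpt"], returns="fileObj")
    transfer_library_call(labels, receiver="fileObj", args=[], returns="path")
    print(labels["path"])   # {'dfs.data.dir'}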
>>: I wanted to question the second one about global state.
Plenty of libraries have [indiscernible] where you create a foo
and you get something back that's kind of like handle for a foo
and then you do various
>> Ariel Rabkin: That's okay. That will work. That handle will
come back tainted.
>>: Even if it's [indiscernible].
>> Ariel Rabkin: Yes. The analysis treats them as opaque
objects.
>>: So the thing that you are asserting libraries don't do is
tuck that tainted state away in a static global that only the
library can see and a different [indiscernible].
>> Ariel Rabkin: Right. I am asserting that that normally
doesn't happen.
>>: Okay. So it's okay to recognize the static global if it
indexes into it
>> Ariel Rabkin: Yes. That's fine. The analysis will do the
right thing there. You might, at this point, get nervous. I
will reassure you in a moment. Let me first reassure you that
this runs quickly. That whatever it gives us, it gives us in a
hurry. For something the size of Ant or FreePastry, which I
think are half a million lines of code, this takes half an hour.
For the components of Hadoop, which are on the order of 70,000
lines, this is a couple of minutes. That's okay. This would have been
totally off the charts otherwise.
So we really needed these heuristics or something like that. We
needed something to make this tractable. Now let me try and
directly address those questions. Yes?
>>: FreePastry, you said, is half a million lines of code
without libraries?
>> Ariel Rabkin: It's up there, yeah. Maybe it was 200,000
lines. I don't remember. I'd have to look that up. But it was
large. And because it was written by researchers, it has sort of
quite convoluted code. It has the sort of continuation-passing
style with a lot of callbacks, and that turns out to create a
very tangled points to graph. So it was slow to analyze.
Let me now talk about accuracy. And for accuracy, I'm going to
look at two programs. I'm going to look at Hadoop, and I'm going
to look at the JChord analysis tool. So these are sort of large,
complicated configuration heavy systems programs that I didn't
write but have used and, therefore, am qualified to evaluate on.
Because I have some idea whether the results make sense.
To evaluate, it turns out there's a tool called ConfErr, which
does fault injection testing for configuration. You give it a program in a
working configuration and it will permute that working
configuration until it crashes. In running this, I came up with
18 failed instances across the two programs. So this is a set of
configuration mistakes that were not created by me.
I have nothing up my sleeve. I didn't invent them. We're now
going to use these to check if the analysis is correct, right.
That we know the root cause, because we injected it.
We can find out what the analysis tool finds. And what you get
is I'm going to show this as false positives versus false
negatives. That the analysis gives you 80 to 100 percent of the
correct guesses and it gives you only a couple of diagnoses each.
You give it an error message, and it gives you back three
guesses, usually including the right one.
To help you interpret the plot, the theoretical ideal is there.
That's really the theoretical ideal. That's in the case where
every error message has exactly one cause. And every problem
leads to exactly one error. That doesn't happen for real
software, but that's sort of the very best you could do.
And note that these programs have hundreds of options so if you
did the naive thing and just returned everything, you'd be,
again, off the charts. So three guesses, that's not so bad. We
can do better. I've been talking about errors as though they
emerge at a point. Actually, they emerge from a path. There's
the point where an exception is raised and then there's the stack
trace that got you there. And we can use that whole stack trace.
That in particular, when you have an exception, this stack trace
is telling you the calling chain that got you there. That method
A could have been reached from any of the various paths. This is
the one that caused the exception. And you can do a static
analysis where you ignore all the other call chains and where you
ask only about the data flow that can reach a given point via
that chain.
So this is now a static analysis that's parameterized by this
stack trace. And that's actually quite quick. That here, you
really can reuse the points to and everything else because it's
the same across different analyses of the same program. And the
consequence is that this is now very quick to do. And so you can
imagine doing some caching so that the user doesn't even see
that, but seconds, that's not so bad.
There's another thing we can do, and this maybe bears on what you
were hinting at earlier about early in the program. As a program
runs, it will look at an option and then it will look at another
option. And you might imagine that it would normally read option
C and D, but instead it crashes. This told us something. This
told us dynamically for this concrete failure, C and D could not
possibly be responsible, because the program never looked at
them.
And so if you record dynamically concretely at run time what the
program read, you now know which options it didn't look at, and
that really helps you, all right, that you can sort of trivially
then filter away the things that never came in. And using
these two techniques, we get a substantial precision improvement:
using the stack traces, we get a noticeable improvement, and then
using this logging, we get another improvement. And now we're
really quite close to that theoretical ideal, and life is good.
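The runtime filtering step is easy to state. Assuming the program,
or an instrumented configuration library, logs one line per option
it reads, a sketch of the filter looks like this; the
"READ <option>" log format is an assumption for illustration.

    def options_read(log_path):
        """Collect the configuration options a failing run actually read,
        from a log with one 'READ <option>' line per read."""
        seen = set()
        with open(log_path) as f:
            for line in f:
                if line.startswith("READ "):
                    seen.add(line.split(maxsplit=1)[1].strip())
        return seen

    def filter_candidates(static_candidates, log_path):
        """Drop any statically suspected option that the failing run never
        read: an option the program never looked at cannot be the cause."""
        return static_candidates & options_read(log_path)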
A thing to notice is some programs have been, quote, helpful and
produce a, quote, friendly error message and not a stack trace.
This is good for humans and bad for automated analysis, because
you have less information about the state that got you to that
error. And JChord, which was programmed cautiously and
defensively to produce a lot of error messages, doesn't tell you
by what path it got to that error, and that's why you didn't get
an improvement by looking at stack traces there. So stack traces
really are useful if you can get them.
>>: So I'm still a little worried that you sort of motivated the
problem in the beginning by saying that Cloudera had a plurality
of problems with configuration. Are those all problems that
would be addressed by this, or actually an even better question
might be, how many of those would be addressable by Google, which
is in some sense the trivial way of solving the problem.
>> Ariel Rabkin: None. They wouldn't have filed a support
ticket if they could have looked it up on their own. By the time
they file a support ticket, smart people have been stumped for a
while.
>>: So does this... is there any [indiscernible] looking at some
of those cases and determining...
>> Ariel Rabkin: This would have helped on some. It turns out
that there's a piece of this analysis that is less interesting
from a research point of view and more useful practically, which
is we can produce a list of what all the options are. And that's
really useful on the support side, since it often turns out that
people have, like, a typo in their configuration, and being able to
just statically extract a list of options and what their types
are, that's huge. That really made it into production use at
Cloudera. So the answer is: half of this. The half that was less
interesting as research was more useful in practice, as so often
happens in life.
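That practically useful piece, checking a user's configuration
against the statically extracted list of option names and types, is
simple to sketch. The option entries below are illustrative, not
Hadoop's real list; in the real tool the table comes from the
static analysis.

    import difflib

    # Option names and expected types, as extracted statically from the
    # program (illustrative entries only).
    KNOWN_OPTIONS = {
        "dfs.replication": int,
        "dfs.name.dir": str,
        "io.sort.mb": int,
    }

    def check_config(user_config):
        """Flag unknown option names (likely typos, with a suggested fix)
        and values that do not parse as the expected type."""
        problems = []
        for key, value in user_config.items():
            if key not in KNOWN_OPTIONS:
                guess = difflib.get_close_matches(key, list(KNOWN_OPTIONS), n=1)
                hint = f" (did you mean '{guess[0]}'?)" if guess else ""
                problems.append(f"unknown option '{key}'{hint}")
            else:
                try:
                    KNOWN_OPTIONS[key](value)
                except ValueError:
                    problems.append(f"'{key}' expects "
                                    f"{KNOWN_OPTIONS[key].__name__}, got '{value}'")
        return problems

    print(check_config({"dfs.replicaton": "3", "io.sort.mb": "lots"}))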
But a large chunk of this sort of static analysis machinery was
actually necessary. Did that partially answer the question?
>>:
Yes, partially.
>> Ariel Rabkin: I mean, one way to say it, in other words, let
me give you a better thought, which is that there are different
classes of users who have different classes of bugs. The class of
bugs this work is directed at is: I am a non-expert. I am trying
to set something up. It isn't working. And in these cases,
typically, the errors are going to be kind of blatant. They'll
come up on startup. The kind of errors where you pay a support
company a lot of money to fix it for you, those are the hard
errors. No, this technique is not the one you would use for the
hardest possible errors, right. We sort of filtered out everything
they could fix by Googling, everything they could fix by fiddling
for half an hour; all of the easy cases have already been stripped
out.
It turns out that when you automate the easy cases, you don't see
that benefit at a support company. Users will see it. These bugs
do happen. I have a lot of anecdotal evidence that people do hit
silly mistakes; it's just that they don't file support tickets at
an enterprise company for them. There's a longer discussion about
what pieces of heavy machinery are useful at an enterprise support
company, which we can have offline. The answer is a lot, although
admittedly not this.
I suspect that a large fraction of the silly mistakes you hit as
a novice user are misconfigurations. You don't really hit bugs
typically in those cases. So you should expect that the set of
problems that you care about as a non-expert actually is very
heavily skewed towards misconfiguration. So I think the
motivation is still valid. Was there another question there?
Yes.
>>: So it sounded from your methodology like you just generated 18
random errors. Why wouldn't Google help with that?
>> Ariel Rabkin: Because there are two things to say. The first of
which is that people are really bad at documenting these things
publicly. That actually, the state of public discussion of these
sorts of bugs is not amazing. Often, you just get nothing. That's
the first thing to say.
The second thing to say is, JChord has, I think, a dozen users. I
mean, it's a tool that you use if you're a static analysis
researcher, so it does not have highly visible public support.
There's a lot of software that has a dozen users, and I think that
being able to help those users is meaningful. This is sort of
great on the long tail, where you don't have enterprise support
companies and high-traffic forums.
And then particularly for something like Hadoop, which is really a
moving target with lots of versions, the thing that was the right
answer a year ago is not the right answer today. And you actually
get worse than useless results from search engines, because you
get a lot of answers that would have been right three years ago,
and there's no mechanism to deprecate them and say, actually, we
fixed that problem, and if you see this error today, it's a
different cause than it was before. And users get confused by
this.
Next, okay. Let me say a little bit then about my next steps.
As I mentioned, I want to make modern software systems easier for
non-experts. One of the ways that you get in trouble is, as was
mentioned, that we have many bricks. That we have a big software
stack, often something breaks at the top level. Often there was
an underlying cause at a low level and you'd like to tie them
together. Sometimes it's the reverse: there's an underlying cause
at a low level, say some Ethernet cable fell out, and what you see
is a high-level symptom, which is that your overnight script
failed to complete.
And I would like to be able to tie together the symptoms and the
causes sort of through the software stack. I want the sort of
cross layer visibility. I want to sort of drill down through the
layers and tie it all together. Let me be more concrete and talk
about permissions. That if you have a permissions problem, if
something isn't readable, that what will happen is there's some
file read that fails. Some input or output will produce an
exception. And if you are debugging this, what you really want
to know is what is being read, what credentials do I have, what
credentials do I need. You'd like to understand the sort of flow
of authority through the system. And this is currently quite
painful to do, since this isn't really logged anywhere and you
don't have good visibility. Note by the way that this is a
security problem since the users will fix this by making
everything world readable and so you should really worry about
this.
Let me give you another example, which is that the way that
MapReduce and its cousins work is that your data is partitioned.
You run a task on each data partition. One of those tasks might
fail. And a very natural question is why is this task different
from all other tasks? Why did this one fail? And there's this
advantage which is that the situation is symmetric. That it's
really the same code and on similar data.
And there's a technique called statistical debugging that's been
kicking around for a while, for sprinkling predicates through
your program and doing a statistical analysis to figure out where
things went awry. And this is an unusually fertile domain for
it, since it's really the same code and it's all in a totally
managed environment where you can instrument it.
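For concreteness, here is a toy version of the
statistical-debugging idea in this setting: each task reports
which instrumented predicates it observed to be true, and
predicates are ranked by how much more often they appear in failed
tasks than in successful ones. The scoring below is a crude
difference of rates, not the estimator from the statistical
debugging literature.

    from collections import Counter

    def suspicious_predicates(task_reports):
        """task_reports: list of (failed, predicates) pairs, one per task,
        where predicates is the set of instrumented predicates observed true.
        Returns predicates ranked by association with failure."""
        fail_counts, pass_counts = Counter(), Counter()
        n_fail = n_pass = 0
        for failed, preds in task_reports:
            if failed:
                n_fail += 1
                fail_counts.update(preds)
            else:
                n_pass += 1
                pass_counts.update(preds)
        scores = {}
        for pred in set(fail_counts) | set(pass_counts):
            fail_rate = fail_counts[pred] / max(n_fail, 1)
            pass_rate = pass_counts[pred] / max(n_pass, 1)
            scores[pred] = fail_rate - pass_rate      # crude association score
        return sorted(scores, key=scores.get, reverse=True)

    reports = [
        (False, {"key != null", "record < 1MB"}),
        (False, {"key != null", "record < 1MB"}),
        (True,  {"key != null", "record >= 1MB"}),   # the one failing partition
    ]
    print(suspicious_predicates(reports)[:2])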
There's a wrinkle, which is that people don't really write
MapReduce programs that often. They have high level tools like
Hive that compile down to MapReduce. And to make this really
usable, what we'd need to do is invert that mapping and figure
out, from the analysis of where the underlying program went wrong,
what went wrong at the level of the user's abstraction. You have
to sort of translate this back up through the stack.
And this is not an insuperable problem. Debuggers do manage to
do this. But how do you do it for something sort of complicated
and ad hoc, like a MapReduce scripting language? That's a hard
research problem.
Let me sum up. I believe I am now out of time. I've told you
about JetStream, which was about how we can support configurable
degradation policies to manage bandwidth efficiently, and I've
told you about my work on configuration debugging, where we match
errors to causes with static analysis.
More questions?
And thank you for having me.
>>: I'd like to return to JetStream. The second half gave me
some time to think about some questions. So in order to do these
control functions that you're describing, you have to have
knowledge both of the limitation of network bandwidth and the
amount of data that's flowing through it, and then secondly,
you've got to know something about these operators that are going
to compress and/or degrade the quality of the data.
What did you do in both of those areas? Who supplies the
information?
>> Ariel Rabkin: So measuring the data flow is quite
straightforward. We control the flow. We can measure how many
bytes we send out on the network. Likewise, we have
mechanisms in place to measure the latency, and that lets us
watch the queue grow and measure what is the relationship between
the bandwidth that we are using and the bandwidth that the
network is supplying, right.
If the latency doubles when you add ten kilobytes per second of
data, that told you that you needed to back off by a factor of
two.
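A minimal sketch of that feedback rule, assuming we can sample our
current send rate and the observed latency against a baseline; the
constants and the proportional form are placeholders, not
JetStream's actual control algorithm.

    def next_send_rate(current_rate_kbps, latency_ms, baseline_latency_ms,
                       headroom=0.9):
        """Crude congestion response: if latency is inflating, the network is
        not keeping up with our send rate, so scale the rate down roughly in
        proportion to the inflation.  Otherwise cautiously probe back upward."""
        inflation = latency_ms / baseline_latency_ms
        if inflation > 1.5:                      # queue visibly building up
            return current_rate_kbps / inflation * headroom
        return current_rate_kbps * 1.05          # tune back up when things clear

    # Latency doubled after we added load, so back off by about a factor of two.
    print(next_send_rate(current_rate_kbps=200, latency_ms=80, baseline_latency_ms=40))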
>>: You don't really need to know much about the topology of the
network? You just...
>> Ariel Rabkin: We are not measuring the topology. We are
measuring the...
>>: Just latency versus volume, and if you see a problem, then
you just presume there's a bandwidth problem?
>> Ariel Rabkin: We are presuming that we should back off, yes.
>>: Okay.
>> Ariel Rabkin: You could imagine a more sophisticated version
of this that has actual insight into the network and can know
where the bottleneck is and can say this latency is irrelevant to
me. I will keep sending. This is not really feasible in the
wide area, which is the context we're targeting, because we don't
own those networks, and AT&T or internet2 won't tell us where the
bottleneck is. All we know is the packet took this long to
arrive.
>>: And the second half is there's a cost estimation issue of
knowing what these operators do that supplies that.
>> Ariel Rabkin: Yes.
>>: The operator itself is coded with, as I was so, da, da, da...
part of the operator interface?
>> Ariel Rabkin: The interface between the operator and the
controller encodes this. That the operator, when it is able to
change its degradation level, specifies what its levels are in
terms of bandwidth. And for some operators, this is very simple,
right. If it's a histogram that you're down sampling, you know
perfectly well that if you drop half the buckets, you'll have
half the data. That's the easy case. The hard case is you're
doing something like coarsening, where you don't really know.
And in those cases, the operator will estimate.
>>: What happens if there's multiple ways for the controller to
get the desired effect of reduced bandwidth?
>> Ariel Rabkin: The controller has the policy, and the user
explicitly specifies what is the priority to do. What to try
first.
>>: Both in terms of the data sources and in choice of
operators?
>> Ariel Rabkin: The model is that you fix the, quote, operators
in advance. And then the policy specifies which of them to apply
first. All of these operators, you should think of them as a
sort of variable resistor or a sort of variable faucet, where it
can be all on, all off or somewhere in between. And by default,
they're all on, and so the question is which knob to turn.
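To make the knob metaphor concrete, here is a sketch of what such
an operator interface and a priority policy could look like; the
class names, the linear bandwidth model, and the budget numbers
are illustrative assumptions, not JetStream's real API.

    class DegradationOperator:
        """A 'variable faucet': level 1.0 passes everything through, lower
        levels pass less.  output_fraction is the operator's estimate of how
        much of its input bandwidth survives at a given level; exact for
        histogram downsampling, a guess for something like coarsening."""
        def __init__(self, name, input_kbps, level=1.0):
            self.name, self.input_kbps, self.level = name, input_kbps, level

        def output_fraction(self, level):
            return level                   # simple linear model for this sketch

        def output_kbps(self):
            return self.input_kbps * self.output_fraction(self.level)

    def enforce_budget(ops_by_priority, budget_kbps):
        """Policy: turn the lowest-priority knobs down first (last in the
        list) until the total estimated output fits the bandwidth budget."""
        for op in reversed(ops_by_priority):
            total = sum(o.output_kbps() for o in ops_by_priority)
            if total <= budget_kbps:
                break
            excess = total - budget_kbps
            target_kbps = max(op.output_kbps() - excess, 0.0)
            op.level = target_kbps / op.input_kbps if op.input_kbps else 0.0
        return ops_by_priority

    ops = [DegradationOperator("raw counters", 300),   # highest priority, degraded last
           DegradationOperator("histograms", 200),
           DegradationOperator("images", 500)]         # lowest priority, degraded first
    enforce_budget(ops, budget_kbps=600)
    print([(o.name, round(o.level, 2)) for o in ops])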
>>: What I meant by that is the controller could choose to
either greatly reduce one data source's
>> Ariel Rabkin:
Yes.
>>: Data supply in order to get latency down, or it could peanut
butter the thing and just spread it evenly across all of them.
>> Ariel Rabkin:
Yes.
>>: Moreover, for any given data source, it may have a choice of
several operators.
>> Ariel Rabkin: Yes.
>>: In different combinations with the desired effect?
>> Ariel Rabkin: Yes.
>>: So all of that is wrapped up in this?
>> Ariel Rabkin: That is all wrapped up in the policy. And the
policy language we currently have is not infinitely flexible. It
lets you specify the priority of your operators. You could
imagine extending it to cover sort of every other such case. We
think it's a very great advance to have a centralized policy and
a point of mechanism where you can apply it. And that's sort of
the contribution here. We aren't claiming that we have the all
purpose policy language to express every possible data
transformation. That would be nice, but that seems sort of out
of reach of current techniques.
>>: I just have this feeling that without the topology, it's
kind of hard for this to work. And feel free to push back.
Maybe I don't understand. But one of the things I think you
mentioned was that the system was able to degrade the data but
was not able to sort of go back. If the bandwidth becomes more
plentiful, the going-back part wasn't implemented yet, or at least
you guys haven't done that.
>> Ariel Rabkin: So what's implemented today is if you want to
go back and get the data, you'd issue a query and the data will
be copied.
>>: [indiscernible] but the data that is going to come in the
future, that I want now to go back to full resolution because I
have [indiscernible].
>> Ariel Rabkin: Sorry. The system is auto-tuning. It will
tune...
>>: It is auto-tuning?
>> Ariel Rabkin: Yeah, it tunes back up as well as down.
>>: So then I don't understand why you don't encounter flapping
issues, where if you have multiple nodes that shift their bottom
[indiscernible], start to degrade, and those start to sort of...
>> Ariel Rabkin: Yeah, so...
>>: [indiscernible], you know.
>> Ariel Rabkin: So the answer is you see a little of this. By
sort of tuning the control algorithm, you're able to minimize it,
and you get pretty consistent performance and pretty even
performance. The other thing to say is that often the topologies
don't really look like that. If what you're worried about, for
instance, is sensors, typically the first link is where the
bottleneck is, and that's not shared. If you have data coming
from cell phones, there isn't really shared bandwidth there; the
limit is my link to the tower.
>>: Is it fair to say that while the system doesn't have control
over the network, the system could infer where the bottlenecks
are, or maybe not where the bottlenecks are, but which sorts of
operators share bottlenecks?
>> Ariel Rabkin: Being able to either acquire that knowledge
automatically or incorporate that knowledge into the system are
both really interesting research problems that we have not yet
tackled. Yes. It would be nice to handle those cases. We think
it's still useful without them. Next. No? Good, all right,
thank you. This was fun.