>> Jim Larus: Let's get started. It's my pleasure today to welcome Armando Fox
from U.C. Berkeley, where he is the PI of the RAD Lab project, which is quite an
ambitious effort to do a better job of building datacenters and datacenter software,
funded by our friends at Google and our friends at Sun and a host of other companies.
Berkeley and Microsoft -- Microsoft helped start this at Berkeley three or four years
ago now. It's been quite an interesting and successful project. I think Armando
is going to tell us about some of the research that's been going on.
>> Armando Fox: Yep. So thanks for having me. Thanks to Jim for inviting me out.
It's kind of fun to always come here and see a handful of familiar faces in one
capacity or another. Including some people I think of as MVPs of the RAD Lab because
they come to the semiannual retreats and give us their feedback. As you'll see,
we take the feedback very seriously.
I've been trying to wrap up what we've sort of done in the last three or so months,
three to six months, I guess, in terms of the RAD Lab's research agenda, and I was
struck at how often a major turn of direction that we did was influenced by feedback
from someone here, someone at one of the other sponsor companies. So forgive the
assortment of sponsor logos, but as you all know, these are tough times for research
funding so we need to make sure we acknowledge the help of all of our sponsors. It
used to just say DARPA. That was several years ago. That's changed now.
So as Jim mentioned, I'm one of the PIs of the RAD Lab. We have this very ambitious,
kind of moon shot five-year mission to explore strange new worlds. No, that's the
Enterprise's five-year mission. Ours is to enable a single person to essentially
deploy the next large scale next generation Internet service without having to build
an organization the size of Google or eBay to do it. Some of you have seen some
variant of the slide before. It's a five-year project and we're now about two and
a half years into the project. The main features are that to the extent that there
is such a thing as a datacenter operating system or operating environment, we placed
a strong bet that statistical machine learning would be an important component in
how to make policy and analyze the resource needs and performance predictions and
so forth of datacenter applications and that virtualization at various levels would
be the mechanism that that policy would drive.
So we've been creating pre-competitive core technology that spans machine learning
and networking, particularly trying to apply machine learning to look at some of these
problems of resource allocation and performance prediction. There are six of us and
a small army of graduate students that span all of these areas.
So what this talk is about is we started this in early 2006 formally. Where do we
stand now and kind of what direction is the project taking and what can we expect
over the next two years as we mature the technology.
So in early 2007, Luiz Barroso from Google came out and gave a talk roughly
titled The Datacenter is the Computer. The point he was making, which is not lost
on anyone here, is now the datacenter is really taking the place of what we used
to think of as a server. The program is really this sophisticated collection of
distributed components that runs on thousands of computers in one or more buildings.
So warehouse sized facilities and warehouse sized workloads are now pretty common.
Even in just the last three months, we've seen some interesting new ideas for how
to change the packaging of datacenters. You know, this was the -- Luiz's original
slides just showed a picture of a building. And since then, we've managed to add
a picture of a trailer like the Sun Blackbox, and there was a Data Center Knowledge
blog article about how Microsoft tried putting some servers in a tent, sort of the
ultimate in natural air cooling.
Google has talked about doing offshore floating datacenters. So no shortage of
ideas in how to change what's happening in terms of the physical construction of
datacenters. An upshot of this is during one of our recent retreats, we started
getting a consistent level of feedback from Greg Papadopoulos at Sun as they
were getting ready to announce Project Blackbox. They said you should really take
a look at how this finer grain way of deploying datacenters is going to affect the
overall datacenter. Brian Burchett [phonetic], who recently joined the Google
research lab in Seattle, said, you know, this really means that substantially every
interesting application is going to run across datacenters, S, plural, so whatever
you're doing in terms of architecture, you need to make sure it has natural extensions
for that world.
And James Hamilton, whose sort of pet interest with us is storage models and relaxed
consistency and innovative new ways to talk about distributed storage, kept
emphasizing that some of the storage projects we're working on had also better be
multi-datacenter. So we took that seriously and we've decided that now the
datacenters are the computer, which is a slight twist.
Of course, if you think about it, it makes a lot of sense because I think everybody
agrees that conventional designs for large building datacenters have sort of hit
the wall in terms of power and cooling.
And if we're going to talk about true utility computing, elasticity for a single
datacenter is great, but as with a true utility company, you can't bet on only a
single one. If you're running your business in one datacenter and a natural disaster
occurs, you're out of business. So if it's going to be a true utility environment,
we need geo replication anyway. And as James Hamilton and other Microsoft
colleagues pointed out in a recent paper, it also gives you an opportunity to think
about building the datacenters themselves from commodity parts. Like off-the-shelf
buildings. They made the case for condominiums, which is admittedly a little bit
at the extreme end of the argument. The point is well taken that the packaging and
physical plant of datacenters certainly hasn't reached the point where we can sort
of commoditize it and use it elastically the way that, for example, EC2 has
demonstrated in the last couple of years, and I understand there will be a comparable
Microsoft announcement that may be happening even as we speak.
So if this is going to be true, what are the software abstractions and architecture
that will be needed to deal with these cross-datacenter issues, and what kind of
infrastructure and tools will you need to predict and manage the resources that will
be used by applications as they span multiple datacenters. So that's kind of what
we've been doing and what the talk is about.
Here's what I think the story is so far in terms of the project and what I'm going
to summarize our progress on recently. One big theme, like I said, we placed a big
bet on machine learning, and I think the bet is paying off quite well. The bet is
that with machine learning, we can predict and also optimize pretty complex
application resource requirements, and that means that it's okay to use high
productivity languages that favor programmer time, debuggability, and
maintainability over single node performance and then incrementally fine-tune them.
So we've kind of been fighting this little bit of spiritual battle about whether
very, very high level languages like Ruby on Rails, which is our current favorite
framework for developing applications, can really be justified. I'm hoping I can
demonstrate with a couple of examples that we believe the answer is yes.
The second major point is that although we've managed to successfully commoditize
things like the virtual machine, the operating system, and the whole software stack
for horizontally scaling these three-tier applications, there isn't really any
out-of-the-box solution for datacenter-scale storage for these applications. That's
still the holy grail. I don't believe there will be sort of one silver bullet that
will solve all of that. I think there will be many different solutions for storage
in the datacenter. We'll talk about one we've been working on called SCADS that
we think fills a really important gap where there isn't any functionality right now.
It fits the elastic computing model. It fits the idea we can exploit machine
learning to do provisioning and performance prediction. It's a natural fit for the
programming models of these web 2.0 applications. So that's kind of where we're
going.
So taking those one at a time, sometime in 2007 we sort of, you know, I think we
crowned -- there we go, I knew there was a -- we kind of jokingly said we're crowning
Ruby on Rails as the winning language for programming the datacenter.
Now, that certainly doesn't mean we think it will be the only one for writing high
level applications, but we observe that it was a framework that's easy to learn,
has abstractions that are a natural fit for a lot of the popular web 2.0 applications
we've seen, social computing and so on. It has a tool set that's evolved around
it that matches the development and deployment life cycle for service based software.
Frankly, it's a tasteful language. There's something to be said for a language that
avoids getting students, who are future developers, into bad programming habits.
And it certainly has that property.
We thought there was also, from an education point of view, there's a teaching
opportunity here because all of a sudden, we can use a tasteful language and do good
CS pedagogy but have an interesting, relevant project that comes out of it, as
opposed to contrived problem sets. That's important for us as a research lab because
it means we can essentially use the students as guinea pigs. We can use the students'
applications as test applications on our infrastructure, and we can use the students
who are taking some of the more advanced courses to see whether the kinds of
abstractions we'd be building in the RAD Lab would be useful to representative good
developers.
So that's the route that we went. We decided to bootstrap some efforts at Berkeley
and Ruby on Rails. We have two very successful courses we've done multiple
iterations of. One is aimed at lower level undergrads and is really just about
understanding the moving parts of a Web 2.0 application. The more advanced one,
which we're teaching now for the first time, also gets into the sort of operational
aspects of software as a service.
So watching a database tip over under a synthetic workload, finding out where the
bottlenecks are by looking at your query logs -- those are the kinds of operational
issues you have to deal with when you start to scale up the applications, and that's
what we're teaching now in the senior course.
Both courses end up with a term project. Many of the term projects had external
customers by the time the course was done. And this is bearing in mind a lot of
the lower division undergrads have never typed at a command line. Let me say that
again. They've never typed at a command line. So, you know, basic things like
teaching them to use version control and sort of team development skills have been
part of this.
We've also developed a canonical Web 2.0 application we can use essentially as a
benchmark. It's a social networking app that has two completely distinct
implementations in Rails and PHP. It comes with a tool chain around it with
benchmarking and synthetic load generation. We think that, you know, to the extent
we wanted to show that if you can bootstrap people in a highly productive framework,
they can go from zero to prototype in like eight weeks, we've had a number of very
good success stories, again out of undergrads who -- many of whom had never typed
a command line. And in eight weeks, they had an application that was interesting
enough for their own colleagues to use, and I was particularly pleased that I found
out there's a get out the vote drive going on at Berkeley. The application being
used to coordinate volunteers was one of the class projects from sort of a year ago.
So we think we made a good bet on this one.
These high productivity frameworks are great. We don't care that their single node
performance is sort of an order of magnitude worse because we love Moore's Law.
Let's talk about the productivity tax. Because getting into this, one of the issues
in the development community around things like Rails is sort of can you afford it?
Will it scale, which is a little puzzling to me because if you have enough machines
and you're not talking about centralized storage, everything scales. We'll come
back to that.
We actually found, and this is using the benchmark app that we developed, that yes, there
is a productivity tax. Compared to sort of PHP, which was the most recent thing
that Ruby was displacing, the productivity tax is a factor of two to three, meaning
other things being equal, you can support 2 to 3X more users with the same hardware
and meeting the same SLA on PHP versus Ruby. It's hard to do apples to apples.
Basically, you take a good programmer in each language and give them the same
specification of what application to write. It's a social computing application, so
lots of short writes, densely connected graph.
Given this productivity tax -- so it's not an order of magnitude, right, is the first
message. A factor of two to three is manageable. So the question is, how do you
identify where the bottlenecks are, how do you get back to that factor of two to
three? How do you identify, for that matter, opportunities where you can save energy
without sacrificing performance?
Our bet, as I said, has been to use state-of-the-art machine learning algorithms,
and the other piece I'll talk about in this talk is how to make the algorithms
available essentially as services of their own. So that part of what's running in
the datacenter are services that model other things running in the datacenter. It made
my brain hurt last night when I tried to put that in slide form. Hopefully, it will
come across better.
Given that background, here's the outline for the rest of the talk. I'll first talk
about our experience using machine learning to automate tuning, to find
opportunities for optimization, to find performance bugs. And I'll give a couple
of specific examples of work that we've done recently to give you an example of what
kind of results are possible. I'll also show you architecturally how we think we
can put these things into a framework that is one of the services that runs in the
datacenter.
Then I'll talk about this other aspect of can we make storage sort of as trivially
elastic as the stateless parts of these three-tier applications. What would it mean
to do that? What would it mean to make storage sort of scale independent so you
don't have to rewrite your application and reengineer your storage every time you
scale by an order of magnitude.
So to that end, I'll describe the work in progress on SCADS, which is a Scalable,
Consistency-Adjustable Data Store that fills in gaps left by existing data storage
solutions that are out there. We are doing some work on batch jobs with things like
Hadoop and improving Hadoop scheduling and so forth. Strictly because of time
reasons, I'm not going to talk about them. We're focusing in this talk just on
interactive applications, but the lab as a whole is doing much more than that.
So let me start with the machine learning part and with some Diet Coke. And I'm
going to present two short stories about machine learning. Both of these are
examples of using a state-of-the-art set of machine learning techniques mapped on
to a problem of diagnosis or performance debugging, and kind of what they have -- they
have two things in common. One of them is that they solve a problem where sort of
trivial techniques would not have worked. So the kinds of techniques that you would
get from sort of linear algebra or basic statistics just didn't give good enough
results. The message from that is machine learning really is a technology. It's
a technology that has actually advanced tremendously in the last decade, and it's
seeing a Renaissance in part because there are algorithms that were not practical
to run 10 or 15 years ago but are practical to run now.
There are also algorithms that were practical only in offline mode 10 or 15 years ago but now
can be run essentially in online mode, inducing new models in near realtime. So
there's a real opportunity in thinking about machine learning as a first-class
technology. The second part that these have in common is there's a big challenge
in how to take on the one hand a well-understood machine learning technology and
map it on to the physical features of some systems problem you want to solve.
There's actually a lot of subtle detail in getting the mapping right, and that really
requires collaboration between machine learning experts and domain experts in
architecture, systems, and software. If I had an uber message about this, it's that you should
hire people like that. Not me, I'm not looking for a job right now, but I know plenty
of such people.
Two short stories about machine learning to give you a flavor of this. The first
has to do with analyzing console logs. So console logs are the things where your
debug print statements tend to spew stuff and even your production logging statements
tend to spew stuff. In general, they have the property that the console log messages
are often useful to the person that wrote the code that generates that log message
and not hugely useful to others. In fact, even the operators, who are often the
ones that care about diagnosing problems, have trouble using the console logs because
the kinds of problems operators care about may be dependent upon the run time
environment. So the specific messages developers might put in might not point to
directly what the problem is.
So what one of our students has been doing is applying a combination of text mining
and anomaly detection techniques to essentially unstructured console logs from
different applications. In a recent case study that he did, he ran sort of a 20,
25 node instance of Hadoop on EC2. The -- well, this is out of date. We actually
did a paper that doubles the size of that right now.
But basically, those 20 odd nodes generated a combined total of a little over a
million lines of console log. And this is just statements developers have put into
the code.
What he does is combine the unstructured log with the source code that
generated it -- and, you know, in this open source world, it's not unreasonable
that you could get the source code. And by combining those two, he was able to extract
structured features out of the log. And in a number of cases, those features could
not really have been extracted reliably unless you combined source code analysis
with analyzing the log. So just looking at the log in isolation would not have been
enough.
Then he can identify unusual features using anomaly detection techniques, like
principal component analysis. And from the principal component analysis, which
separates out the normal from the abnormal log features, he can then induce a decision
tree, which I don't know if people are familiar with decision trees, but it's
basically a way of classifying. You start at the root and you evaluate each
condition. The condition tells you the extent to which each factor would influence
a diagnosis of normal and abnormal behavior.
So the advantage of something like a decision tree is that a system operator who
doesn't know anything about machine learning and principal component analysis could
take a look at this and essentially use it as a flow chart to figure out how likely
that this combination of log messages, when seen in this sequence, is indicative
of a problematic behavior.
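As a rough illustration of that last step -- a decision tree induced over log-derived features whose normal/abnormal labels come from the anomaly detection -- here is a minimal sketch using scikit-learn and made-up data; it is not the actual RAD Lab tool, and the feature names are hypothetical.
```python
# Minimal sketch: fit a decision tree over log-derived feature vectors whose
# normal/abnormal labels came from an anomaly detector, then print it as the
# kind of flowchart an operator could follow without knowing any ML.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical message-count features: rows = time windows (or blocks),
# columns = counts of each log-message type.
X = np.array([
    [12, 3, 0, 1],   # normal
    [11, 2, 0, 0],   # normal
    [13, 3, 1, 0],   # normal
    [12, 2, 9, 0],   # abnormal: message type 2 spikes
    [10, 3, 8, 7],   # abnormal: types 2 and 3 spike
])
y = np.array([0, 0, 0, 1, 1])  # 0 = normal, 1 = abnormal (e.g., from PCA)

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=[f"count_msg_{i}" for i in range(4)]))
```
The printed rules read top to bottom like the flowchart described above: each condition is a threshold on one log-message count.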
The innovation here is really combining the text mining of the log and feature
extraction with a way to turn that into a decision tree automatically. Yes?
>>: Examples of the features that you've [inaudible] structured log and source
code.
>> Armando Fox: So one example of a feature is a message count vector. You break
down all the different possible types log messages that could ever be generated,
and that's a step where you need to do source code analysis. You also figure out
for each of those messages what parts of those messages is constant text versus which
parts came from variables, and you can use the source code to figure out what the
liveness paths of those variables were.
Now you group all things by the values of the variables and you say, for each value
of the given variable, how many times did each type of log message print the variable
when it had that value. Then you can use principal component analysis to figure
out which log messages are abnormal with respect to some threshold. He used the
Q statistic for thresholding, but you can do it other ways.
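A small sketch of that pipeline, with made-up counts rather than real Hadoop logs: one count vector per value of the grouping variable (here, a block ID), a PCA model of the dominant normal behavior, and a residual cutoff standing in for the Q statistic.
```python
import numpy as np
from sklearn.decomposition import PCA

msg_types = ["receiving", "received", "replicated", "exception", "retry"]
blocks = ["blk_1", "blk_2", "blk_3", "blk_4", "blk_5"]
# Rows: per-block counts of each message type. In practice these come from
# matching log lines against message templates recovered from the source code.
X = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [3, 3, 3, 0, 0],
    [4, 4, 4, 0, 0],
    [1, 0, 0, 1, 2],   # the odd one out: exceptions and retries
], dtype=float)

pca = PCA(n_components=1).fit(X)                 # dominant "normal" subspace
recon = pca.inverse_transform(pca.transform(X))  # project back into count space
residual = ((X - recon) ** 2).sum(axis=1)        # energy outside that subspace
threshold = np.percentile(residual, 80)          # stand-in for a Q-statistic cutoff
for b, r in zip(blocks, residual):
    print(b, "ABNORMAL" if r > threshold else "normal", round(float(r), 2))
```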
By the way, I would normally apologize for trying to squeeze an eight-page paper
into one slide. But the goal is two short stories about machine learning, so I'll
try to keep some time at the end to go into the gory details.
You know, what did he find is what you care about, right?
One example: he found a rare failure scenario in Hadoop that, when it occurs, could
lead to under-replication of data. Of course, we haven't seen data disappear in
our 20, 25 node experiment. But now that we're talking about running things on
thousands of nodes routinely, failures that are supposed to be rare become a lot
less rare. So, in fact, a case where under replication occurs might be important
because part of the premise of the Hadoop file system is that replication is one of
the key strategies for durability.
It's a bug that could only be detected this way; if you were trying to detect it by hand,
you would essentially have to analyze the block-level logs from the Hadoop file system.
Even with the small installation, a million lines of log is more than anyone is going
to do manually. As a kind of a side bonus, he discovered that a particular error
message that occurs pretty often in the logs that has these ugly looking exceptions
and warnings in it, actually turns out to be a normal behavior in the sense that
it doesn't cause a correctness problem, but it is a potential area for improvement
in performance, because it's a performance bug. Basically, a bug that leads to
needless retries of a file system write.
So you know, this is basically without any human input at all, right. The operator
of the system didn't need to supply any domain expertise at run time in order to
create the model. That's one example of applying some nontrivial algorithms to pull
out some information that could not have been extracted manually or with simpler
means.
A second short story about machine learning has to do with finding correlations
between multidimensional data sets. And in a second, I'll give you two quite
different domain examples of how this is useful. But to explain the concept first,
the idea is that you have two high-dimensional data sets that have a one-to-one
correlation between data points. For each data point in one set, it has a
corresponding point in the other set.
And the goal is to find highly correlated sub-spaces of those two data sets.
Intuitively, what you're trying to look for is sub-spaces such that when you project
each data set on to the corresponding sub-space, you essentially preserve locality.
So things near each other in one projection will end up near each other in the other
projection.
That so far is what's been called canonical correlation analysis. That's been
around for 40-plus years, but a recent innovation to it really just in the last five
years is the idea that you can use a kernel function, as opposed to just a Euclidean
distance or other linear distance function so that if you have non-numeric data and
are trying to find some notion of locality or similarity between the two points,
you can do it in a less trivial way.
So again, this is really just in the last five years that this has been done. What
this looks like graphically is the goal of the algorithm is to find these matrices
A and B so that when you project each data set down on to the sub-space you'll get, for
example, the little red thing being a previously unseen point in the left-hand data
space -- I'll explain in a moment what static application features are.
It's a simple matter to project that point down to the corresponding sub-space. We
can identify what its nearest neighbors are. We can also then identify where the
nearest neighbors of these three previously seen points are in the right-hand side.
We can interpolate to find what we believe the approximate location of this
previously unseen point would have been, and then we can use some heuristics to
identify which points these correspond to in the unprojected space.
So that's pretty abstract. Concretely, what would this be useful for? Imagine that
on the left-hand side, what you have is static features of some applications. For
example, the features of a query whose performance you're trying to predict. On
the right-hand side, you have training data that for each query that you tried, you're
able to measure various aspects of its performance, using standard, you know,
performance counters and metrics available through things like OpenView. And your
goal is to understand what's the relationship between a given query and the
performance that you observe.
In particular, what you'd like to do is if you're now handed a query you've never seen
before, you can find queries that in this projected sub-space are similar to it.
You know the performance of these so you can interpolate what the performance of
the unseen query would have been in this projected space. Now, you look at the ones
you've seen before, and you basically say, well, I'm going to make a guess that in
raw performance metric space, that's roughly what its performance is going to be.
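To make the workflow concrete, here is a toy sketch. The actual work uses kernel CCA; this sketch substitutes the plain linear CCA in scikit-learn, uses synthetic data, and simply interpolates the raw metrics of the nearest projected neighbors rather than mapping back through the heuristics mentioned above.
```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n = 40
X = rng.normal(size=(n, 6))                  # static features (e.g., from query plans)
W = rng.normal(size=(6, 3))
Y = X @ W + 0.1 * rng.normal(size=(n, 3))    # measured metrics (runtime, messages, ...)

cca = CCA(n_components=2).fit(X, Y)
Xc, Yc = cca.transform(X, Y)                 # training points in the correlated subspaces

x_new = rng.normal(size=(1, 6))              # a query we have never run
x_proj = cca.transform(x_new)                # its location in the projected feature space

# Average the raw measured metrics of the k nearest training queries in the
# projected space, weighting by inverse distance.
k = 3
d = np.linalg.norm(Xc - x_proj, axis=1)
nearest = np.argsort(d)[:k]
w = 1.0 / (d[nearest] + 1e-9)
y_pred = (w[:, None] * Y[nearest]).sum(axis=0) / w.sum()
print("predicted metrics:", np.round(y_pred, 2))
```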
There's a lot of assumptions embedded in doing this. But yes?
>>: [Inaudible question].
>> Armando Fox: No, you can't. And that's another example of why this is an example
of machine learning as a technology. Trying to operate on the raw data space, it's
very difficult to extract what these correlations are. But the projection is what
gets you this locality preservation and what allows you to make these guesses about
what the interpolation is. Yes?
>>: [Inaudible] projection because in many situations I can see that it's not
[inaudible] projection. There may be a bunch of [inaudible] attributes. And to
project on to that space, then you may get a much more robust predictor. Again,
I'm talking in abstract.
>> Armando Fox: Right. Well, in general, that's what you're doing, right. You're
looking at these as derived attributes from this, and the derivation is this matrix
you're trying to find.
>>: [Unintelligible].
>> Armando Fox: That's part of the algorithm.
>>: [Inaudible].
>> Armando Fox: No, we find it automatically.
>>: Magic in big scare quotes. [Inaudible].
>> Armando Fox: Right. Well, you know, what I've found, and it's kind of
encouraging, I'm not a machine learning person, by the way. I've tried to osmose
a little bit of it in the last two or three years. I'm really from the systems world.
What's promising methodologically about this line of approach is that to
oversimplify only a little bit, it's relatively easy to understand how these
algorithms work and to apply them. I mean, easy with quotes, right. What's hard
is proving things about how good the algorithms are. So the machine learning papers
that talk about this stuff are focused on why they work. Understanding why they
work is difficult. Actually understanding how they work is accessible to even
people like me.
>>: [Inaudible] how do you know this is robust? Part of the problem here is a
breakdown of certain points, there are in-built assumptions here. For example, I
can take the last slide and say therefore you have solved the fairly difficult problem
of predicting the execution time of complex SQL, okay? Well, you know, this
can have a revolutionary effect on the industry, basically.
>> Armando Fox: That would be nice.
>>: Yes. So I think the basic question becomes practically how rigorously do you
[unintelligible]?
>> Armando Fox: Keep in mind that the way that we get the initial data points is
by actually running the queries and measuring the performance, right?
>>: [Inaudible] against a larger data set, a very different distribution of the
data where the skew could be widely different.
>> Armando Fox: Sure.
>>: There are a lot of features that are very significant [unintelligible] and a
good way to proceed maybe is very hard, right?
>> Armando Fox: Yes, so let me say two things about that. And hopefully, we'll
get to talk a little bit afterward. One of them is that there is -- if we kind of
step back from this approach, even, there is a number of general issues about applying
machine learning to any given domain where there are fundamental problems that are
not problems of the algorithm. To oversimplify what I believe you said, if there
are features and interactions that I never see in my training data, but as it turns
out to which my results are very sensitive, if I sometime in the future try to do
prediction on queries that exhibit those features, I'm likely to do poorly. And
the only way that I'll know about it is when I actually run the queries, I'll realize
that my results don't match, right?
So that is a general methodological issue with any machine learning algorithm applied
to any systems problem. There's no magic bullet solution to it. But one of the
promising things that has happened in the last five, ten-ish years is a lot of the
processes for inducing these models, they now run fast enough on commodity hardware
that you can be inducing models all the time. So, in fact, there was some other
work that I did, which I'm not presenting in this talk, from about three or four years
ago collaborating with HP Labs.
We would be building signatures to capture the most important metrics that were
indicative of a problem in an online system. So in other words, when you had an
SLA violation, what are the four or five most important metrics and their trends
that contributed to that? And the way that we dealt with the instability problem
in that system was that every two minutes, we would build new models and we would
compare regressively whether they were doing better than the ones we had.
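A bare-bones sketch of that model life cycle, with placeholder names and a ridge regression standing in for whatever model is actually induced: refit on a sliding window, and keep the new model only if it beats the current one on the most recent held-out slice.
```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def maybe_refresh(current_model, window_X, window_y, holdout_X, holdout_y):
    """Refit on the window; return whichever model scores better on the holdout."""
    candidate = Ridge().fit(window_X, window_y)
    new_err = mean_squared_error(holdout_y, candidate.predict(holdout_X))
    if current_model is None:
        return candidate
    old_err = mean_squared_error(holdout_y, current_model.predict(holdout_X))
    return candidate if new_err < old_err else current_model

# In a live system this would run on a timer (say, every two minutes), with
# window_* holding recent instrumentation and holdout_* the newest slice.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=200)
model = maybe_refresh(None, X[:150], y[:150], X[150:], y[150:])
```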
>>: [Inaudible] and this new query happened to be in a concurrent execution, the
same block of data, because effectively there is hot spots [inaudible]. And this
kind of [inaudible] is extremely hard to model. There's a problem with the
robustness of the approach.
>> Armando Fox: Yes, we're in agreement. They're extremely hard. And by
definition, a hot spot is a dynamic condition, right? I mean, if it was a static
hot spot, it's not really a hot spot. You'd be able to see it in your training data.
So I think -- I hope we get to talk more offline, but the general answer is, in many
cases, what you really need to be doing is essentially inducing models all the time
and comparing whether there's something about reality that has changed.
There's a model life cycle issue that is really what it gets down to. When your
model diverges from reality, has reality changed, or is your model actually finding
that there's a problem? That's an open question. I'm not claiming that we have
a sort of solution for that in every case.
Briefly, the approach I just described, we've actually executed it on two quite
different cases. One of them is, in fact, long running queries in a business
intelligence database. The static features are derived from the execution tree that
the query planner gives you, which it can hand you in sort of under a second. The
dynamic features, in other words what you measure when you're running your training
data, are based on system instrumentation such as what you can get from OpenView
or any number of other instrumentation systems. The goal was to predict
simultaneously the query running time and the number of messages exchanged among
the parallel database nodes.
Then we did a second scenario which had to do with tuning a stencil code, which is
similar to running a multidimensional convolution for a scientific application on
a multicore processor. If you've done this kind of work, you know there are a number
of different compiler optimizations you can turn on. Not all of which give you
additive benefit. There are parameters for things like software blocking for the
matrix -- for the convolution step, and the stride and the padding, the amount of software
pre-fetching, and there's a -- we submitted a paper on this recently, where we showed
if you wanted to exhaustively explore the parameter space, even for a relatively
simple problem, you'd have to be running experiments essentially for a month.
The dynamic features obviously are what you can get from CPU performance counters.
So not only running time, but how much memory traffic it generated. In our case
we also looked at energy efficiency.
In those scenarios what did we get? For the business intelligence queries, 80% of
the time, we were able to predict the execution time and the number of messages
simultaneously, both within 20% of observed. And this is, you know, using standard
n-fold cross validation on the training data.
If you remove the top three outliers, which were far outliers, we have a very
good correlation fit. More interestingly, when we use the same model to predict
the performance if we were to scale the cluster from four up to 32 parallel servers,
we got a very good fit for those predictions as well.
In fact, the predictions are better than those that the query planner itself handed
back as its prediction for execution time. So you know, again, open problems remain
in sort of instabilities and unseen features, but I think the idea here is that this
is the right direction to go. And by the way, we did try single and multivariate
regressions on this, neither one of which worked particularly well.
For the scientific code running on multicore, we were able to get within two to five
percent of what a human expert who understood the domain and the hardware had been
able to do. In fact, if you look at the sort of theoretically possible stream
bandwidth from memory to keep the multicore processor busy, we got within 95% of
the possible maximum.
We also identified sets of optimizations that don't give you additive value, which
is nontrivial. So if I turn on Optimizations A and B, I don't get the sum of the
values of Optimization A separately from B. We also identified configurations where
the performance was essentially the same, but there was slightly better energy
efficiency.
On the relatively small problem we looked at, the energy efficiency was not better
enough to get excited about, but the fact that we were able to identify it as such
is potentially something to get excited about. Yes?
>>: [Inaudible] would you have to construct separate models for each of your input
data sets [inaudible]?
>> Armando Fox: In the multicore case, what we're basically doing is we're using
the modeling technique to explore the parameter space more efficiently. We're
exploring a very small subset of the parameter space, and we used a semi-greedy
heuristic to try to figure out which things to try next.
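A toy version of that model-guided, semi-greedy exploration might look like the following; the parameter names and the synthetic measure() function are invented stand-ins for compiling and running the kernel, not the actual auto-tuning setup.
```python
import itertools, random
import numpy as np
from sklearn.ensemble import RandomForestRegressor

space = list(itertools.product([16, 32, 64, 128],   # block size
                               [0, 1, 2, 4],        # prefetch distance
                               [0, 1],              # unroll on/off
                               [0, 1]))             # SIMD on/off

def measure(cfg):
    """Pretend runtime (lower is better); in reality: build and run the kernel."""
    b, pf, unroll, simd = cfg
    return abs(b - 64) / 64 + abs(pf - 2) * 0.3 + (1 - unroll) * 0.2 + (1 - simd) * 0.4

random.seed(0)
tried = {cfg: measure(cfg) for cfg in random.sample(space, 8)}     # seed sample

for _ in range(5):                                    # a few semi-greedy rounds
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(np.array(list(tried)), np.array(list(tried.values())))
    untried = [c for c in space if c not in tried]
    preds = model.predict(np.array(untried))
    picks = [untried[i] for i in np.argsort(preds)[:2]]            # exploit the model
    picks += random.sample([c for c in untried if c not in picks], 1)  # keep exploring
    for cfg in picks:
        tried[cfg] = measure(cfg)

best = min(tried, key=tried.get)
print("best configuration found:", best, "with cost", round(tried[best], 3))
```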
>>: When you're looking at [inaudible].
>> Armando Fox: It's actually all one model. It's a single model that considers
both types of parameters, all as inputs.
>>: [Inaudible] might change, depending on the data you're feeding into the stencil
code.
>> Armando Fox: Actually, for stencil code, it doesn't change that much. Stencil
codes are fairly uniform. We're actually doing, one of the things we're looking
at next is multigrid methods. Those have granularity that kind of changes over time,
you know -- they're problems that don't have the same regular structure as
a stencil. In fact, one of the interesting things about doing this in the multicore
domain -- I'm not an architecture guy. I used to be a long time ago. If you compare
what are the obstacles to applying this methodology in the systems world versus to
applying it in multicore architecture world, you know, in the systems world, it's
hard to get data. Runs take hours to do. You know, it's pulling teeth to get the
data out of companies, certainly.
The problems are very dynamic. They tend to be workload dependent and very workload
driven. In the scientific computing world, the workloads tend to be more stable.
Experiments take minutes or hours to run rather than days to run. You can get the
data by running things on your desktop. So I actually predict that there's going
to be a lot of interesting results in this area. And it's going to outpace what
people are using machine learning for in systems, because it's so much easier
mechanically to go through the methodology and get the models built. But that's
neither here nor there.
Okay. So two short stories about machine learning. The moral of the two short
stories is as we hoped, it looks like it's a good bet that machine learning is a
promising technology for both resource prediction and potentially problem
diagnosis.
The next natural question is how do you make that technology available generally
to other pieces of software that are running in the datacenter? And roughly, the
answer from us is we'll make them services of our own, using standard service oriented
architecture APIs.
Here's an example of how this might be done. For those of you coming to our
semiannual retreat, we're expecting to demo this there. The idea is that you have
always-on instrumentation feeds, and we actually have two projects in progress,
both of which are open source for essentially building instrumentation adapters to
different types of software that might run in a datacenter and making them available
in a software bus kind of way to whoever wants to consume them.
In our case, we would have modules that we call advisers that basically encapsulate
the closed loop of observing data, deciding whether to take some action, and making
a recommendation of what to do. And in making those recommendations, the advisers
can consult essentially these encapsulated machine learning models, where they'll
say here's a set of data, and I would like you to make a prediction about something
or I would like you to flag anomalous or interesting events in this data set.
So the algorithms that I just explained are being encapsulated as we speak
into these little models, and we're working on a closed loop version of this that
we hope will be ready to demo in January. It's really all just machinery, right.
There's no new technology sort of in this diagram. It's just a way of incorporating
this machine learning technology into the software architecture we already
understand.
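A skeletal sketch of that wiring, with invented class and method names rather than the RAD Lab APIs: an adviser watches an instrumentation feed, consults an encapsulated model service, and hands a recommendation to a director that arbitrates and records every decision against a versioned configuration.
```python
from dataclasses import dataclass

@dataclass
class Recommendation:
    action: str          # e.g., "add_app_servers"
    amount: int
    reason: str

class ModelService:
    """Stand-in for an encapsulated machine learning model exposed as a service."""
    def predict_load(self, recent_metrics):
        return 1.2 * max(recent_metrics)      # toy forecast

class UtilizationAdviser:
    def __init__(self, model, threshold=0.8):
        self.model, self.threshold = model, threshold
    def advise(self, recent_metrics):
        forecast = self.model.predict_load(recent_metrics)
        if forecast > self.threshold:
            return Recommendation("add_app_servers", 2, f"forecast={forecast:.2f}")
        return None

class Director:
    def __init__(self):
        self.config_version, self.audit_log = 0, []
    def step(self, advisers, recent_metrics):
        recs = [a.advise(recent_metrics) for a in advisers]
        recs = [r for r in recs if r is not None]
        if recs:                                  # trivial arbitration: first one wins
            self.config_version += 1
            self.audit_log.append((self.config_version, recs[0]))
            return recs[0]
        return None

director = Director()
print(director.step([UtilizationAdviser(ModelService())], [0.55, 0.70, 0.75]))
```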
The director for the moment is going to be pretty simple, because it's just
arbitrating recommendations from a small number of advisers. But from a machine
learning point of view, we've identified opportunities for multilevel learning
within the datacenter. That's temporarily sort of hidden inside the decision box.
Lastly, the configuration changes that are being made, the configuration is
essentially versioned as its own first class versioned object, and it's available
to the models and advisers. The idea being that what you're going to recommend doing
is a function of the current configuration and the observed instrumentation.
So again, there isn't any rocket science technology here, but it's just what are
the moving parts that would be required to convert essentially MATLAB programs that
generated the results in the papers into things that could be used in a near online
mode in a real system. Yes?
>>: Are you keeping track of the, I guess, recommendations made by the advisors
and the actions taken? I mean, as a datacenter admin, it sure would be nice to know
why a particular machine was rebooted. What were the --
>> Armando Fox: Yes, we are keeping an audit log. Don't worry. We thought about
that, by the way.
So kind of what's the -- where are we in terms of our use of machine learning?
The kind of vision is that there will be always-on instrumentation. The machine
learning technology will be embedded in simple, service oriented architecture
modules that make them available as a datacenter service.
We believe that based on the successes we've had so far with machine learning that,
among other things, this can help compensate for hardware differences across
different clouds or datacenters. Because as you've seen, we can use it to construct
models of what the resource needs of applications are going to be over relatively
short time scales.
By January, we hope to have a good demo of packaging these things in the director
framework. It's really just its own set of services that can be used as part of
automating the datacenter. That was part of the SML part of the story.
Let me switch gears to the storage part of the story and start with Facebook, the
Web 2.0 poster child. It's interesting the way that they've had to incrementally
grow and reengineer their storage as the site has grown. I'm sure these numbers are
now out of date because they're from like a month ago. But the bottom line, assuming
you all use Facebook, which I'm going to assume that at some level you do. Just
admit it. I know you have.
They've federated nearly 2,000 MySQL instances to hold the various graphs that
capture the relationships among their users and the other objects in Facebook.
That's fronted by about 25 terabytes of RAM cache, mostly in the form of memcached,
and the memcached interaction with the database is managed by software they've
written in house.
As an example of what's happened, when you log into Facebook and just for the benefit
of the one person out there who may never have used it, one of the things you see
when you log in is recent news items from your various friends. Friend X did this,
Friend Y joined a cause, Friend Z added someone else as a friend. That's actually
regenerated by a batch job.
You might think a natural way to do this is when I do something, that there is some
incremental update process that would happen whereby my friends eventually would
get notified of that thing. That's not actually how it's implemented for a variety
of reasons. It's a batch job that, in fact, can result in the data being stale on
the order of hours. So non-optimal.
They've reengineered the storage system multiple times, as they've tried to add new
features to the Facebook application and the scale's changed.
There's a long litany of other companies that have similar stories. One of the
reasons we've chosen Facebook is because they're very visible. Their social network
is a good example of why you can't sort of partition your way out of the problem.
And frankly, they've been extremely open in talking to us about their architecture
and the problems they've had, which has been very useful from an education point
of view.
So what we're after, and how would we like to solve this problem? I think we've
settled on calling it data scale independence. The big innovation of databases sort
of in our time or certainly mine is that you can make them essentially -- you can
make your application logically independent of the particular database schema,
right, that changes to the underlying schema need not result in you fundamentally
restructuring the storage elements of your application.
Today, that's not the case for scale. If you change by an order of magnitude the
scale of your application, the number of users, you may, in fact, have to make
fundamental changes to your application logic to accommodate that, and that's not
good. So the goal is that as the user base grows, you don't have to make changes
to the application to keep up with it. The cost per user doesn't increase. So those
of you who have used Animoto, which is the new poster child for elastic computing --
Animoto lets you automatically make a music video by matching up a song and an album
of photos you have, and will sort of figure out where the texture and tempo changes
are in the song, and stitch together your media to make a video for you.
This is the day they created a Facebook version of their app, and they started signing
up 35,000 people per hour that day. That's a good growth spike, right? And the
request latency doesn't increase as the number of users scales up, and you can do
dynamic scaling both up and down. That's the case for scaling up. The case for
scaling down is that a lot of sites that are medium sized have enough diurnal
variation you'd be leaving money on the table by not undeploying, especially now
that cloud computing makes it possible to do it at a fine grain, there's every motivation
to undeploy resources that you're not using.
So being able to scale down as well as up, which I think is kind of a new thing that
we didn't take as seriously before fine grain cloud computing, is part of data scale
independence.
So what are the functional requirements for this data scale independence?
Interactive performance. That's kind of a given. As I've said, we are working on
batch applications as well. For the focus of this talk, interactive performance
means when I do a query, I expect results in web realtime, so a couple of seconds.
The really missing piece, we think, is a data model that naturally supports the richer
data structures that are used by these social and Web 2.0 applications.
There are examples out there now of very highly scalable data storage but whose
data model is pretty simplistic. So BigTable, Amazon S3, and Dynamo. Cassandra,
which is essentially an open source clone of BigTable, which we're using as part
of this project.
But none of them really support more than key value or key value plus column family
addressing, and that's not the data model that the application writer needs to work
with. These application writers are doing graphs of connected entities. They're
writing applications that are short write intensive. You know, you can tag things
or poke your friends or whatever. So you can't shard your way out of the problem
because the graph doesn't partition terribly well, and you can't cache your way out
of the problem because the write workload won't permit it.
Scaling up and down, we already said. Support for multiple datacenters is sort
of a point of departure for us now. This one is not need but tolerate: once
you paint yourself into a corner with relational storage, you ask where you can relax
the ACID guarantees in order to get scalability and availability. The classic
answer is you can do it by relaxing consistency.
In fact, Facebook is sort of an existence proof that these apps will tolerate relaxed
consistency, even if they'd rather not. Now, they wouldn't want it to be hours
before a change is visible to all of your friends, but minutes is conceivably okay.
Now, there's been a question for some time as to whether a developer can use this
effectively. In other words, are developers confused by a data model that does not
give you this guarantee that once I commit a write, it's really committed for
everybody.
Our view is that developers could use it if the consistency was something they got
to specify in advance and could quantify. And if we could use simple things like
session guarantees on the client side to give the developer a model that makes sense
to them. In other words, if I do a write, it may take a while for my friends to
see the write, but if I now do a readback of that thing, I'll see the thing I just
wrote. In fact, a number of the problems with other relaxed consistency approaches
have been that they don't give you even the simple session guarantees. So S3, for
example, doesn't necessarily guarantee this.
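A minimal illustration of the read-your-writes idea, with hypothetical classes rather than the SCADS API: a per-session overlay of the session's own writes sits in front of an eventually consistent store, so a readback after a write is never surprising even while the write is still propagating to everyone else.
```python
class EventuallyConsistentStore:
    """Toy backend: writes become visible to readers only after propagate()."""
    def __init__(self):
        self.visible, self.pending = {}, {}
    def put(self, key, value):
        self.pending[key] = value
    def get(self, key):
        return self.visible.get(key)
    def propagate(self):
        self.visible.update(self.pending)
        self.pending.clear()

class Session:
    def __init__(self, store):
        self.store, self.my_writes = store, {}
    def put(self, key, value):
        self.my_writes[key] = value       # remember our own write
        self.store.put(key, value)
    def get(self, key):
        if key in self.my_writes:         # read-your-writes session guarantee
            return self.my_writes[key]
        return self.store.get(key)

store = EventuallyConsistentStore()
me, friend = Session(store), Session(store)
me.put("status:armando", "giving a talk")
print(me.get("status:armando"))      # 'giving a talk' -- I see my own write
print(friend.get("status:armando"))  # None -- not yet visible to others
store.propagate()
print(friend.get("status:armando"))  # 'giving a talk' -- eventually visible
```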
So SCADS is a project we've been working on for several months that attempts to
address these needs. For interactive performance, we rely on a performance-safe
query language that can say no. In other words, you can only essentially do queries
for which you have precomputed indices.
If you try to add functionality to your application that can't be satisfied with
the existing indices, then you have to change the schema in order to do it. A data
model that matches social computing. SCADS uses a number of indices to give you
the ability to put together object graphs. As I'll show in a second, the updates
to multiple indices that occur as the result of a single write are managed by separate
priority queues, and this gives you a way to talk about relaxed consistency.
Scaling up and down, we believe we will be able to use machine learning to induce
models on the fly of how SCADS is working so that as we observe more and more queries,
we can actually do a reasonable job predicting the resource needs of these queries,
and we can scale up and down as needed in response to that. And because one of the
goals of SCADS is declarative specification of relaxed consistency, we believe that
gives a natural extension to talking about multiple datacenters.
In fact, Facebook's current architecture uses multiple datacenters, but their
relaxed consistency has been engineered into the mainline logic of the application,
because there's no way to do it in terms of the MySQL and memcached layers they've
been using. Yeah?
>>: To what extent does your declarative approach -- maybe we'll talk about
this -- but to what extent is it dependent on how you choose to partition your data
and how you choose to direct users to locations. Depending upon how I partition
my data or users' requests get served, you may or may not be able to guarantee that
session consistency you're talking about.
>> Armando Fox: Right. So the short answer to your question is if there are cases
where there's a consistency guarantee that is incompatible with the availability
desire, then the developer will have had to specify which one is going to win. In
other words, will you serve stale data, or will you just return an error and say
no?
Now, as far as the extent to which you are -- the performance depends on the way
that data happens to be laid out and how many replicas you have for something that is
about to become hot, that's exactly the case where we're betting that we can adapt
the machine learning approaches from the first half of the talk to actually make
those decisions in close enough to realtime that we can essentially say we need to
be adding a certain amount of capacity, let's say, over the next few minutes. Or
we have capacity that's been observed to be idle or idle enough that we can start
consolidating again. So the answer is that yes, it's obviously sensitive to how
your data is partitioned and replicated, but we hope we can keep up with that in
a few minutes lag granularity so that we can actually make adjustments to that on
the fly.
>>: [Inaudible].
>> Armando Fox: That's okay.
>>: You can dynamically, in some sense, replicate and de-replicate data?
>> Armando Fox: We are, in fact, part of -- so originally, we were -- we didn't
know how we were going to build that ourselves with graduate student resources. But
if you're familiar with Cassandra, Cassandra actually allows us to outsource some
of that. Cassandra gives you a big table like API and actually handles replication
itself, and you can do replication changes online. You can replicate and
de-replicate online. And I think you can do a limited amount of repartitioning
online as well. We're hoping that some of the machinery for doing that will be
handled by Cassandra. The policy for how to deploy that machinery will come from being
able to induce models that successfully tell us what we actually -- you know, the
number of replicas, let's say, we need, for a given chunk of the data set.
That's the outline of the approach. This is in the process of being built, so the
answer to this question may change over the next three months.
So what do we mean by performance-safe queries? We'll support a SQL-like, probably
some sort of SQL query language, augmented with cardinality constraints so that
you can always make sure that you can return the answer to any query essentially
in constant time, or in time that doesn't scale with the number of users.
So we have multiple indices, depending on which queries you want to do. You can
only ask queries for which there is an existing index. So you don't have an option
of allowing a query to just run slow. And the indices are materialized offline.
That means if you want to add a feature to your application that would require a
query for which you have no current index, you have to build that index offline before
you deploy that feature in your application.
Queries are limited to a constant number of operations that will scan a contiguous
range of one or more tables on a machine. If you're trying to add a new query or
new application feature, you already know what the -- I'll give you an example of
what a consistency constraint might look like on the next slide, but given the
specified constraints and the existing indices, would you be able to support that
query under those -- with high availability under that consistency guarantee. If
the answer is no, then you need to come up with additional indices so that you can
do it in constant time.
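A toy sketch of a query layer that can say no, with made-up structures rather than SCADS itself: a query is admitted only if a precomputed index covers it and it carries an explicit cardinality bound.
```python
class NotPerformanceSafe(Exception):
    pass

# Indices that have been materialized offline, keyed by (entity, attribute).
materialized_indexes = {
    ("friendship", "user_id"),          # friends of a given user
    ("status_update", "author_id"),     # a user's recent updates
}

def check_query(entity, lookup_attr, limit):
    if (entity, lookup_attr) not in materialized_indexes:
        raise NotPerformanceSafe(
            f"no index on {entity}.{lookup_attr}; materialize one offline first")
    if limit is None:
        raise NotPerformanceSafe("query must declare a cardinality bound (LIMIT)")
    return f"range scan on index ({entity}, {lookup_attr}) limited to {limit} rows"

print(check_query("friendship", "user_id", limit=50))        # allowed
try:
    check_query("status_update", "topic", limit=50)          # no such index: rejected
except NotPerformanceSafe as e:
    print("rejected:", e)
```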
A consistency specification might look something like the following, where I'll
specify an obvious thing like a read SLA over some percentile of the requests. I
might want to specify what the staleness window is. And as far as conflicting
writes, I could have a policy like last write wins, or I could ask for writes to
be merged. And I'd like to have a session guarantee to read my own writes. So even
if writes take some finite time to propagate, I don't have a confusing model of not
being able to read back what I believe I just wrote.
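One way such a declarative specification could be written down -- the field names here are illustrative, not the SCADS syntax:
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConsistencySpec:
    read_latency_ms: int          # read SLA...
    read_percentile: float        # ...over this fraction of requests
    staleness_window_s: int       # how stale served data may be
    conflict_policy: str          # "last_write_wins" or "merge"
    read_your_writes: bool        # session guarantee
    on_violation: str             # "serve_stale" or "return_error" if it can't all hold

news_feed_spec = ConsistencySpec(
    read_latency_ms=100, read_percentile=0.99,
    staleness_window_s=300,               # friends may see my update minutes late
    conflict_policy="last_write_wins",
    read_your_writes=True,                # but I always read back my own write
    on_violation="serve_stale",
)
print(news_feed_spec)
```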
Now, I think the question that was raised is what if it turns out at run time that
you can't have all of these things at the same time. For example, if there's a
transient network partition, you have to choose between delivering the stale data
or just saying no. If you don't have sufficient resources to meet the SLA, again,
you might have to return stale data from another partition.
So in any of these cases, the idea is that the developer would prioritize whether
returning something, even if it was stale, would be more important, or whether it's
preferable to just return an error. You know, in some sense, whenever you do an
availability/consistency trade-off, you always find yourself having to have the
contingency case, where what if you can't -- if you can't meet the trade-off that
you specified, which one do you say is going to win? Yeah?
>>: [Inaudible] I may not have any idea of what I can actually achieve. So is the
model I cannot specify the ideal world and something breaks and I back up and I say
oh, maybe.
>> Armando Fox: That's a good question. I think what the model will end up being
is, as we start to get a little bit of experience with the observed performance of
the system, you could -- as a developer, you could go through a very short iterative
process where just using the models that are in the online running system, you say
I want to do the following queries. And basically, the answer would be, well, this
update type that you want to do is going to lead to all these other cascading updates,
and you need to essentially rank order which ones are more important. If you do
that, I'll tell you what the worst case bound is by the time the last one will happen.
It will be a little bit like rank choice voting.
Now, I think the question you're getting to is how much knowledge do we need to give
the developer, and what form would the knowledge take before they can write down
any kind of consistency policy in terms that they understand. To be perfectly
honest, we don't know the answer to that yet. Initially, the developers will be
the students who are building the systems. So they'll just know.
There's an interesting open question is how do you know what's a realistic starting
point to ask for. And if what you're asking for can't be met, what hints does the
application give you back to tell you where you have to relax? And we don't know
the answer to that yet.
>>: Can the system tell me what things, what to change about my system in order
to meet these guarantees? It's one thing to say no, you can't do it. What would
be more useful is to say you can't do it now, but if you do XYZ, then you can achieve --
>> Armando Fox: Right. So again, the honest answer is we don't know. What I can
tell you we would try, which is in the earlier part of the talk when we were talking
about mining console logs, we made this distinction between these black box models
that just tell you normal/abnormal, versus inducing something like a decision tree
that tells you why something is normal or abnormal. You can interrogate how it made
the choice.
The hope is to be able to apply some process like that to the models that we induce
for this. And hopefully, that would be able to tell you, you know, here's the point
at which I decided you couldn't do it, right, and here's the reason why. But I'd
be lying if I said that we had a better handle on it than that right now. We don't.
Yeah?
>>: [Unintelligible].
>> Armando Fox: Basically, just choosing -- limiting how many results could
possibly be returned from any given query. So if it's a query that could possibly
return an unbounded number of results, you'll be hard-limited to the first K results, where K is probably a deployment-time constant. You can always ask for the
subsequent K results sometime after that.
>>: [Unintelligible] but in order to produce that one number, I need to look at,
you know, a humongous number of outputs.
>> Armando Fox: So the --
>>: [Inaudible] can be much lower.
>> Armando Fox: That's why the only queries that you are allowed to ask are ones
for which you have already materialized an index to make it a lookup with bounded
range. What that means is if you do an update that could cause a change in the
hypothetical index -- well, actually, let me get to the -- I think the next slide
may answer your question. If it doesn't, we'll come back to it.
So the way that updates are handled in this world is quite different, right. If you're not allowed to essentially do arbitrary scans to perform a query, right, which is exactly what we're trying to avoid, then the alternative is that every time an update occurs that might affect multiple other queries, each of which has its own index, you now have to incrementally propagate changes to all indices.
That's a fundamental difference in this approach versus what I think you described, which is that every time we do an update that could result in cascading updates, we essentially queue those cascading updates as asynchronous tasks. And each node that has a set of updates to perform has them ordered in a priority queue. The priorities in those queues are based on how important each one of those things is in terms of the developer's consistency specification.
So relatively unimportant operations -- in other words, ones for which I can tolerate much, much weaker consistency, where I could wait minutes for the updates to happen -- will actually float to the bottom of the priority queues. And the priority queues are what the modeling is based on.
So every answer is precomputed. Any answer that might be affected by an update is
incrementally recomputed. Unfortunately, I'm not going to have time to do a SCADS
architecture walk through. One of the features is every time you do an update,
there's an observer that figures out which index functions have to be rerun to incrementally update some other index that is affected by what you just did. So we have this -- the work queue is what we do the modeling around.
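A rough sketch of that observer-plus-work-queue arrangement might look like the following (Python; `IndexObserver`, `UpdateQueue`, and the priority values are invented for illustration and are not the SCADS implementation):

```python
import heapq
import itertools

class UpdateQueue:
    """Priority queue of pending index-maintenance tasks.

    Lower priority numbers drain first; the priority comes from the
    developer's consistency specification for the affected query/index.
    """
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order for equal priorities

    def enqueue(self, priority, task):
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def drain(self):
        while self._heap:
            _, _, task = heapq.heappop(self._heap)
            task()

class IndexObserver:
    """Watches base-table writes and re-runs the index functions they affect."""
    def __init__(self, queue):
        self.queue = queue
        self.index_fns = []   # (priority, predicate, index_fn) triples

    def register(self, priority, affects, index_fn):
        self.index_fns.append((priority, affects, index_fn))

    def on_write(self, record):
        for priority, affects, index_fn in self.index_fns:
            if affects(record):
                # Defer the incremental index update as an asynchronous task.
                self.queue.enqueue(priority, lambda r=record, f=index_fn: f(r))

# Usage sketch: a birthday index that only cares about birthday changes.
queue = UpdateQueue()
observer = IndexObserver(queue)
birthday_index = {}
observer.register(
    priority=5,   # relatively unimportant: larger numbers drain later
    affects=lambda rec: "birthday" in rec,
    index_fn=lambda rec: birthday_index.__setitem__(rec["user"], rec["birthday"]),
)
observer.on_write({"user": "alice", "birthday": "03-14"})
queue.drain()
print(birthday_index)   # {'alice': '03-14'}
```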
>>: [Unintelligible].
>> Armando Fox: Yes.
>>: [Unintelligible].
>> Armando Fox: Taking away the beauty of SQL, but we are replacing it with the beauty of scale independence. So --
>>: [Inaudible].
>> Armando Fox: In an ideal world, we'd like to have both, right. But I think what
we've kind of -- the lesson that's been learned the hard way over the years is it's
very difficult, as you get to larger and larger scale, to preserve all the beauty.
We had to pick which beauty we wanted. You're absolutely correct.
In doing that, you could argue, you know, to take a devil's advocate position, we're undoing a lot of what SQL does in masking those operations from us, right. Whereas with SQL you can declaratively say what you want and someone else figures out how to do it, we're going back to a scenario where the sequence necessary to do an update becomes significant.
That's the most effective way to expose this to the developer so they will understand
the impact of saying this is more important than this. It means that there's a
priority queue where this thing is going to get done first. What the modeling does is quantify how long it will take things to get done. Under different workload
conditions, the same priority queue could have very different performance models
associated with it. As the developer, you don't really know what that's going to
be. All you can say is the relative importance of different things. And our job
is to try to keep the system in a stable operating state by using the model to
deploy more machines as needed.
Okay. So obviously, from the point of view of what do you do about datacenters,
with an S plural, if you really have a model where you're talking about relaxed
consistency to propagate updates across different indices, in some sense, dealing
with cross-datacenter data storage becomes a special case of that.
In fact, this is how Facebook works now, except they've had to sort of build their
own ad hoc solution around it, because they didn't have anything off the shelf to
start from. I mentioned that in order to keep this sort of tractable to developers,
we can present session guarantees such as read your own writes or monotonic reads
only. Those things can be implemented largely on the client side. It's actually
Bill Beloski [phonetic] who came up with the needed improvement that I think will
make the programming model tractable for mere mortals. Yes?
>>: You said a single update [unintelligible].
>> Armando Fox: Right.
>>: Does that mean -- if we take Facebook for example where I change my status and
all my friends are supposed to see it, one way of doing that is to basically fan
the thing out at write time. I fan out the status change to my friends' news feeds.
>> Armando Fox: Right.
>>: Are you saying another way of doing it is basically to just update my status,
and then when a friend logs in and updates the news feed, they basically pull the
feed?
>> Armando Fox: We're saying to do it the first way, that you do it eagerly, but
depending on how important the developer has said that this operation is, it may
be that the updates associated with doing that are not prioritized particularly high
in the priority queues. So it may actually take a while to do. Now, I think if
you're concerned about the O(1) part, this is actually true of Facebook now, except it's really O(5,000). They have a hard limit on how many friends you can have. They have hard limits in a number of other applications. Things like Causes, they limit the number of invitations you can send out per day and stuff. They have hard-wired things built in all over the place.
I think over time, we may be able to do better than saying any given operation triggers
a constant number of writes. I think what we may get to is if you do O(N) operations, you'll get O(N) writes, which is actually not the same thing. O(N) operations implying O(N) writes means you can have a few groups that a lot of people are interested in, and you could have a few people who are interested in a lot of groups. But you can't have a lot of people all interested in a lot of groups. But that's
far future.
Okay. Actually, we're doing okay on time. So I'm getting ready to wrap up anyhow.
Well, I won't have time to go through a detailed example, but the idea is if you
wanted to support in constant time queries like who are my friends, who are their
friends, and which of my friends have birthdays coming up soon, you essentially have
to maintain separate indices to answer each of these questions. And if I add a
friend, then potentially all of these indices have to change. If I change my
birthday, a subset of the indices would have to change. And those are the discrete
operations that go into the priority queues. So in the interest of time, I'm not
going to go into the example in any detail. Yeah, because I want to leave more time
for discussion, actually.
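Though he skips the details, the separate indices and the discrete update operations in that friends-and-birthdays example could look roughly like this (a toy Python sketch with invented names; in SCADS these index updates would be enqueued as prioritized asynchronous tasks rather than run inline):

```python
from collections import defaultdict

# One materialized index per query we want answered in constant time.
friends_of = defaultdict(set)          # "who are my friends"
friends_of_friends = defaultdict(set)  # "who are their friends"
upcoming_birthdays = defaultdict(set)  # "which of my friends have birthdays soon"
birthday_of = {}                       # base data

def add_friend(user, friend):
    """One logical operation fans out into updates on several indices."""
    friends_of[user].add(friend)
    friends_of[friend].add(user)
    friends_of_friends[user].update(f for f in friends_of[friend] if f != user)
    friends_of_friends[friend].update(f for f in friends_of[user] if f != friend)
    if friend in birthday_of:
        upcoming_birthdays[user].add((friend, birthday_of[friend]))
    if user in birthday_of:
        upcoming_birthdays[friend].add((user, birthday_of[user]))

def change_birthday(user, new_birthday):
    """A different update touches only the birthday-related index."""
    birthday_of[user] = new_birthday
    for follower in friends_of[user]:
        upcoming_birthdays[follower].add((user, new_birthday))

add_friend("alice", "bob")
change_birthday("bob", "07-01")
print(upcoming_birthdays["alice"])   # {('bob', '07-01')}
```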
We're implementing this over Cassandra, which is an open source clone of BigTable that's been developed by Facebook and put into the Apache Incubator by them. We expect to have the indexing system and simple modeling of Cassandra doing reads and writes all ready by January, and we are deploying it as a service on EC2. We
got a bunch of Amazon funding money to do that, which is very nice.
We also plan, I think more interestingly, to address the question of can developers
understand and use this model. And one way that we're going to do that is by seeding
it to our undergraduates in these very successful courses that we've done.
Those of you familiar with Ruby on Rails are probably familiar with Active Record, which is the object-relational mapping layer that it provides. Basically, that
layer allows you to talk about hierarchical relations among collections of things.
And there's a natural syntax for annotating those relationships with, for example,
things like consistency guarantees.
So our goal is to have a near drop-in replacement, at the source level, for Active Record that would replace the SQL mapping layer that's built into Rails with a SCADS mapping layer but provide very similar APIs. By the way, this is, I think
we realized in the last few months, a non-obvious benefit of using something like
Rails, which is strictly higher level in terms of its data abstraction model than
PHP.
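A hypothetical flavor of what annotating relationships with consistency guarantees might look like in such a mapping layer, in the spirit of Active Record but mocked up in Python rather than Ruby (none of these names come from the actual Rails or SCADS APIs):

```python
# Hypothetical declarative mapping layer where each association carries a
# consistency annotation instead of relying on an implicit SQL join.

class Model:
    """Placeholder base class standing in for the ORM machinery."""
    pass

def has_many(target, staleness_s=0, session="read_your_writes"):
    """Declare a one-to-many association plus its consistency requirements."""
    return {"target": target, "staleness_s": staleness_s, "session": session}

class User(Model):
    # Friend lists may lag by up to a minute; sessions still read their own writes.
    friends = has_many("User", staleness_s=60, session="read_your_writes")
    # Status updates tolerate even weaker consistency.
    statuses = has_many("Status", staleness_s=300, session="monotonic_reads")

class Status(Model):
    pass

print(User.friends["staleness_s"])   # 60
```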
In PHP, you have to write your own SQL queries, for the most part. There are
various libraries out there that you can use, but the idiom for PHP applications
is that you figure out your schema, and you figure out what your queries are going
to be. Whereas the common practice for Rails is you figure out what your object
graph is, and Rails actually takes care of the schema. The most common way it does
that is by using join tables and conventional SQL things, but there's nothing
saying you couldn't rip out that layer and replace it with a different layer that
still lets you express the object graphs, but implemented over a completely different
data store. So, in fact, we're hoping to seed this to undergrads in the next iteration
of this course, which is going to be next fall.
So, you know, in terms of where do we see SCADS going, one of the things that we
got early on from our Microsoft colleagues is understanding who your developer
persona is. We've decided that the development target for a lot of these is Elvises, you know, people who are good programmers. They're not necessarily the Stallmans of the world, but they have domain knowledge. They're competent in sort of the current programming tools.
And our stand-in for Elvis is undergrads in our Ruby on Rails courses. Whether these developers can understand and use this relaxed consistency data model and API, well, we're going to hopefully by next fall be able to give them the APIs and have them use it in their class projects instead of using MySQL, which is what they're using now. And because we have funding money for EC2, we can actually deploy quite large-scale versions of these applications, even if it means driving them with
synthetic workload, to evaluate, in fact, whether or not at scale we can make that
variable consistency mechanism truly work. That's kind of where SCADS is going.
So let me kind of sum up with another picture of the datacenter and where these
different technologies have figured into it. Scale-independent data storage, this is kind of, in our view, one of the big missing pieces that you cannot just take off the shelf. If you want to scale the horizontally scaling part of your app, the stateless part, you can take that off the shelf with virtual machines, with a LAMP stack or your favorite other stack. You can take HAProxy for load balancing off the shelf, memcached off the shelf. Everything is off the shelf, but storage is still really elusive. There's no off-the-shelf, really scalable storage that is as elastic as
the rest of the application, and SCADS is hopefully intended to fix that.
The Director is a way of making machine learning technology available as a service so that we have the opportunity of systematically applying it to do things like resource prediction in these very complex apps. And the hope is that just as it has been part of understanding how to scale the simple parts of the app, it will be able to tell us how to scale SCADS, which is the more complicated part of the app.
And we'll stick with using these very high productivity, high level languages and
frameworks, not just because we believe that with machine learning we can identify
where that productivity tax goes, but also because by keeping the developer at that
higher level, I think it gives us the opportunity to play around with alternative
data models. I think SCADS is one of a number of different data models that are
going to be prominent in large scale applications.
You know, we sort of look forward to seeing what the other ones are, but I think
the hidden benefit that we didn't see of these very high level languages is to the
extent that they insulate application developers from the details of the data model,
that actually makes it easier for us as researchers to experiment down in that space
and still have it be useful to developers, right, where they might actually be able
to apply their existing skills and not have to rewrite their entire app from scratch,
which is what's been happening in practice with these deployed applications. Yes?
>>: So --
>> Armando Fox: I hope you're in one of my afternoon meetings, by the way. You're asking all these great questions.
>>: No, I'm not, so I'll try and get in my questions now. So the underlying model
for SCADS seems to be that data is always stored persistently. Might be a little
bit out of date, but it's always stored persistently. You could also -- I mean,
the missing bit here seems to be data that is cached. I might have a semantic in
my data that says I have five datacenters. In three of them I have data stored
persistently, but in two of them all I have is a cache. That might be some annotation
I might want to put on my data.
>> Armando Fox: So you're talking about garden-variety caching where you expire things on demand, and you can always regenerate them? That's what you're talking about?
>>: Yes, you can -- seems like things like that are missing from SCADS right now. Did you consciously reject the notion of caching?
>> Armando Fox: Reject is such a strong word.
>>: [Inaudible].
>> Armando Fox: To keep the project in scope, no, we did choose to omit it from
SCADS for two reasons. One of them is we really want to focus on does this plan
for doing variable consistency, is it really going to work. So part of it really
is to keep the project scoped.
The other part is that, you know, again, as currently a Rails programmer, I've done some PHP. I've done whatever the flavor of the month has been for the last few years. The framework actually has framework-level abstractions for caching that operate on framework-level objects.
So, for example, Rails has abstractions for page caching, caching of a fragment of a page, caching the result of a particular query, and it's actually got a very well-developed set of abstractions for dealing with that. And you can use memcached, you can cache in disk files, you can cache in RAM. So in some sense, putting caching in SCADS, given the way we want to deploy it, probably would have been redundant. And all of the caching features in Rails are independent of the underlying data store. They're based on the abstractions at the framework level.
So that's the practical reason we did it.
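The point about framework-level caching being independent of the data store can be sketched generically like this (Python; a stand-in for what Rails provides with page, fragment, and query caching, not actual Rails or memcached code):

```python
import time

class FragmentCache:
    """Tiny in-process cache keyed by fragment name, with expiry.

    The application framework decides what to cache and for how long;
    the underlying data store (SQL, SCADS, anything) never sees these reads.
    """
    def __init__(self):
        self._store = {}   # key -> (expires_at, value)

    def fetch(self, key, ttl_s, compute):
        now = time.time()
        hit = self._store.get(key)
        if hit and hit[0] > now:
            return hit[1]                 # serve from cache
        value = compute()                 # regenerate from the data store
        self._store[key] = (now + ttl_s, value)
        return value

cache = FragmentCache()
profile_html = cache.fetch("profile:alice", ttl_s=30,
                           compute=lambda: "<div>...rendered from the data store...</div>")
```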
>>: [Inaudible] relationship between the caches in Rails and your persistent data
in order to get a full view of really what your consistency semantics are?
>> Armando Fox: We'll see to what extent that turns out to be the case. Basically,
as long as each copy of the Rails application is essentially -- so if you've got
100 copies of the app server, and SCADS is behind them, but each copy of the app server
essentially believes it's managing its own caching, and if you can make reasonable
assumptions about affinity with cookies and stuff, I think the caching that's built
in with Rails will get us most of the way there. I think in the long-term, you're
probably right. I think any storage framework at this level will have to at least
make some caching decisions visible to the application stack. For the moment, we
think we can probably get away without doing that and hopefully still be able to
demonstrate something interesting about the data model. But in the long term, from a more practical point of view, I think you're right.
I think officially, yeah, I think officially that's it. In the event that people
want to actually read more details about any of the stuff I talked about, I think
these are all of the papers that were -- all of the slides and graphs and things
came from one of these five papers. And then in case I missed any, I have my home
page up here. But that's kind of all I officially have. The questions have been
so good, I hope we still have more time for those. So thanks. You all are great,
and let's talk about more stuff.
[Applause]
>> Armando Fox: Clap for yourselves, you're a great audience. Go ahead.
>>: So when we talked about kind of replicating and de-replicating data, you
mentioned that you were working on basically machine learning models that figured
out [inaudible], right? Now, how long-term are those models? Because you could
imagine that depending upon the amount of data being replicated, it might make sense
to just leave data around, even if it's not being used for a number of days. Even
if I have a kind of daily pattern of usage that varies, I might not want to take the hit of, at peak time, replicating a bunch of servers and taking the overhead of dealing with consistency issues.
>> Armando Fox: Right. I can probably generalize your question a little bit by saying very often there are essentially stable patterns over longer time periods. If you know what they are, you don't really need machine learning for that, right. If
you know what your stable pattern is going to be, you have some component of your
provisioning that's just going to make sure that you hit your stable pattern, and
then you use machine learning to deal with the unexpected variation. Right?
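That split between baseline provisioning for the known pattern and machine learning for the residual might look, very crudely, like this (Python; the diurnal profile and per-server capacity are made-up numbers):

```python
# Illustrative split between baseline provisioning and a learned correction.
HOURLY_BASELINE_REQS = [
    200, 150, 120, 100, 100, 150,       # requests/sec by hour: the known daily pattern
    400, 800, 1200, 1400, 1500, 1500,
    1400, 1400, 1300, 1300, 1400, 1600,
    1800, 1700, 1400, 1000, 600, 300,
]
REQS_PER_SERVER = 100                   # assumed capacity of one app server

def baseline_servers(hour):
    """Provision for the stable, well-known part of the workload."""
    return -(-HOURLY_BASELINE_REQS[hour] // REQS_PER_SERVER)   # ceiling division

def extra_servers(observed_reqs, hour, model):
    """Let the learned model handle only the unexpected deviation."""
    residual = observed_reqs - HOURLY_BASELINE_REQS[hour]
    return max(0, model(residual))

# Trivial stand-in for a trained model: one extra server per 100 surplus req/s.
toy_model = lambda residual: -(-residual // REQS_PER_SERVER) if residual > 0 else 0

hour, observed = 9, 1900    # a flash crowd above the usual 1400 req/s at this hour
print(baseline_servers(hour) + extra_servers(observed, hour, toy_model))   # 14 + 5 = 19
```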
>>: [Inaudible] between, in some sense, kind of having a bunch of idle servers;
i.e., not using up all the CPU, versus other trade-offs like using network bandwidth
to replicate data that you only need for a certain amount of time, right? So there
are multiple resources that you need to trade off. Just looking at CPU utilization
seems like kind of a subset of the space.
>> Armando Fox: Right, well, and I'm not claiming that's the only thing you're
looking at. In general, there's many dimensions of trade-offs one could make. I
think the thing that is going to force the issue on that is that, you know, Amazon
is the first sort of widely buzzed consumer version of cloud computing that seems
to have had some impact, has actually put a price on those trade-offs. You know
exactly what it costs per gigabyte of network bandwidth, per CPU hour, per
everything. But we don't have a plan yet for saying how we're going to co-optimize
all those things. Hopefully, if we can just get the simple case working initially,
like in the next six months, I think we'll have made a step in the right direction.
Yeah?
>>: [Inaudible].
>> Armando Fox: Are you in one of my afternoon meetings?
>>: No.
>> Armando Fox: Okay, too bad.
>>: [Inaudible] use for any operational system. It's used for a variety of other apps, like for redaction and whatnot. But [inaudible] resource management and so on. Are you aware of operational systems? This is more for my information.
>> Armando Fox: Am I -- well, I would guess, okay, so I'll give you two answers.
I would guess that that's probably happening at Google, but no one will actually
tell us that. I suspect more of their machine learning is really put into the
functionality of the applications, but I know they have used at least some machine
learning techniques even in an offline sense to analyze performance bugs and things
like that.
Also, you know, kind of closer to home, we've been working -- through Microsoft Silicon Valley, we've been working with the Microsoft instant messenger team, and
we have actually used some machine learning to help them find what may be performance
bottlenecks in IM during their stress testing. So an answer soon might be it's being
used by some Microsoft properties. That would be all to the good, we assume.
>>: [Inaudible] you know, what happens in a [unintelligible], let's say, you know, I am [unintelligible] one specific thing, this is likely to be much more [unintelligible]. Let's say you're writing on top of SimpleDB or similar data services or a general platform, you know, the problem there is that there are going to be multiple consumers of this, and the assumption of steady state requires more justification there. So you use your Director to take --
>> Armando Fox: We're exactly not assuming steady state.
>>: No, we're not, but then the issue of robustness becomes much more critical,
because you're making important decisions, getting something, building an index,
and some of us [unintelligible] yes, you can back up, but there's a price. So it
almost becomes a key issue, and that's really what was my, you know,
[unintelligible].
>> Armando Fox: As I said, I think philosophically, we're in agreement. There will
always be unforeseen cases; and yes, you are trusting a machinery whose robustness
may not be fully tested.
There's no guarantee --
>>: There's no one app running against it. That's a critical difference from a bunch
of successful scenarios where machine learning has been illustrated for us.
>> Armando Fox: That's true. But at the same time, I would say that's even true
now in production systems where machine learning is not used. So, you know,
Amazon -- Werner gave this keynote at the cloud computing thing where they know
their diurnal variation to such a fine degree that they use that as their main
indicator of whether something might be wrong with the site.
So basically, if the order rate goes outside of a predefined envelope by more than
a threshold amount, it's a very likely signal. And they have extremely detailed
models of this. You don't need machine learning to make those models. They had
incidents where they thought their service went down, but it had not, because of essentially external conditions they had never modeled before.
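The envelope check he attributes to Amazon is simple to sketch (Python; the numbers and threshold are invented, and this is a paraphrase of the idea, not their system):

```python
def outside_envelope(observed_rate, expected_rate, tolerance=0.2):
    """Flag the metric if it deviates from the modeled diurnal value by more
    than a threshold fraction, in either direction."""
    lower = expected_rate * (1 - tolerance)
    upper = expected_rate * (1 + tolerance)
    return observed_rate < lower or observed_rate > upper

# e.g. the order rate expected at this hour is 1000/min; 700/min would alarm.
print(outside_envelope(700, 1000))   # True  -> likely something is wrong
print(outside_envelope(950, 1000))   # False -> within the normal envelope
```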
>>: [Unintelligible] for that application of machine learning, too, is just fine. When you're doing online intervention to move around with subsets, that's where, you know, I have concerns. I have no concerns if they're using these kinds of models to detect possible cases where, you know, [unintelligible] out of the loop. Resource management directed by the machine, that's where -- again, I'm not saying that's not possible. I'm saying --
>> Armando Fox: I know, it's a legitimate risk. There's a bunch of other work we
have done that I didn't talk about at all that has to do with combining machine
learning results with visualization. So that the operators see: this is what the algorithm found, here's what it would have done, summarized in kind of one picture.
The hope would be that over time the operators can either correct its mistakes or
at least come to trust that when it's put in control, it will do the right thing.
I think that's a necessary but not sufficient condition. But I don't know what the
end game of that is. I think you're right. Philosophically, we're in agreement.
Yes?
>>: [Inaudible] model online in real time, do you allow for addition or subtraction of features on the fly?
>> Armando Fox: Not so far, but we haven't ruled it out. We just hadn't thought
of it until just now when you said it. No. Was there something specific that made
you ask that?
>>: [Inaudible] unforeseen impact of features that, you know, in the first model
were not actually identified.
>> Armando Fox: So I guess it depends on what you mean by considering new features.
The kind of template for most of these methods is you have a candidate set of features
that you can extract, and then you have one or more feature selection steps where
you end up deciding which ones are going to be included in your model.
Now, certainly, it's the case that when you're building models online, the set of features selected as important in the model at any given time is probably going to be different, you know, depending on what the external conditions are, but we haven't thought about any approaches where you would add new candidate features online. I guess to me, that
would be like you're going to come up with a new modeling technique. And when you
deploy that new modeling technique, it will include new candidate features you hadn't
used before. But I guess for some reason, I didn't think of that as an online thing,
but I guess it could be.
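The candidate-features-then-selection template he describes might look like this in miniature (Python; the crude correlation-style filter stands in for whatever selection step a real pipeline would use, and the data is made up):

```python
# Extract a fixed candidate feature set, then keep only the features a
# selection step deems useful. Data and threshold are invented.

def extract_candidates(event):
    """Candidate feature set is fixed ahead of time; selection happens later."""
    return {
        "msg_length": float(len(event["message"])),
        "error_flag": 1.0 if "ERROR" in event["message"] else 0.0,
        "thread_count": float(event["threads"]),
    }

def select_features(rows, labels, threshold=0.5):
    """Keep features whose class-mean separation clears a threshold."""
    kept = []
    for name in rows[0].keys():
        xs = [r[name] for r in rows]
        mean_pos = sum(x for x, y in zip(xs, labels) if y) / max(1, sum(labels))
        mean_neg = sum(x for x, y in zip(xs, labels) if not y) / max(1, len(labels) - sum(labels))
        spread = max(xs) - min(xs) or 1.0
        if abs(mean_pos - mean_neg) / spread >= threshold:
            kept.append(name)
    return kept

events = [{"message": "ERROR disk full", "threads": 90},
          {"message": "requests were ok", "threads": 12},
          {"message": "ERROR timeout", "threads": 85},
          {"message": "ok", "threads": 15}]
labels = [1, 0, 1, 0]    # 1 = abnormal period
rows = [extract_candidates(e) for e in events]
print(select_features(rows, labels))   # ['error_flag', 'thread_count']
```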
>>: [Inaudible].
>> Armando Fox: Well, part of the goal of the plug-in architecture with the Director
is to be able to add them relatively fast. You know, fast meaning you've qualified
them offline, right, and now they're ready to go into the production system. So
at least we would have the machinery in place that would hopefully let you do that.
But we haven't thought of doing it ourselves yet.
>> Jim Larius: Okay. Let's thank Armando.
[Applause]
>> Armando Fox: Thank you, guys. What a very interactive audience.