>>: So I just got the signal we can go ahead and get started. It's my pleasure to
introduce virtually Ken Birman from Cornell University who will be talking about
consistency options for replicated storage in the cloud.
Ken, it's all yours.
Ken Birman: Thank you very much, Roger.
And I want to apologize to the audience for not being there personally. I would've really
enjoyed the conference, but this is the best I could do, as it turned out, what with U.S. Air
and the weather yesterday.
So I want to talk about the challenge of replicating data in cloud settings, and in
particular about a premise that's been pretty widely accepted, articulated by Eric Brewer,
who argued that when we build systems that need to be highly available, there's a
tradeoff between consistency guarantees and partition tolerance, which has been taken
much more broadly as a tradeoff against performance.
And so the CAP theorem, as he named it, is basically the claim that we must abandon
consistency in order to build large-scale cloud computing systems.
I want to question that and argue that that might not really be the case, and maybe
we've been a little too quick to abandon consistency.
Before I do that, I'll say, though, that it's a widely accepted assumption about cloud
computing. So, for example, if you attended LADIS, the cloud computing workshop held
two years ago, the architect of the eBay system gave five key recommendations for how
to guarantee scalability, and the fifth one was to embrace inconsistency.
And that's actually the one he talked about most.
Werner Vogels used to work with me here at Cornell, and when he went to Amazon, the
first job he had was to clean up a huge scalability problem that the company was having
of fluctuations in load within their cloud computing system.
He tracked it down to a reliability mechanism in their publish-subscribe architecture that
was being used to replicate data, and he basically stamped out reliability, switching to
something slower that doesn't necessarily make such strong guarantees. That solved the
problem, and he was later quite proud of it. He said, look, the kind of reliability that was
being guaranteed there -- and it wasn't a strong kind -- was nonetheless in the way of
scalability, very much in the same sense as Eric's point.
And James Hamilton, who was one of the architects of Azure, now at Amazon, but at
the time was at Microsoft, he gave a great talk on this. And he basically said that
consistency is the biggest [inaudible] to scalability in cloud computing systems. He was
talking mostly about database consistency, but he said as far as he was concerned,
the right way to handle this is to create a committee and tell people who want to build a
mechanism that involves consistency to first get approval from the committee.
And here you've got a picture of the meeting the last time they met. And his sense was
that as long as the next time they meet is about as far out as the last time, that people
would get the point and that they'll find some other way to build their apps. And people
are doing that.
So what I want to do is spend a minute now talking about what I mean by
consistency -- what this term is, and why it matters -- and then ask whether we can actually
have consistency and scalability too.
So I'm going to use this term to refer to situations in which something is replicated but
behaves as if it were not replicated. So a consistent system for me looks like a single fast
server running at Amazon or at Microsoft or whatever, but in reality, it's made of very
large numbers of moving parts.
And so we could draw pictures of that. And I'm going to do that in a second. Some
examples of things that have consistency guarantees are transactions on replicated
data. When we shard data in contemporary cloud computing databases, we're
basically abandoning transactional properties at the same time; that's how these
sharded systems are built. So transactions are an example of a consistency
mechanism. Cloud versions of large-scale replicated data, by and large, avoid the use
of transactions. Not always, of course. And don't take what I'm saying as an always
story, but I mean most of the time.
Atomic broadcast is another. And, of course, locking -- locking through Paxos [inaudible] -- is a major
mechanism in cloud computing systems. Nonetheless, a huge effort is made to avoid
using locking because it's a consistency mechanism of the type perceived as
destabilizing.
Here's the replicated data picture I was going to make before. You can think of it as
locking if you prefer. So an example of a consistency property would be the following.
Suppose that I told you that I built a data center in which it looks as if the patient's
medical records are updated on a single server. And here's a picture of that happening.
So the timeline is going from left to right and the little blue stars are places where
updates occur.
And then I told you actually I built it as a cloud computing system and I spread that
service out on a couple of nodes, but although they're executing separately and in
parallel, the execution really behaves just as if it had been the original reference
execution. We call that a synchronous execution.
You can see that I've come up now with timelines, multiple processes, five in this
example, two of them fail along the way, but if you look at the actual events, everybody
sees the same events as in the top picture, the state is the same, it's indistinguishable.
Okay? Paxos works that way.
And here's virtual synchrony. This is the model that I happen to be fond of because I
invented it and it's fast and I've always been sort of a speed demon.
Virtual synchrony gives executions which, if you look at them in a semantic sense, are
indistinguishable from the synchronous ones, which in turn are indistinguishable from
the reference runs. So virtual synchrony weakens ordering and weakens certain other
properties, preserves, though, the guarantee that what happens in the distributed cloud
system is indistinguishable from the reference execution.
So my goal in the rest of this talk is to make it possible to use executions like this and to
respond to the concern that this can't scale.
Now, why do people fear consistency? Why do they think that consistency is such a
dangerous thing to have? The main concern is that consistency is perceived as a root
cause of the problems that places like Amazon had. And actually it goes way, way
back. If you had been in the field for as long as I have, almost 30 years now, banks
even 15 or 20 years ago were afraid of this phenomenon where they had trading floors
that would be destabilized.
There's a picture of the type of problem that Amazon was suffering from at the time that
Werner went there. Not anymore. And what you can see is that when they measured
message rates in the background on their cloud platforms, they were oscillating
between saturating the network -- that's at the top -- and dropping to zero over a
significant period of time.
And if you looked at exactly what was happening, they had decided to use a
publish-subscribe message bus very aggressively, and this particular bus guaranteed delivery,
and when the system loaded up enough, it started dropping packets, but because it had
to guarantee delivery, it would retransmit those packets which created additional load
-- essentially putting reliability ahead of scalability. The extra load caused a
complete collapse, and that's why it went down to zero. After about 90 seconds, that
particular system would give up, which meant all the load drained away and the cycle
would repeat.
So you can understand why people would be afraid of this. If you imagine your data
center -- you know, twelve times the size of a soccer field -- completely destabilized in this way, it's a
pretty frightening prospect.
Now, on the other hand, there are dangers of inconsistency. So if we start to move
mission-critical applications to the cloud, that's going to include banking applications,
medical care applications. Microsoft has a big commitment to moving that direction.
Google does as well. And if you take medical care records, those aren't going to be just
your doctor's records, they're going to include real-time data coming from blood sugar
measurements that are going to be turned around and used to adjust insulin pumps.
That will happen for people who are at home and who aren't capable of doing it
themselves. And because of the efficiencies of cloud computing, it'll be on a scale
where people couldn't step in if it broke down, but obviously inconsistency is dangerous
in such settings.
So if we can't figure out how to reintroduce consistency in cloud environments, then
what we're doing is we're saying that we can achieve scalability, but we can't run those
types of apps. And I don't think we should accept this. And that's actually why I don't
think CAP should be viewed as a theorem. It's more of a rule of thumb that's worked
pretty well and gotten us pretty far.
Now, to reintroduce consistency, what we're going to need are a few things. First of all,
a scalable model. Many people would say, for example, that we should use what's
called state machine replication -- Paxos. Isis, the virtual synchrony model, would be
another option.
And we need to convince ourselves that that model itself is compatible with scaling, and
then we have to have an implementation of a platform that can scale massively in all
sorts of dimensions.
And so the rest of the talk, what I want to do is explain why I think this is a solvable
problem.
So I'm known in my career for building the Isis system originally. It was used in things
like -- the New York Stock Exchange ran on Isis for about a decade. During that whole
period you never read about a trading disruption due to technology in the stock exchange.
Yet that was a system with hundreds of machines, even in the early days. So why did it
never fail?
The answer is it experienced failures, but it was self-healing. It was using the software
I'm going to reinvent, in some sense, for cloud computing now. The French air traffic control
system continues to use that platform, as does the U.S. Navy Aegis, and there were a lot of
other apps that used it as well.
It didn't make a lot of money other than for me, but it did make some money, and it
certainly made some people happy -- in particular, stock exchange traders, for
example.
Now, the key to Isis was to support group communication like in the picture I showed
you earlier, and what I'm going to do now is suggest that the way to think about this
today is that these groups are really objects and that the programs I was showing, the
little processes with the timelines, had imported the objects much as if you had opened
a file.
So if you have a group of five processes like I showed you earlier, what that really is is
an object -- in some sense, shared memory among five processes. They float in
the network, they hold data when they're idle, and then you open them when you want
to use them. And I'm going to reincarnate this now. We'll call it Isis 2, okay?
So how would this look? Well, to the user, it will just be a library -- I'm just going to
show you very, very quickly the kind of thing I have in mind. So you basically create the
group, you give it a name, it looks like a file name, the file name is actually a real file
name, and it's where the state of the group is kept when nobody's using the group.
While it's active, the state of the group is in the applications using it. You can register
handlers, and then you can send operations like an update to those handlers. It's
polymorphic. This is all done in C#. The corresponding handler is called. I do type
checking. You can't actually join a group if you don't have the right interfaces, for
example.
And what will happen now at run time is that you can use this kind of mechanism to
program in what you'd call a state machine style with strong guarantees: even
though you're not thinking much about it, fault tolerance and consistency are
guaranteed by the platform. Security too. We'll talk about that in a second.
So here, for example, is somebody who queries a group and he wants the group to do
some work in parallel for him. And as you can see, he's asking for replies from all
members. Isis knows how to handle that. The operation is to look up somebody's
name. It's a pattern in this case. And when the replies come back, they're in an internal
form, and what's done is we turn that into a callback here to a routine called lookup.
And what happens on the server side is that the lookup routine is invoked, it does some
calculation, and then each of the members reply.
Here's a picture of how that might look. I've turned the timelines to run top to bottom
now. You've got a querying process on the left. It's
talking to this group of five processes which have imported that object. They may have
imported tens of thousands of other objects. It's not the only things those programs are
doing.
But, in any case, this particular set of threads receives the request, four parallel
executions occur, each of them computes maybe a fourth of the result, and then they
send the replies, and the replies are processed by callbacks. That's the idea. It's very easy
to program this way.
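A similarly hedged sketch of that query pattern -- again with invented Python names standing in for the C# API -- where the caller asks for replies from all members, each member computes its share of the lookup, and the merged result comes back through a callback:

```python
# Illustrative only: each member owns roughly 1/n of the data, like the
# "fourth of the result" in the picture, and the caller aggregates the replies.

class Member:
    def __init__(self, rank, n, phonebook):
        self.shard = {k: v for i, (k, v) in enumerate(sorted(phonebook.items()))
                      if i % n == rank}

    def lookup(self, pattern):
        return {k: v for k, v in self.shard.items() if pattern in k}


def query(members, pattern, nreplies, callback):
    # nreplies == len(members) corresponds to "replies from all members".
    replies = [m.lookup(pattern) for m in members[:nreplies]]
    callback(replies)


phonebook = {"alice": 1001, "alan": 1002, "bob": 2001, "carol": 3001}
members = [Member(r, 4, phonebook) for r in range(4)]


def on_replies(replies):
    merged = {}
    for r in replies:
        merged.update(r)
    print("matches:", merged)    # {'alice': 1001, 'alan': 1002}


query(members, "al", nreplies=len(members), callback=on_replies)
```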
The popularity of Isis in the old days was really because it was so easy to use. You
hardly needed any training to use this model.
So a group is an object. The user doesn't experience all the mess, and it's a lot of
mess, I can assure you, as you try to build these things, and the groups have replicas.
Now, what model are we going to use? In what I'm going to describe, I'm actually
merging the virtual synchrony model with the Paxos model. We did some work with
Microsoft Research, Dahlia [inaudible], and found that in fact we can build a super
model that subsumes the two and actually is faster than either of them and also cleans
up some problems that both models had.
I can say more about that in questions if people want to come back to it.
But Paxos had some issues, especially when it's dynamically reconfigurable. We were
able to fix those. Isis had some issues in the old days of virtual synchrony. We fixed
those as well. We have a paper submitted to PODC that I could share with people if
they're interested.
So here's the way the platform's going to look. It's going to have a basic layer that
supports large numbers of these process groups, these objects. Applications will join
them. The applications talk to a library. The library has various presentations, virtual
synchrony, multicast. You can actually ask it to be Paxos if you want it to. It will
support Gossip as well, so you can use pure Gossip mechanisms if you want and, on
top of that, various high-level packages. Very fast pub/sub, a very, very fast data
replication package, and other things could be put in there too. For example, database
transactions, [inaudible] fault tolerance, overlays. I have quite a range that I'd like to put
in. We'll see how far I get.
Now, for security, I've actually decided that since people worry about that and I'm
aiming at mission-critical apps, I should make that transparent. So simply by requesting
that a group be secured, I will secure the group using keys that are generated
dynamically, and only group members can make sense of the data that's transmitted.
We do compression if messages get very large so that we minimize the load on the
network.
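To make those two ideas concrete -- a group key that only members hold, plus compression when messages get large -- here is a small illustrative sketch. It is not the Isis 2 mechanism; it uses zlib from the Python standard library and the third-party cryptography package purely to show the shape of a seal/unseal path:

```python
# Not the Isis 2 mechanism itself -- just a sketch of the two ideas mentioned.
# Requires the third-party package: pip install cryptography
import zlib
from cryptography.fernet import Fernet

COMPRESS_THRESHOLD = 1024          # only compress messages larger than this

group_key = Fernet.generate_key()  # generated when the group is secured and
                                   # shared only with group members


def seal(payload: bytes, key: bytes) -> bytes:
    if len(payload) > COMPRESS_THRESHOLD:
        payload = b"Z" + zlib.compress(payload)   # tag compressed messages
    else:
        payload = b"P" + payload
    return Fernet(key).encrypt(payload)           # non-members can't read it


def unseal(blob: bytes, key: bytes) -> bytes:
    payload = Fernet(key).decrypt(blob)
    return zlib.decompress(payload[1:]) if payload[:1] == b"Z" else payload[1:]


msg = b"blood sugar reading " * 200
assert unseal(seal(msg, group_key), group_key) == msg
```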
So, now, what's the core of my challenge? It comes back to the problems I talked about
at the outset, James Hamilton and Jim Gray and the eBay people being afraid of
instability. So why can I build a stable system if previous systems weren't stable?
Now, the core of my challenge turns out to be this. I need to do better resource management than has traditionally been done. And I want to talk about just
one example of a problem in this space. There are a couple, but we've solved many of
them over the last few years in research. And that's the use of IP multicast as the
fastest possible way to get replicas updated.
I think everyone would agree that IP multicast -- one UDP packet that's received by
several receivers -- is obviously the fastest option, the speed of light for replication in a data
center. But we can't use it because it doesn't work well, and in particular, it's associated
with the constant instabilities we saw earlier.
But we did some studies. In fact, we worked with an IBM research group on that, and if
you look at the top right here, you see a graph that's typical of what we came up with.
We found that if you use IP multicast and send a constant data rate, constant stream of
messages, the rate is fixed, but you vary the number of multicast groups you're using to
send it -- so nothing changes here except the number of IP multicast groups -- in fact,
the hardware breaks. So you can see that happening.
This is a loss rate graph, and you can see that when, on average, nodes in my data
center are joining about 100 IP multicast groups -- a perfectly plausible number --
suddenly loss rates spike and go through the ceiling. They go up to 25 percent.
This, by the way, is what happened with Amazon in the instability problem that they had
a few years ago. The pub/sub product that they were using accidentally wandered into
this space, and with these huge loss rates, no surprise, that as people use pub/sub
heavily, it melts down.
And you can see how insidious this is: because you like your pub/sub
product, you roll it out on a larger scale, and then it melts down one day all by itself.
So, now, how do we handle that? Well, what we're doing is we're managing the IP
multicast abstraction. You see the new blue box below my other boxes. And here's
how the management scheme works. It's an optimization scheme based on the kinds of
results people are getting from social networking. And there's an optimization formula.
If I had a little more time, I'd go through it carefully. But basically what we want to do is
decide who really gets to use IP multicast addresses, the hardware ones that are seen
by the data center, and we're going to do that in such a way that we never overload the
hardware limits.
Against that background, we're going to try to minimize the amount of extra work. In our
case, not using IP multicast forces you to send point to point. So there's a
cost for sending if I tell somebody that their particular multicast group has to send point
to point. There's also a receiver cost if I use an IP multicast address for several groups,
and some of the groups include people who didn't want some of the traffic. They're
going to have to filter.
And the way this works is actually kind of easy to understand. What I've done here is
I've imagined groups as red dots in a kind of high-dimensional space. That's what's on
the top. And subscribers and publishers are the people at the bottom, maybe grad
students in the CS department. And they join some of the groups.
So, for example, there is genuinely at Cornell a thank goodness it's Friday beer group.
They all go out together and drink. Some people drink, some don't. The 1's in the
membership vector are the people who are members of the group that drinks beer.
And if you think about the crowd that drinks beer and the crowd that wants free food,
they may be a very similar crowd. And that's going to correspond to proximity of the
corresponding membership groups, IP multicast groups, in the high-dimensional space.
And the idea now is just going to be to do clustering and assign one IP multicast group
to some set of similar-looking groups. So the little X's represent the IP multicast
addresses.
So I'm going to do that transparently to you or to my apps. And so here we've mapped
some large number of IP multicast groups down to three addresses, but it's at a cost,
right? Because some of these groups didn't have exactly the right memberships, and
beer people are getting messages about food, and maybe they're not hungry, and some
food people don't drink beer and they're getting beer messages.
The sending cost is minimal, though. So now what I can do is I can start to say, well,
find somebody who is kind of an outlier and switch his group to point-to-point UDP, not IP
multicast. He thinks he's using IP multicast. He's actually forced to use UDP. He'll
actually have a higher sending cost, but my filtering cost has gone down because his
application isn't receiving undesired messages. And then I can repeat that until I no
longer exceed my maximum for filtering costs, my sending cost hopefully is as low as I
can keep it, and I'm definitely not exceeding the hardware limits for the data center
routers and so forth, and so I'm definitely not provoking that massive loss phenomenon.
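Here is a toy rendition of that optimization, with made-up numbers and a plain greedy heuristic rather than the actual Dr. Multicast algorithm: logical groups are membership sets, similar ones are clustered onto a bounded number of physical IP multicast addresses, and outlier groups are pushed to point-to-point until the filtering cost falls under a budget:

```python
# Toy sketch only: at most MAX_ADDRS logical groups share real IPMC addresses
# (similar groups share one); the rest fall back to point-to-point sends.
from itertools import combinations

MAX_ADDRS = 2            # hardware limit on physical IPMC addresses
FILTER_BUDGET = 3        # max tolerated unwanted deliveries per message round

groups = {               # group name -> set of receiver node ids
    "beer":   {1, 2, 3, 4},
    "food":   {1, 2, 3, 5},
    "backup": {7, 8, 9},
    "logs":   {6},
}


def jaccard(a, b):
    return len(a & b) / len(a | b)


def cluster(names):
    """Greedily merge the most similar groups until MAX_ADDRS clusters remain."""
    clusters = [[n] for n in names]
    while len(clusters) > MAX_ADDRS:
        (i, j) = max(combinations(range(len(clusters)), 2),
                     key=lambda ij: jaccard(
                         set().union(*(groups[n] for n in clusters[ij[0]])),
                         set().union(*(groups[n] for n in clusters[ij[1]]))))
        clusters[i] += clusters.pop(j)
    return clusters


def filtering_cost(clusters):
    """Unwanted deliveries: receivers on a shared address that didn't subscribe."""
    cost = 0
    for cl in clusters:
        receivers = set().union(*(groups[n] for n in cl))
        cost += sum(len(receivers - groups[n]) for n in cl)
    return cost


def sending_cost(unicast_names):
    """Point-to-point groups pay one send per receiver instead of one multicast."""
    return sum(len(groups[n]) for n in unicast_names)


ipmc, unicast = list(groups), []
clusters = cluster(ipmc)
while filtering_cost(clusters) > FILTER_BUDGET:
    # Evict the group whose membership least resembles any other's.
    worst = min(ipmc, key=lambda n: max(
        (jaccard(groups[n], groups[m]) for m in ipmc if m != n), default=0))
    ipmc.remove(worst)
    unicast.append(worst)
    clusters = cluster(ipmc)

print("IPMC clusters:", clusters)
print("point-to-point:", unicast, "extra sends:", sending_cost(unicast))
```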
So this is an example of an optimization result that we're going to use in Isis 2 to map
what people think of as IP multicast groups or what my library thinks of as IP multicast
groups down to a small number of physical groups. I use ideas like this quite heavily.
I'll just comment that we did a study looking at lots and lots of situations where people
have subscription patterns, pub/sub kinds of patterns, and we found actually that most
traffic and most groups have a very heavy-tailed distribution. You can see from the curves
here on this graph on the right that relatively few groups account for most of the
popularity and actually an even smaller number of groups if you consider traffic.
So a small percentage of the IP multicast use in a real data center covers most of the
benefit. And so it's completely plausible that we can get what turned out to be 100 to 1
reductions sometimes in the number of IP multicast addresses required to fully satisfy
our objectives in terms of speed.
Now, we're doing more. I don't have time today to tell you all about the other things
we're doing. I'll just say a few words about them.
A second problem is reliability. I do have a reliability application and -- I'm not sure what
that is. You can ignore the background noise. And for this reliability goal, you have to
get acknowledgments.
Well, if you send acknowledgments directly to the sender, you get an implosion
problem. So instead we're using trees of token rings. You get a picture of that here. They have
about 25 nodes each. And for very large groups, it turns out this works out beautifully.
We have a paper on it.
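A toy sketch of that aggregation idea -- receivers partitioned into rings of roughly 25, a token folding in each member's acknowledgment state, and only the ring summaries flowing up the tree -- assuming, purely for illustration, that each member reports the highest message it has contiguously received:

```python
# Toy sketch: ack aggregation through rings plus a tree of ring leaders,
# instead of every receiver acking the sender directly (ack implosion).
RING_SIZE = 25


def aggregate_ring(acked_up_to):
    """One token pass: fold each member's state into the token."""
    token = float("inf")
    for highest_seqno in acked_up_to:      # the token visits members in order
        token = min(token, highest_seqno)
    return token                           # the whole ring has everything <= token


def aggregate_tree(per_member_acks):
    rings = [per_member_acks[i:i + RING_SIZE]
             for i in range(0, len(per_member_acks), RING_SIZE)]
    ring_summaries = [aggregate_ring(r) for r in rings]   # one token per ring
    return min(ring_summaries)             # ring leaders aggregate up the tree


# 1000 receivers, one of them lagging slightly behind the rest.
acks = [1000] * 1000
acks[403] = 997
print(aggregate_tree(acks))                # 997: the group has everything <= 997
```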
We have to do flow control. The background beeping was actually that I was running
out of time so I'm going to go through this real fast.
Basically another optimization similar in spirit to the first one lets me keep the data rates
for large numbers of groups below target thresholds. This particular picture is an early
version. We actually tended to overshoot.
Red is the traffic we generated, green is the traffic with our agile flow control scheme.
It's a very similar idea to an optimization that says who can send at what rate. And you
can see yellow was our target. Nowadays we are below our target.
So to summarize, we can build a system that's drastically more scalable and in which
most updates occur with a single IP multicast to the set of replicas. And we do this
using various tricks, my last slide.
And what it gives you, then, is multicast literally at the speed of light with consistency
guarantees. And although I didn't have time to talk about it, with guarantees of stability
as well, and theoretically rigorous guarantees of stability, theoretically rigorous
guarantees of security and performance that actually, if you think about it, drastically
exceeds what we get today in cloud computing systems.
And, in fact, I'll end on the following point. The reason that cloud computing systems I
think endorse inconsistency, embrace inconsistency, is that they don't know how to do
replication at a high data rate. Since we replicate in slow motion, you'd better get used
to stale data. And you replicate in slow motion because you're afraid of reliability
mechanisms.
If I can fix that -- and the argument is that I think we can -- we toss out lots of
pages of code on my side, it turns out, to make things nice and simple, and as a matter
of fact, you end up with the benefit that data is consistent and is being replicated at
speeds maybe hundreds or thousands of times faster than what you're seeing in
modern data centers, which are using things like TCP point to point to move data from a
source to a set of replicas. For me it's going to be direct IP multicast. One replicated
update becomes one multicast. And you can imagine the speed-up, and it's a speed-up
that I'm already beginning to measure.
I've got Isis 2 starting to run here at Cornell. Not quite demonstrable yet, but it will be
soon. And with that, I'll stop and take questions.
And please ask questions at the microphone because I'm not as close as some of your
speakers are going to be today.
>>: Thank you, Ken. Questions in the audience?
We have a timid audience. Any questions?
Yes, we do have one. Great.
>>: Hi. So for doing this clustering to decide which multicast groups you will have, you
have to do that in real-time, right?
Ken Birman: That's right. Yes. We have a paper on this that we're presenting in
EuroSys, and the really quick answer is that it's quite efficiently parallelizable. The paper
is called Dr. Multicast. It's going to be presented next week by one of the main authors. And
what we're able to do is to break up the very large structure. Obviously you get a very
large structure. We break it into small pieces and we have an efficient greedy
algorithm, and this can be done at very, very high data rates. And then you essentially
elect a set of leaders which handle sort of subgroups. You can think of it as kind of a
hierarchical version of the protocol. Large clusters within which we allocate some of the
resource to each cluster and then within the cluster a suballocation to the actual groups
that fall into that area.
>>: Okay. So you don't --
Ken Birman: By the way, I'll mention that IBM and quite possibly Cisco are already
adopting this scheme. There's no IP in the sense of patents involved, so anybody who
wants to read the paper and steal the idea is welcome to.
>>: Okay. Thanks.
>>: Okay. In the interest of time, we should move on to our next speaker, and let's
thank Ken once again for his talk.
[applause]
Ken Birman: Thank you everybody out there. Again, I'm really sorry I couldn't join you.
>>: So our next speaker is Armando Fox from U.C. Berkeley RAD lab.
Armando Fox: What an ego trip. I get to talk after Dave Patterson and Ken Birman and
I filled the room, even though it's a very small room, but still.
So I'm Armando Fox. I do work in the RAD lab, but I also spend some time in the
parallel computing lab, and I wasn't sure which template to use for this set of slides
because the ideas kind of came out of the par lab, but they've really crossed over quite
a bit, and hopefully I'll be able to persuade you of that by the end of the talk.
These are some ideas that we've had on how to make parallel programming more
productive for people who don't think of themselves primarily as programmers. And
although the ideas came out of parallel and multicore, we think there's important
applicability to cloud computing as well. So I'm going to try to give you a little blend of
both of those things.
And as with all good systems work, this is a collaboration with many people, some of
whom are listed here and some of whom I've no doubt forgotten.
So our goal in the par lab is high productivity parallel programming that actually gets
good performance and is sort of accessible to mere mortals, and we kind of begin with
the observation that everybody knows that these very high level languages like Python,
Ruby, I dare say MATLAB, although it's not my favorite language, people like to use
them.
Scientists, for example, like to use them because the abstractions that you can get in
those languages are a good match for the kind of code that you're trying to write, and
various studies, none of them done by us, have shown that you can get up to 5x faster
development time and express the same ideas in 3 to 10x fewer lines of code. For most
of us that probably means a 3 to 10x lower likelihood of bugs. And we're going to
stipulate that more than 90 percent of programmers fall into this category. And that's
probably a conservative estimate. In practice, I think it's probably even a greater
fraction than that.
At the other end of the language spectrum you have efficiency languages or
efficiency-level languages -- we'll call them ELLs -- C and C++ which when I learned C,
which doesn't feel like that long ago, C was a high-level language. Today C is a
systems language. CUDA, which is the language that's used to program NVIDIA
GPUs, or OpenCL, which is a similar open language that's coming out now.
These languages tend to take a lot longer time to develop code, but the payback is if
you're a really good programmer, if you understand the hardware architecture and if
you're willing to kind of work around the language's quirks, you potentially could get 2, 3,
maybe more orders of magnitude in performance because you're using the language's
ability to take advantage of the hardware model.
So far fewer programmers, we're going to say far less than 10 percent, fit into this
category. And the irony is that even though -- in some sense these guys are the scarce
resource for getting the benefits of these languages, and yet their work tends to be
poorly reused. They will come and rewrite an app that someone did in MATLAB, they'll
put a lot of work into it, they'll speed it up by three orders of magnitude, and then that
code perhaps never gets used again.
So we think it's possible to do better than that. Especially because just because you're
spending the 5x more development time down here doesn't mean you will necessarily
get the improvement. It just means you could get it. And, in fact, a lot of people don't.
So can we raise the level of abstraction for everyone and still capture at least part of
that performance gain?
Traditionally the way that people have tried to do this is you code your algorithm, you do
your prototype in something like MATLAB or Python, and then you find an efficiency
programmer, somebody who's an expert at saying I know how to take this problem
structure, sparse matrix-vector multiply, logistic regression, and make it run really well
on this exotic type of parallel hardware.
So a few examples. Stencil/SIMD codes. Great match for GPUs because of their
natural multidimensional parallelism. Sparse matrix. There's been a lot of work on
communication avoiding algorithms, and that's a great fit for multicore where
communication is expensive. At the cloud level, people who do things like big finance
simulations that are Monte Carlo-like, some of those are a great fit for expressing with
abstractions like MapReduce.
Couldn't you use libraries to do this? Sure, you could. But libraries matched to a
particular language don't tend to raise the level of abstraction. They just save you from
writing some lines of code at that language.
C libraries don't work well, for example, with something like Python just because the
abstractions that you can get in Python aren't expressible very well in C. So it's not
really possible to create a library that expresses a higher level of abstraction than what
the library itself is implemented in.
So given that these efficiency programmers -- the experts at doing this -- are the
scarce resource, can we make their efforts more accessible and sort of reusable by
productivity programmers?
Traditionally the way that people have approached this problem is, you know, like
everything in computer science, we do layers, we do indirection, so we have a bunch of
application domains in red, virtual worlds, robotics. We identify that there's also these
domains of types of computations -- so rendering, probabilistic algorithms. And what
we'd like to do is have all those different domains able to take advantage of these
different types of hardware, which today include the cloud platform.
So the traditional solution is: I know, we'll define a runtime and some kind of
intermediate language that is flexible enough to express everything up there, right?
These are quite different abstractions from each other. So this language has to be
general enough to satisfy all those communities. And it has to be general enough to
map to these quite different architectures down here, all while getting good
performance.
Not surprisingly, this goal has been elusive. So here's a proposed new idea that
violates layering. It's always fun to give a talk when you say we're going to violate some
sacred cow of computer science because, if nothing else, you get questions about it.
So here's another way to do it. A couple of years ago the group that started the par lab,
which I am now part of, although at the time it didn't include me, identified that there's a
handful -- let's call it on the order of a dozen, a couple of tens -- of computation patterns that
recur across many different domains.
So our idea is instead of using strict layering to map code that expresses these patterns
down to the hardware, let's punch through the layers selectively.
So if we have a programmer who knows how to take a stencil code and make it run
really well on an FPGA-type research platform, let's let them at it and they create this
thing. They punch through all the layers. There's no assumption of a common
intermediate language. [inaudible] came up with the name stovepipes, although we
thought that term has negative connotations in the IT industry, but whatever.
Anything for a little bit of controversy.
So this is our idea. We're selectively violating -- it's not all to all, right? We're not saying
we can do all of these. We're not saying we can target every kind of hardware. But
we're stipulating that when we can do it, we'd like to do it in a way that makes this work
reusable.
Oh, there we go. Trial balloon. That was weird.
This is a point where people typically ask why is this any different from the arbitrarily
intelligent compiler problem. This is the simple answer. We assume human beings do
these. Each one of those is probably done by one or a small set of human beings.
There's no implied commonality across here. These two might be completely different
individuals who develop them, so looking forward a little bit, we want to try to
crowdsource the creation of these things. So we know these people are out there. They're
not in the majority, but there's still a lot of them.
So here's how we propose to do it. The name of our technique, which just rolls off the
tongue, is selective embedded just-in-time specialization, SEJITS, and the idea is that
we let the productivity programmers write in these high-level languages, but we
have an infrastructure that will selectively specialize some of the computation patterns
at run time -- and I'll get to selectively in a minute. Everything in italics means I'll explain
it shortly.
Specialization takes advantage of information available at run time to actually generate
new source code and JIT-compile it to do just that computation targeted to that
hardware. That's the selective part.
And the embedded part, as we'll see, is you could have done this trick using previous
generation scripting languages, but it would have been a pain because to do it you
would have had to extend or modify the interpreter associated with the PLL, the productivity-level language. This is not
true anymore. Because languages are now both widely used and tastefully designed,
like Python and Ruby, we can actually do all of this stuff without leaving the PLL. And
that turns out to be a boon in persuading people to contribute specializers.
So let me give one more level of detail how it works and then I'll show you a couple of
examples, because it's actually much easier to see by example.
Here's what happens. When my program is running, written in my high-level language,
and a specializable or potentially specializable function is called, first I need to
determine if there's a specializer that exists for whatever platform I'm currently
running on that takes care of that computation pattern in that function.
If the answer is no, that's fine. The program is in a high-level language. We can just
continue executing in that high-level language. It will just be slow, but it will run. That's
really important. And I'll come back to why.
But if we do have a specializer, because these languages have nice features like full
introspection, we actually hand the specializer the AST of the function and it can use
that to do some code generation.
And, remember, the specializers are written by human experts, so they can create
snippets of source code templates that embody their human-level intelligence about
specializing a pattern. What we're doing is source code generation -- syntax-directed
source code generation based on templates provided by human experts, and then we
dynamically link that code, after compiling it, to the PLL interpreter and we can hand the
result back to the PLL. And all of this work can actually be done in the PLL itself.
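A minimal sketch of that dispatch logic, not the real SEJITS framework: a decorator captures the function's AST, looks for a specializer registered for the (pattern, platform) pair, caches whatever the specializer builds, and falls back to the plain Python body when nothing matches. All the names here are invented, and the "specializer" just swaps in another Python callable so the sketch runs anywhere; in the real system it would emit and compile C or CUDA source:

```python
# Hedged sketch of selective, embedded specialization with a Python decorator.
import ast
import functools
import inspect
import platform

SPECIALIZERS = {}     # (pattern, platform) -> callable(tree, func) -> callable


def register_specializer(pattern, plat):
    def reg(builder):
        SPECIALIZERS[(pattern, plat)] = builder
        return builder
    return reg


def specializable(pattern):
    def wrap(func):
        tree = ast.parse(inspect.getsource(func))     # full introspection: the AST
        cache = {}

        @functools.wraps(func)
        def dispatch(*args, **kwargs):
            key = (pattern, platform.machine())
            builder = SPECIALIZERS.get(key)
            if builder is None:
                return func(*args, **kwargs)          # no specializer: stay in the PLL
            if key not in cache:
                cache[key] = builder(tree, func)      # build once, then reuse
            return cache[key](*args, **kwargs)
        return dispatch
    return wrap


@specializable("map")
def scale(xs, c):
    return [x * c for x in xs]                        # productivity-level version


# In real SEJITS the builder would generate and compile C/CUDA; here it just
# returns another Python callable so the sketch runs without a toolchain.
@register_specializer("map", platform.machine())
def build_scale(tree, func):
    return lambda xs, c: list(map(c.__mul__, xs))


print(scale([1, 2, 3], 10))    # uses the "specialized" version when registered
```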
And as I said before, a reason that I think we wouldn't have thought of this maybe five,
six, seven years ago is modern productivity languages actually have all the machinery
that you need to do this inside the language. If you wanted to do this in a language like
Perl -- well, there are many reasons not to use Perl, but if you wanted to, the amount of
hacking that you have to do to the innards of the interpreter to get this to really work is
daunting. With Python and Ruby, that's not the case. And we're betting on Python in
particular for practical reasons that I can tell you about later.
So here's an example of how it would work in real life. We have a productivity app
written in, let's call it Python. Here's the Python interpreter down here in yellow, and it's
running on top of some OS and hardware, which we're not going to say what it is.
Sometimes we'll call a function that doesn't actually do any computation for which
specialization makes any sense, something where there's really no performance gain to
be gotten by calling the function. That's fine. We let the interpreter run the function as it
normally would.
We might also call a function for which specialization is possible. I've used the at sign
notation. For those of you who know Python, it's suggestive of what Python calls a
decorator. It's a way of signaling that this function is special and I'd like to be notified
before the function's called.
So in this case the SEJITS infrastructure will intercept the function call, but it turns out in
this case we don't have a specializer for the function. Okay, we lose. That's fine. Keep
executing in the PLL.
But if we call a function for which there is a specializer, then the specializer can actually
generate new code. For the sake of the example we'll assume that the specializer
knows how to create a C version of whatever this pattern is. That will get run through
the standard C compiler tool chain, the code gets cached so that in the future when you
call the function again, you don't have to redo this step. The .so can be linked to Python
on the fly. Again, you couldn't do this with scripting languages one generation ago,
right? The fact that we can compile, create the .so on the fly and pull the symbols in,
that's relatively new. But it means that we can actually now call the specialized version
of the function and completely bypass the original PLL version.
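And here is the compile-cache-link step in isolation, sketched with an invented toy kernel: generate C source, compile it to a .so once, cache it under a hash of the source, and pull the symbol in with ctypes so later calls bypass the Python version. It assumes a Unix-like machine with cc on the PATH:

```python
# Hedged sketch of the generate -> compile -> cache -> dynamically link step.
import ctypes
import hashlib
import pathlib
import subprocess

CACHE = pathlib.Path("/tmp/sejits_cache")
CACHE.mkdir(exist_ok=True)

c_source = r"""
double dot(double *a, double *b, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}
"""

key = hashlib.sha1(c_source.encode()).hexdigest()
so_path = CACHE / f"{key}.so"
if not so_path.exists():                                 # compile only the first time
    c_path = CACHE / f"{key}.c"
    c_path.write_text(c_source)
    subprocess.check_call(["cc", "-O2", "-shared", "-fPIC",
                           "-o", str(so_path), str(c_path)])

lib = ctypes.CDLL(str(so_path))                          # link into the running PLL
lib.dot.restype = ctypes.c_double

n = 4
a = (ctypes.c_double * n)(1, 2, 3, 4)
b = (ctypes.c_double * n)(5, 6, 7, 8)
print(lib.dot(a, b, n))                                  # 70.0, computed in C
```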
Come on.
So this is why it's selective, right? Not all functions are necessarily specializable. It's
embedded because we can use machinery always existing in languages like Python to
actually do everything that I said. It's just in time because we're generating new source
code.
By the way, why not just generate a binary directly? That would be silly -- we have a good compiler,
right? In fact, I've used .c with CC as an example, but I'm going to show you some real
examples that are working today where this tool chain is a lot more sophisticated. If
you've done any GPU programming with something like CUDA, the CUDA compiler is highly
non-trivial, right? It uses these ungainly C++ templates to convert them into
multithreaded CUDA code. That's a lot of post processing.
So this is a simple example. But in fact, a lot of work has also gone into this tool chain,
and we're able to leverage that directly.
And, of course, specialization means that instead of executing the function as originally
written in the PLL, we're going to execute the specialized version.
So since I said this was easier to see by example, here's a couple of examples. Don't
worry if you don't know Ruby. These are examples that are working now.
Here's an example that takes a stencil computation and attempts to run it on something
like a GPU. This is straight Ruby code. If you run this, it will compute this
two-dimensional stencil over the radius-1 neighbors of some grid. This is a really simple
function. We're just multiplying each point by a constant. No big deal.
But when the function is called, this actually subclasses from a specializable class that
one of our grad students created, and what will happen is the function will be handed
the entire AST of this computation from which it can pull out things like, one, the radius
of neighbors you're doing the stencil over, it can pull out the AST corresponding to this
intermediate computation. Whatever functions you want it to do in here -- this is a really
simple one, but you could have arbitrary code in there defining what stencil you want to
run, and using that AST exactly the same way as a compiler would, you can actually
emit code. In this case the code is emitted for Open MP, and this is what it looks like.
I've stylized it a little bit to make it easier to read, but it's basically cut and pasted.
So you'll have to take my word for it that semantically this computes the same result as
that. The only question for you is which one would you rather be writing.
So we're able to get -- and, by the way, because we know things like the dimension of
the stencil, this is information you don't know at compile time, right? This could be an
arbitrary expression. The fact that we can pull that out at run time means that if we
have different -- entirely different source code templates, for example, for compiling to the
GPU differently based on that constant, we could pick the right variant at run time too.
That's something you can't do at compile time.
So the specializer emits Open MP in this case. Not surprisingly, the compiled Open MP
code is about three orders of magnitude faster than doing it in Ruby, which is the entire
point. And, again, remember, if any step along this way fails, if we have problems
compiling, if there's something in this function that renders it possibly unsafe to
specialize, for example, maybe I put something in here that can't be proven not to have
a cross loop dependency of some kind, fine. We throw up a flag, you run it in Ruby.
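The slide code for this example is Ruby and isn't reproduced in the transcript; as a stand-in, here is a plain Python version of a simplified radius-1 stencil of roughly that shape (here, each output point is a constant times the sum of its four neighbors) -- the loop body being exactly the kind of AST a specializer would lift out and re-emit as parallel C:

```python
# Plain Python stand-in for the Ruby stencil on the slide; the inner expression
# is what a specializer would extract from the AST and turn into OpenMP code.
def stencil_1(grid, coeff):
    n, m = len(grid), len(grid[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            # radius-1 neighborhood; an arbitrary expression could go here
            out[i][j] = coeff * (grid[i - 1][j] + grid[i + 1][j] +
                                 grid[i][j - 1] + grid[i][j + 1])
    return out


grid = [[float(i + j) for j in range(6)] for i in range(6)]
print(stencil_1(grid, 0.25)[2][3])   # 5.0
```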
Here's another example. This one was just -- we got this working, like, two days ago.
So this is sparse matrix-vector multiply in Python. And the idea is I'm going to define a
sparse matrix-vector multiply function where I have Ax, which is just an array of the non-zero
elements of the matrix, and Aj, the indices of those non-zero elements. Ax and Aj are what I'm
multiplying.
And the logic is really simple, right? I have a sub function which takes a column and
multiplies it by the vector. And the way you do that is you just map the multiplication
operator over the column and the vector, and then I just run a map to do one for each
column.
So this is a pretty standard way that you would express in functional terms how to do a
sparse matrix-vector product. It just says gather all the non-zeros, multiply each one by the
corresponding vector entry, and just repeat that for each column, right?
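Since the slide itself isn't in the transcript, here is a hedged plain-Python rendering of that functional formulation. The real slide uses Copperhead's data-parallel primitives; this version uses only built-in map and sum, and the layout is per row even though the speaker says column -- the shape of the computation is the same:

```python
# Plain-Python stand-in for the functional SpMV described on the slide.
# Ax holds the nonzero values and Aj the column index of each nonzero, row by row.
def spmv_csr(Ax, Aj, x):
    def dot_row(vals, cols):
        # gather the needed entries of x, multiply pairwise, and reduce
        return sum(map(lambda v, j: v * x[j], vals, cols))
    return [dot_row(vals, cols) for vals, cols in zip(Ax, Aj)]


# 2x3 matrix [[5, 0, 2], [0, 3, 0]] in a per-row CSR-like layout
Ax = [[5.0, 2.0], [3.0]]      # nonzero values of each row
Aj = [[0, 2], [1]]            # their column indices
x = [1.0, 2.0, 3.0]
print(spmv_csr(Ax, Aj, x))    # [11.0, 6.0]
```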
What happens if you run this through the Copperhead specializer? So this specializer is
actually smart enough to do the code analysis to figure out that the gather operation is
actually supported by something in the CUDA libraries. It generates automatic C++
templates -- so if you're a C++ template fan, God bless you. If you're a scientist, you
should run away screaming from this, because nobody really knows how to use C++
templates well. That's my theory.
So what's happening here is it's actually generating code that's going to go into the
CUDA compiler that uses C++ templates which the compiler will turn into the right
CUDA machine code.
So this is a good example of why you don't want to generate binaries directly. There's a
lot of work involved even getting from this to CUDA, so we've actually raised the level of
abstraction significantly by doing this.
So kind of the message of this example -- in real life we'd probably just package up the
entire sparse matrix vector as its own function, but the example is useful for showing
that the ability to leverage downstream tool chains that do this could actually be
significant leverage.
One last example. So let's actually talk about the cloud now. All I did in this picture -- this is the same one I did before -- but instead of C, I put Scala and the Scala compiler.
Anybody familiar with Scala? More people than I thought.
It's a very tasteful quasi-functional language that made the brilliant engineering decision
to compile to JVM byte code, so it can take advantage of all of the Java
infrastructure for running Scala programs.
One of our students has created a package called Spark which is an extension of the
Scala data parallel extensions API for doing MapReduce kinds of jobs primarily targeted
at machine learning. So all I've done is I've replaced the C compiler tool chain with the
Scala compiler plus this Spark framework that this student has developed, and it runs
on top of a project called Nexus, which is a cloud OS that's being developed in the RAD
lab.
So what does Spark give you? Spark gives you cloud distributed persistent fault
tolerant data structures. What this really means is if I want to run something that looks
like a bunch of MapReduces, I don't have to do a disk write and a disk read between
operations. And if I lose one of the nodes, Spark can reconstruct the lost data because it
knows the provenance of the data. It's kind of a neat set of stuff.
And it relies on Scala. So it's written in Scala, it relies on the Scala run time. And it
relies on Nexus which is this cloud resource manager.
Okay. So why show all this stuff? Because we have another example.
Here's logistic regression in the cloud. Here's, again, a Python logistic regression
function. You see the Python at sign syntax, which means intercept this function for
possible specialization. Logistic regression is conceptually pretty simple. I have a
bunch of points. I want to find a hyperplane that separates them. So I basically start
with a random hyperplane, and at each iteration I compute a gradient and I move the
plane a little bit so that the separation gets better. And I do that for some number of
iterations.
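As a stand-in for the slide, here is a hedged Python sketch of that loop, with all names invented: a random initial hyperplane, a gradient accumulated over all the points each iteration, and a small nudge of the plane. The per-point work is the map and the accumulation is the reduce that the specializer can recognize:

```python
# Plain-Python stand-in for the logistic regression function on the slide.
import math
import random


def logistic_regression(points, labels, iterations=100, rate=0.1):
    dim = len(points[0])
    w = [random.uniform(-1, 1) for _ in range(dim)]      # random initial hyperplane
    for _ in range(iterations):
        grad = [0.0] * dim
        for p, y in zip(points, labels):                 # the "map" part
            pred = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, p))))
            for d in range(dim):
                grad[d] += (pred - y) * p[d]             # the "reduce" (accumulate) part
        w = [wi - rate * g for wi, g in zip(w, grad)]    # move the plane a little
    return w


pts = [(0.0, 0.2), (0.1, 0.4), (2.0, 1.8), (2.2, 2.1)]
labels = [0, 0, 1, 1]
print(logistic_regression(pts, labels))
```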
We're on the way to having a specializer that will take that and generate this. This is
Scala. That's a pretty nice language. But what's interesting to notice here is that we
can figure out, because the technology already exists in Copperhead to do this, that this
operation is really just a reduction. It's the only thing inside the loop. We have a single
initialization step and a single accumulation step. So this is really just kind of a
MapReduce operation, right?
And the amount of code analysis you need to do this is not that much, and it's already
largely in Copperhead. So you might say, well, Scala, actually this is a pretty high level
of abstraction. What's the benefit, really, of going from here to here? With a CUDA
case, you could see why you'd rather write this. But with this case you could argue,
well, maybe I just want to write it in Scala, right? Scala's not so bad.
And this is true, but the difference is that once you generate the
Scala code, now all this machinery for doing resilient cloud distributed data structures
which was designed only to work with Scala now works with Python. The other
example that I gave before using the GPU was also in Python. So I could actually mix
different kinds of hardware platforms into the same program and in fact the program is
agnostic as to which one it's using.
Within a single Python program I have one chunk of stuff that will end up specializing to
CUDA for the GPU, I have another chunk of stuff that may end up doing a cloud
computation. And, by the way, if I'm just running this on a vanilla laptop that has neither
cloud support nor a GPU, it just still runs in Python.
And, by the way, just for MapReduce fans, just for completeness, I had to have this
slide. I've now replaced Scala with Java and I've replaced Spark with Hadoop. This is
probably a nicer way to run MapReduce jobs. Python has map, it has reduce, and you
can run a little Python program on your laptop, and when you're ready to go to the
cloud, you just run it on top of a SEJITS-enabled machine that actually knows how to
talk to the Hadoop master, and map and reduce in Python become Hadoop map and Hadoop
reduce.
Yeah, nice animation.
So why do we think this is an exciting idea for cloud computing? There's a few different
reasons. But really the most exciting one is that you could plausibly have the same
application in a language like Python that runs on your desktop on exotic hardware like
manycore GPU or in the cloud.
If you're doing things like building clouds that have machines with GPU cards in them,
there's now the opportunity to do two levels of specialization. You could do per-node
specialization targeting multicore GPU, but you could also identify computations that
would benefit from being farmed out to the cloud. So you could emit JITable code for
something like Spark, like I showed, or for MPI. And at the single node level, you could
take advantage of things like a GPU with CUDA or with OpenCL. So you're combining
different abstractions targeted at different kinds of hardware, but you're doing it with a
common notation, and you're doing it in the context of one app.
And this is in red for a reason, right? If anything goes wrong, you still have working
Python code that will run in a stock Python interpreter with no other libraries. So one of
the big benefits we believe is that this gives you an incremental way to start testing out
new research ideas.
In fact, in the par lab we have an FPGA-based emulator called Ramp that is designed to
be a research vehicle for testing out parallel hardware architecture ideas, and rather
than having to, you know -- to be able to run a real benchmark, we don't need to get an
entire compiler tool chain and everything else all up and running. All we need to do is
get the C compiler and .so loading working, and then we can decide to create specializers that target
the subset of the hardware whose features we want to benchmark.
And, in fact, we've actually done this. We have a chunk of an image segmentation
application that is running on our RAMP-enabled hardware, and it does it with
specializers that emit a subset of SPARC V8 code for it.
I think I wanted to leave a bunch -- no, wait, these are our questions.
Let me forestall some questions you may have and then we'll take other questions.
So questions that we've gotten -- this is, by the way, very early work. And the examples
that I showed, the first two examples worked today. The third one is on its way to
working.
One question that we get is don't you need sort of an arbitrarily large number of
specializers to do this? We believe in the par lab bet that a modest number of
computational motifs actually apply to many applications; if that bet is correct, then the
implication is that having tens of specializers will actually help a lot of people. So even that would be a useful
contribution.
Why is this better than something like libraries or frameworks? Well, we love
frameworks, and we think it is complementary to frameworks, but as I said about
libraries, if you have a library that's written to be linked against an ELL like C or C++, it's
difficult to imagine that the library will export a higher level of abstraction than what is
conveniently available in the ELL. So libraries may save you from writing code, but they
don't save you from having to work at a lower level of abstraction than you might
otherwise like.
I think I already mentioned, why isn't this just as hard as the arbitrarily smart compiler
problem? It's because it's the people who are arbitrarily smart. What we're trying to
do -- I hate using terms like crowdsourcing, but someone suggested it and now it's kind
of stuck. But we're trying to package the work that these experts can do in a way that
makes it reusable. If we can get -- imagine the open source model where you've
got different people contributing specialization modules to some online catalog and you
can decide which ones you want to download. That's the direction that we hope and
expect that this would take.
Possibly a more interesting question is, you know, our target audience is largely
programmers who today are using things like MATLAB, heaven forbid some of them are
still using Fortran, and some of the examples that I showed definitely are functionally
flavored or they use functional constructs, they use list comprehensions. Are these
programmers really going to learn how to do that?
I think that's an open question, but I also think that there's kind of a 20 percent of the
effort in teaching people about the functional way of thinking about things that will take
you 80 percent of the way, and you can ask me afterward about a program I have in
mind called Functional Programming on the Toilet that I believe will be a fine
educational campaign for this. If you've ever been to Google, they have this testing on
the toilet -- ask me afterward.
But we believe that a modest amount of education will go a long way. And, in fact, the
example that I showed in Python from Copperhead where we're specializing to a GPU,
there's also a fair amount of code analysis there that does things like simple loop fusion.
So, again, we don't want to go down the slippery slope of having to build the world's
smartest compiler, but we think there's an amount of education that will make a huge
difference here. And happily for us, we work at a university where education is one of
our supposedly top priorities. So we have an opportunity to get this way of thinking
ingrained at a relatively early stage in students' careers.
So we believe that SEJITS will enable a code generation-based strategy of specializing
to different target hardware, not at the level of your application but at the level of each
function, presumably. We think that there's a possibility that this could be a uniform way
to do programming with high productivity from the cloud through multicore and
specialized parallel architectures like GPUs, in part because we can combine those
multiple different frameworks into the same app.
And as we've said, even in the par lab so far, it's been a research enabler because as
we develop and deploy new hardware and new OS, we can incrementally develop
specializers for specific things that we want to test and let the other stuff just run in the
PLL because we don't really care too much about its performance.
So the idea is that you don't need a fully robust compiler and tool chain just to get research
off the ground, we've already seen some benefits from it, and we think that there are
probably more waiting down the line.
So with that, I think I have timed it so that there's order of five minutes to discuss stuff.
And thank you for bearing with my melange of different topics.
[applause]
Armando Fox: Don't applaud until after the questions have been answered.
>>: You mentioned that you could write these kind of specializers in the productivity
language. Isn't the population of people you're targeting to write the specializers people
who want to write in C rather than in the high-level languages?
Armando Fox: Well, I said you could write them in the PLL. I said you don't have to.
Having said that, the grad students who have been writing specializers have said that
they much prefer writing the specializers in Python. Right.
>>: [inaudible].
Armando Fox: No, no, the grad students are not typical. But I think, by definition,
specializer writers are not typical.
>>: [inaudible].
Armando Fox: Right.
>>: So your specializers depend on the annotations that we saw there to know when --
Armando Fox: At the moment.
>>: But the idea is ultimately you want to recognize patterns in the code or in the AST
or something like that?
Armando Fox: It's an open question how far the automatic recognition part will go.
Right now we're relying on things like Python and Ruby are object oriented, so if you
subclass from a specializable class, you'll get that for free. But I don't know the answer
to how far -- how automatic we'll be able to make it without annotations. We've been
asked that.
>>: [inaudible] invoking the specializer.
Armando Fox: Right now you have to know to put the magic keyword to annotate your
code, yes.
>>: [inaudible] but I have a question. So actually the attempt of specializing [inaudible]
same layer of architecture that you're using has been done with Temple in '97 and
in the [inaudible] and the hard part -- there were two hard parts. The first one was
debugging. Yeah. The second one was a problem you mentioned before. So how
many specializers do we have to write?
Armando Fox: Well, time will tell if we're right. But the debugging one is a very good
question. So we have been working with [inaudible] who's a languages and language
engineering faculty at Berkeley. One of the approaches we're looking at is when you're
writing the code in the productivity language, there's a certain amount of instrumentation
or debugging or test coverage metrics that you would like to capture at that level, but
once it's specialized, it's not clear that the correspondence -- you know, there may be
no correspondence between abstractions up here and abstractions down here. So
we're looking at are there AST transformation techniques that will be able to either
preserve some of that information or give you better coverage.
So I know I have good test coverage in the PLL, but I want to make sure that when it
runs through the CUDA specializer, I want the equivalent C0 coverage at the CUDA
level. We believe that there may be automated techniques that can at least tell you
what you're missing and help you generate those test cases. So it's not a complete
answer, but we know that debugging and correctness when you go through this
arbitrarily hard transformation is going to be a serious challenge.
Did you have your hand up before for a while?
>>: I didn't understand why you need to [inaudible].
Armando Fox: I always wish I had more slides to show, but there's a slide that would
have kind of what you -- the extreme that you're talking about we think of as auto-tuning
where you've actually done -- at compile and install time, you've done a bunch of
benchmarks, you've figured out code variants that work well on this machine, and you
have a relatively small number of them around, and then at run time it's just a matter of
picking the right one.
One of the things we showed -- let me go back to the example because I think it will
answer the question pretty well. This one.
So the student who actually wrote the stencil specializer, it turns out that depending on
what the value of this constant is, you might choose to tile the GPU completely
differently. So it's not a matter of there's a change -- he would use a completely
different source code template, he might use a different strategy, depending on the
value of something that might not be known until run time.
In this example I put the constant 1, but this could be an arbitrary expression. So I think
in cases where you have enough information to use a pre-compiled compile-time
generated code variant, you should do so. And, in fact, people who do auto-tuning
libraries do exactly that, and we are in fact using SEJITS as a delivery vehicle for making
auto-tuned libraries available to high-level languages. But we believe there are enough
cases where, by taking advantage of run-time information, you'll be able to generate much
better code.
>>: [inaudible].
Armando Fox: So do you need K times N specializers in order to cover K patterns on N
platforms?
>>: [inaudible].
Armando Fox: Oh, I see. So if you have, you know, you know an integrated GPU and a
separate on-card GPU, which one do you -- we don't know. Not yet. We would love to
have that problem, because it means we have more specializers working, but we don't
know yet.
>>: We'll have to have other questions at the breaks. It's time for our next talk.
Armando Fox: Okay. I'll be here at the breaks. So thank you for your attention.
[applause]
Jan Rellermeyer: Okay. So I will talk a little bit about my ideas of elasticity through
modularity and [inaudible]. I'm a Ph.D. student at the Systems Group, ETH Zurich,
working with Gustavo Alonzo [phonetic] and [inaudible].
So if we look at elasticity, that's maybe the key reason to go to the cloud, as we also
heard in Dave Patterson's keynote today. Elasticity is the ability to acquire and, equally
important, to release resources on demand, because, I mean, once your peak load
goes down, you don't want to be over-provisioned. So you have to have the ability to
also scale down your resources.
If you do this at an infrastructure level, well, we have solutions for this that work pretty
well, like Amazon EC2. It's all based on virtualization, isn't it?
For the problem of software, I think that's a lot harder because commodity software is
not really meant to be that elastic. It's not written in a way that it supports this idea of
elasticity very well. Most of the time it's a chunk of software and you can tailor it through
a certain deployment, but that's about it.
Okay. If we look why this is the case, well, we come from a world where basically we
have a single large system and the software is written for a single large system. On this
system we have memory, we have complete cache coherence, so we can do all the
magic that we want.
Once your system -- or once the application has grown to a certain size, maybe it won't
run on a single machine anymore. So we are going to a distributed setup, and then we
also have to make some design choices. So maybe we go to a three-tier architecture
kind of style. We would choose the partitioning that supports this kind of hardware
setup better than what we had. But that already restricts what we can do. It's not the
magic -- the same kind of magic anymore.
If we then go to the cloud where maybe for a programmer, a cloud is like a worst case
distributed system where things can fail all the time, where you have to think of all these
things that can happen, there are certain programming models that work well with
elasticity. For instance, Hadoop, MapReduce or Dryad. But it's not clear for a general
purpose application how to reach the same level of elasticity out of the box.
So what is elasticity about in software systems? Well, we would like it to be maybe as
elastic as a fluid. You know, you put it into a certain glass and if you change the shape,
it will just redistribute itself. There's nothing that you have to do about it. It will just
distribute itself.
At the same time, ideally it would be like delocalization, like I don't even have to care
where my software currently is. It's just there, and I can talk to it. And if anything bad
happens, if I put some forces on the system, well, the system will just redistribute itself
and heal itself and, you know, these are the kind of properties that we would like.
So I think the key to achieve something like this is actually modularity. Modularity has
been discussed like in the 1970s, mostly as an idea of structuring code. It really came
from this thinking about how can we structure code in a better way so that we can reuse
code and so on. More recently, I think the focus of modularity has moved to
deployment time, so to speak, to separate programming in the small from
programming in the large.
And this is exactly the kind of composition problems that were also mentioned in the
previous talk. If you have certain things that are written in a domain-specific fashion,
that's very good, but you have to somehow make them interact. And I think modularity
has shown that this is possible.
Most recently there were some things like inversion of control and plain-old Java objects
which kind of implement the idea of leaving modules as vanilla as they can be and not
tying them to a specific communication platform, not tying them to a specific
implementation of anything, but using an intelligent run time or a container to inject these
functionalities into the modules.
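[Illustrative sketch: a minimal Java example of such a vanilla module; the names OrderProcessor and PaymentGateway are invented here and do not come from any particular container or framework.]

    // A plain-old Java object: no framework imports, no communication code.
    interface PaymentGateway {
        void charge(double amount);
    }

    public class OrderProcessor {
        private PaymentGateway gateway;   // dependency left abstract

        // The container calls this setter at deployment time and injects whatever
        // implementation (local, remote, mock) fits the current deployment.
        public void setPaymentGateway(PaymentGateway gateway) {
            this.gateway = gateway;
        }

        public void processOrder(double total) {
            gateway.charge(total);
        }
    }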
The good thing about modularity is all the tradeoffs are very well understood. There is
this basic idea of coupling versus cohesion, so if you design a good module, you want
to have it as loosely coupled as possible to the rest of your system because then all your
change management, all your code management is always local and it won't affect the
rest of your system. That's the composition idea.
At the same time, we want it to be highly cohesive so that if you're looking for
something, you know where to look. You don't want things to be coincidentally
co-located in the same module if they are just not the same, if they are two different
kinds of things.
So if you go back to my previous example, I think this is exactly how modularity maps to
this. Isn't this kind of cohesion? You want these small functional units that can
redistribute themselves. And for this you need this fine granularity. Otherwise it's not
possible. And isn't this exactly low coupling? Because if things are loosely coupled, well,
then I can change the structure of my system at run time, even without the
application running on top of this layer noticing, because I can rebind these services
and so on, I can rebind these modules in any possible way that I want.
So I have taken this idea and tried to apply this to a complete modularity system. I've
chosen OSGi for a couple of reasons, mainly because I'm most familiar with this. OSGi
is the dynamic module system for the Java language. It's an open standard, pretty well
supported in the industry. For instance, the newest generation of application servers,
some of them not even yet released, will -- most of them will be based on OSGi.
There is the Eclipse IDE, which is just the most widely used OSGi application. But
also in embedded software or mobile phones, OSGi has gained quite some traction.
In OSGi, modules are called bundles, but for practical purposes, you can think of them
as just being Java JAR files, so kind of a deployment unit that you can take and deploy to a
system and run it. The important thing is they contain additional metadata, and this
additional metadata can be a lot, but the important part is they explicitly declare their
dependencies. So that's, on the one hand, kind of a penalty for over-sharing things.
You have to declare them so you will not tightly couple your system too much. You will
at least notice.
On the other hand, the run time system, which is called the framework in OSGi, has a
lot of knowledge about how your modules actually are interconnected. And this
knowledge was originally introduced because the first application for OSGi was the
update problem on long-running embedded devices, like television set-top boxes: you
buy them, and once in a while your operator has to update your software, and you want
to do this in a way where not your entire system has to be rebooted.
And knowing these kind of dependencies permits the framework or the run time to do a
selective update of those components that are actually affected and leave the rest of the
system untouched.
And that's one of the reasons why it also becomes popular with large-scale enterprises.
It's a solution to the update problem, it's a solution to the extensibility problem, because
you can incrementally add more modules while your system remains running, you can
update them, you can remove them from the system, which in traditional Java is not
possible. You can't remove something from a running system. You have to throw away
the class loader. And that's what OSGi does. It takes a single class loader per bundle.
And that's also the idea of isolation, because, of course, once you do this, essentially
you cannot talk between bundles anymore unless you explicitly share your code, and
this shared code is then handled by the framework in a way that you delegate your
calls, your class loading calls, to the class loader that is [inaudible] for this bundle. But
that's more internal.
That's still tight coupling. If you structure a system like this, you will gain some flexibility,
but that's not the kind of elasticity that I am targeting. What you can do, though, is to
introduce something that is a more loosely coupled way of structuring your system, and
that's what OSGi achieves through services.
Services are a lot different in OSGi than in big SOA stacks, because actually in OSGi
they are very lightweight. A service can be an arbitrary Java object that you just register
on one or more interfaces. But once you make the handshake with the run time system
and you have acquired a service, what you get in return is just this Java object, so
there's no penalty involved at this time.
And that's why it became popular in embedded systems. There is some overhead
involved just asking for the right kind of service, doing some filter matching to get the
thing, but once you have a handle on it, it's just an object. It's nothing more.
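[Illustrative sketch: roughly what registering and acquiring such a lightweight service looks like with the plain OSGi BundleContext API; the Greeter interface and its implementation are invented for the example.]

    import org.osgi.framework.BundleActivator;
    import org.osgi.framework.BundleContext;
    import org.osgi.framework.ServiceReference;

    public class GreeterActivator implements BundleActivator {
        public void start(BundleContext ctx) {
            // Register an arbitrary Java object under one (or more) interface names.
            ctx.registerService(Greeter.class.getName(), new GreeterImpl(), null);

            // Acquire it: after the handshake with the framework, what comes back
            // is just the object itself -- calls on it carry no extra cost.
            ServiceReference ref = ctx.getServiceReference(Greeter.class.getName());
            if (ref != null) {
                Greeter greeter = (Greeter) ctx.getService(ref);
                greeter.greet("OSGi");
            }
        }

        public void stop(BundleContext ctx) {
            // Services registered by this bundle are unregistered automatically.
        }
    }

    interface Greeter {
        void greet(String name);
    }

    class GreeterImpl implements Greeter {
        public void greet(String name) {
            System.out.println("Hello, " + name);
        }
    }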
And then, of course, OSGi, since I told you you can update and remove things at run
time, it's a very dynamic platform. So in the application model of OSGi, there is this
dynamism built in, deeply built in, into the system. A regular OSGi application that
wants to play well has to monitor the system's state and subscribe to certain events. For
instance, once you hold a service, you should listen to events. If the service goes away,
you should be prepared to just fail over or do whatever you can, but you have to listen
to these kind of things.
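[Illustrative sketch of that dynamism, using the standard ServiceTracker utility and the Greeter interface invented above: the client subscribes to service events and reacts when the service goes away.]

    import org.osgi.framework.BundleContext;
    import org.osgi.framework.ServiceReference;
    import org.osgi.util.tracker.ServiceTracker;

    public class GreeterClient {
        private final ServiceTracker tracker;

        public GreeterClient(BundleContext ctx) {
            tracker = new ServiceTracker(ctx, Greeter.class.getName(), null) {
                @Override
                public void removedService(ServiceReference reference, Object service) {
                    // The service went away (for instance, its bundle was updated or
                    // stopped): fail over or degrade gracefully here.
                    super.removedService(reference, service);
                }
            };
            tracker.open();
        }

        public void greetIfAvailable(String name) {
            Greeter greeter = (Greeter) tracker.getService();
            if (greeter != null) {
                greeter.greet(name);
            }
        }
    }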
So that's the running OSGi system. You can consider that to be an application
consisting of two modules, your bundles, and the bundle on the right uses the bundle on
the left through a service that has been registered with the framework's central
service registry. That's a standard kind of SOA, I think. That's why Gartner
[phonetic] calls OSGi the in-VM SOA. It's applying the same kind of things to a
small-scale virtual machine.
The system is dynamic. I told you this. But it's still far away from anything that we could
expect in the cloud environment.
So why are software modules anyway interesting for the cloud? Well, first of all, the
cloud, it has a problem of software -- sorry, of software component lifecycle, hasn't it?
You want to provision your software at any time to the cloud, to your cloud application,
and maybe you even have to update your cloud application on the fly. And that's
something that the OSGi kind of thinking applies very well to.
There is also the problem of composition. Once you have developed certain things that
you want to reuse, well, you have to make them communicate, and services are a very
intuitive way of designing interfaces between your components so that they can talk to
each other.
But now I said OSGi works only on a single virtual machine. Well, that's the state of the
art as it was. In the cloud, we typically have, like, a big data center, a big distributed
system where things just randomly fail.
But for these kind of purposes we earlier developed an extension of OSGi which is
called R-OSGi, which can transparently turn any kind of service invocation into a remote
service invocation. And the nice feature about this is that since local OSGi applications are
already prepared to handle the removal of a service gracefully, all that we have to do,
let's say in case of a network failure, is to map this consistently to the kind of
event that the application can already handle. So we can, for instance, just remove the
service proxy, and that would be equivalent to an administrator who has just removed
the service from the system, because that's what has happened from the perspective of
the client.
So that was the first step. But, of course, we want to go further. And this further means
that we want to assimilate much of the complexity that arises from such a system and
from programming such a system into a run time system so that the knowledge can be
reused at any time. And that also means that we want to try to keep modules as plain
and vanilla as they are and tie them as little as possible to a specific communication
paradigm, to a specific consistency model, to anything that is really specific to one
deployment, because also one of the ideas in cloud computing is that your system will
change. It will hopefully change from a small deployment to a large deployment over
time, but that also means that you have to adjust the tradeoffs that you made in every
step of this evolution.
And there are a couple of not very highly cited but still interesting papers from
practitioners from different cloud platform consumers that just explain how their system
setup has changed when they went from a million users to 2 million users to 10 million
users. The bottom line is they essentially rewrote half of the system every time they took
one of these steps.
And I think that's the challenge of software engineering, to evolve -- to avoid this
one-term programming and go to a more portable way of expressing things.
So we have tried to implement this in a prototype system that is called Cirrostratus, and
that you could consider to be a run time system for elastic modules. That's built around
the OSGi view of the world. It supports modules and services. But most importantly, it
presents itself to a client as a single system image.
So if you deploy Cirrostratus to a set of nodes, let's say on EC2, you can talk to any of
these nodes and you will get a consistent picture of the system. In the same way, of
course, you can add some new nodes and they will just join this overlay that the run
time forms.
The stack required is a JVM and then a local OSGi framework. Because, of course, we
wanted to leverage as much as possible from what a local OSGi framework can already
do. What we put on top is a virtualization layer essentially that deals with all the
distributed systems behavior that we need.
By doing this, we end up having virtual modules and services, which is a lot different to
a traditional application. And I will explain this later. But we also run into the problem
of state that has to be handled in our system, and we have some thoughts about how to
handle this.
But in order to be really elastic, it's not enough just to deploy this thing and then hope
that it will run well. We have to continue to monitor the system. We have to
continuously see how the system performs and if there are any new resources that can
be acquired. And for this we are instrumenting the code a little bit and triggering
redeployment whenever a controller tells us to do so.
So that's the big overview. From an application perspective, that's the view that an
application sees. So if this is your fancy application that is written in a certain modular
way, let's say it consists of three modules, that's the expectation of a single system,
isn't it? You want to see these three modules and they are, let's say, connected to
services.
But that's only the virtual deployment. That's like the contract between the application
and the platform, implicitly expressed. What we can do on the physical layer is we can
transform this into any kind of physical deployment that fulfills this implicit
contract. We can arbitrarily replicate modules as long as we do the bookkeeping in the
back. We can arbitrarily choose to co-locate modules on the same machine. Because
here we have a heavy usage pattern, maybe here it makes sense to co-locate these two
modules.
If we have a module here that has some valuable data and we need high availability, of
course we can create a lot of replicas of it and keep track that we never run, let's say,
under five replicas. But the important message here is this can all change over time,
because your application does not have a steady workload. If this goes down to a very low usage
pattern, there's no reason not to move this module away to a different machine in order
to maybe get a more scalable system.
So these are all kind of tricks that we can play, and it's kind of transparent to the
application because the application sees this virtual deployment, and what we do under
the hood is the physical deployment that can react to all these network and platform and
workload changes.
So now I told you the big problem here is, of course, state, because on the physical
layer, of course it makes a huge difference if you have one service or ten services. If
you just constantly invoke functions on one service, well, all the replicas will not see the
same consistent picture. So you need something that deals with the state.
And if we go back to my kind of examples that I gave for scalable cloud applications or
distributed system, most of them are actually built around a very specific and tightly
expressible model of state. I mean, in Hadoop and in Dryad, it's very explicit. You can
see these boxes and how state is passed through these.
In three-tier, we try to be as stateless as possible. We externalize most of the state into
the database and then we just keep a session. That's the model of the world in these
kind of things.
And, of course, for our system, feel free to pick any way to express your state, and you
can write an adapter to make this accessible to the system and you are done. But I think
there are also kinds of applications, maybe legacy applications, where you can't or don't
want to do this.
And for this we have tried to take an extreme approach and infer this state from these modules,
just to see how far we can drive this vanilla module kind of idea. And what we do with
these modules is actually we will do an abstract interpretation based on the idea of
symbolic execution.
So once you load a module, there is a small interpreter that will operate on symbols,
sweep through your code, and try to apply all your instructions to the symbolic stack and
to a symbolic heap.
So what we will get in the end is at any time here in this small little program we will see
what kind of impact this instruction has, and we can filter out those that have an impact
on state and those that haven't. And that's what's important, because we don't want to
replicate the entire heap. That's much too expensive. We just want to replicate those
operations that are actually dealing with distributed state.
So that's a very simple example, and maybe here you can read that I'm accessing
fields, fields of the service, and that would be the object oriented way of expressing
state, isn't it? In an object oriented language, much of the state at run time is kept in
fields of the services and in the transitive closure that is attached to all these fields.
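[Illustrative sketch only: a made-up Java service whose distributed state lives in its fields; the writes to fields, and to objects reachable from them, are the operations such an interpreter would flag.]

    import java.util.HashMap;
    import java.util.Map;

    public class CounterService {
        private long totalHits;                            // state lives in fields
        private final Map<String, Long> hitsPerUser = new HashMap<String, Long>();

        public void record(String user) {
            totalHits++;                                   // field write: state-changing
            Long count = hitsPerUser.get(user);
            hitsPerUser.put(user, count == null ? 1L : count + 1L);
                                                           // mutates the transitive closure of a field
        }

        public long total() {
            return totalHits;                              // read-only: nothing to replicate
        }
    }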
But this can get arbitrarily complicated because, I mean, it's a high-level language, so
you can [inaudible] your state in any possible way, you can call into different
methods. Symbolic execution is capable of filtering out much of this.
There is a limit, like if you're doing purely virtual calls, the interpreter will not be
able to figure out what the target of your call will be at run time. So there are some
tradeoffs involved, but in practice it works quite well for a lot of applications.
And once we have gathered this insight into the structure of the program, well, that's
exactly what the application was designed for if it would run on a single system. So we
can transform this into an application that runs now on a distributed system by just
instrumenting the code, rewriting the bytecode, and now weaving in exactly the kind of
state replication mechanism that you choose. But the important thing is you choose it at
run time.
So, for instance, you can make this entirely transactional, you could do a weakly
consistent replication scheme, anything that you like. The knowledge about
where the state is stored, that's the key thing that you need, and you can rewrite the
module.
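[Hypothetical sketch of what the rewritten bytecode might correspond to at the source level; Replicator and its propagate method are invented names, not the actual Cirrostratus API.]

    // Invented interface standing in for whatever replication scheme is chosen
    // at run time (transactional, weakly consistent, and so on).
    interface Replicator {
        void propagate(String field, Object newValue);
    }

    public class ReplicatedCounterService {
        private long totalHits;
        private final Replicator replicator;   // woven in by the run time

        public ReplicatedCounterService(Replicator replicator) {
            this.replicator = replicator;
        }

        public void record(String user) {
            totalHits++;
            // Call inserted by the bytecode rewriter: only the operation that
            // touches distributed state is propagated to the replicas.
            replicator.propagate("totalHits", totalHits);
        }

        public long total() {
            return totalHits;                  // untouched: no state change here
        }
    }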
And since we are already rewriting it, it seems natural that we also introduce the
performance probe at this time so that we can continuously monitor the performance.
This is also very specific to the kind of replication that you use because choosing one or
the other replication scheme has some impact on, let's say, the network utilization and
so on.
So we can do this all at the same place. In the end, what you would like to have is a
controller that reads all this performance data and then triggers some action.
In our platform we have some actors built into the system that a controller can call, but
for practical purposes we are not saying we have the best possible controller. We
believe that most of the time the controller should be something that is written
specifically for an application because that's actually the glue code that tells you how to
do the specific part of your application and how to turn this into an elastic deployment.
What we will give you is the platform to do so and the promise that you do it once and
you don't have to do it again if you change your deployment.
So we have implemented some smaller use cases. That's actually one of the larger
ones that is a bit interesting. We have taken an online game written in Java which is
called Stendhal. That's a typical client server kind of online game. So you have a big
Java server and a client application and -- well, you connect to the server and then you
have some interaction and some protocol going between the client and the server and
you can play this game.
Well, we turned this into a modular system in the simplest way that you could imagine.
So it's trivial. We just took one module to be the client, one module to be the server.
That's not really fine-grained. That's not what we envisioned, but it's kind of the
baseline of what we can do.
And, of course, we replaced all the explicit communication through just services
because that's the idea of loose coupling. You don't want to specifically use a protocol,
just call the service and let the run time figure out how to map this to a call to the server.
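[For illustration only: the interface and method names below are invented, not Stendhal's actual code, but they show the loose-coupling idea of programming against a plain service and letting the run time decide whether the implementation is local, remote, or a replica.]

    // Invented game service interface: the client calls it like a local object.
    interface GameWorld {
        void move(String playerId, int dx, int dy);
        String describeSurroundings(String playerId);
    }

    public class GameClient {
        private final GameWorld world;   // obtained as a service from the run time

        public GameClient(GameWorld world) {
            this.world = world;
        }

        public void step(String playerId) {
            world.move(playerId, 1, 0);
            System.out.println(world.describeSurroundings(playerId));
        }
    }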
Okay. So if you see, this is the setup running one client, one server, and if we measure
the -- I think this is here the latency and this is the traffic on the network. Well, that's the
picture that you get when you run this as a client server application.
And now time moves on and a second player joins the game. While the latency doesn't
really change a lot, of course, the aggregate latency has doubled, but we see we get
some more network traffic, of course.
Okay. Now, this application is very interesting for our purposes because the amount of
state shared between these two clients is actually a function of the game. Because if
these two players are pretty close to each other, there's a high chance that their action
will kind of interfere with each other, that there are some state changes that the right
client does and the left client wants to see immediately.
But if they are playing in completely different worlds of this game, there is almost no
shared state at all. And we think that there are quite a lot of applications that are
inherently working with these kind of operations. It's not a static thing, this shared state.
It's really a function of the workload.
So in here now, since maybe we have -- well, let's say we try now to replicate the server
to the left client, because the system has noticed that the potential traffic that is caused
through the replication of the items in the server is lower than the traffic that we
currently have in the client server fashion, then we get to this picture.
So here we have increased the latency for this client because it now has a longer path
to the server, but, of course, the latency for the left client has significantly dropped, and
we get a different picture in terms of network utilization.
And now if we are in the situation that actually they don't share too much state, we can
also completely leave out the server and just gradually transform the system into a
peer-to-peer system.
And in this example, the players, as we did these measurements, were not very close to
each other, so until the end you see that the aggregate latency has dropped and that
the aggregate network utilization has also dropped. So the system was able to optimize
this particular deployment and reach a benefit.
Yeah, so currently this is all done in Java, but, of course, we want to generalize many of
the concepts beyond Java and OSGi. We have kind of done this for OSGi services and
implemented them in, let's say, C and nesC for TinyOS and made them communicate
with a traditional OSGi framework through the R-OSGi protocol, but that's just one side
of the thing.
What we would like to have is a kind of OSGi run time for completely different
environments, and we are currently porting many of the ideas to the .NET CLR. And
that's a project supported by the Microsoft Innovation Cluster for Embedded Software in
Switzerland.
The next thing, of course, we want to build more interesting applications. That's why
we're porting parts of the .NET Compact Framework to Lego Mindstorms, which we
have previously used successfully in Java, and we want to kind of build swarms of
intelligent robots that are interacting through this kind of shared run time abstraction and
see how we can simplify the development of applications for such a complicated
distributed system through this.
And, of course, my personal future work is to graduate this year, and I'm confident I will
do so.
Okay. So this brings me to my final conclusion. I hope I managed to -- well, to bring the
message that software elasticity is challenging. I think that modularity is actually the
key to facilitating elastic deployments of software. Especially elastic
redeployment. And I hopefully also have shown you that we can actually mitigate some
of the complexity problems and put them into an intelligent run time like we have shown
in Cirrostratus.
Thank you.
[applause]
>>: We have time for a couple of questions.
>>: [inaudible].
Jan Rellermeyer: I don't think that you can build software without implicitly modularizing
your software. Because what you're doing is you're still isolating functional units from
your system and presenting this as a service. In OSGi there is a run time system
behind it, and the reasons are that you want to do this in a flexible way, you want to be
able to control the lifecycle of your deployment. You could live without this, but I think
that it has a lot of benefits for cloud environments to do so. But, yeah, the answer is, try
SOA without modularization; I wouldn't know how to do that.
>>: [inaudible].
Jan Rellermeyer: Yeah. Okay, that's a different question. You can try to design your
system entirely stateless. The question is does this really simplify your program
development. I'm not always sure. Because then you're pushing a lot of complexity into
dealing, let's say, with the session and acquiring the right state from the -- I mean, the
state hasn't disappeared. It's just externalized, you know? But as I said, I mean, I'm not
advocating the state inferencing as the one and only solution. You could also employ
your own way of describing or even avoiding state.
Okay.
[applause]
Rosa Badia: Thank you, Roger.
So my proposal is to present a view of how we can program clouds. So I'll start by
explaining a bit what a star superscalar programming model is, and then I'll more specifically
move into the COMPS superscalar framework, which is kind of the [inaudible] of this
programming model, and also present EMOTIVE cloud, which is the software solution
that we're developing at BSC for clouds. And then how we see to [inaudible] the
superscalar programming models towards the use of clouds and service oriented
architectures.
So the idea of the star superscalar comes from superscalar processors.
Most of you already know about superscalar processors. What happens is that
although we have a sequential code, at run time we have different
functional units that will execute -- issue the instructions in parallel, do things like
speculative execution [inaudible] and many other optimizations.
The important thing is that, at the end, the result of the application is the one that the
programmer intended. So taking this idea into account, we are trying to apply it to
the star superscalar programming model. And then what we have is a sequential code,
and our parallelization is based on tasks. So for this, right now, we need the
selection of the tasks.
So the programmer will select from the sequential code which functions are a task, and
for this they will also give the direction of the parameters, if it's an input or an output or a
parameter that's written [inaudible].
From this, at run time, so at execution time, what we build is a task graph that takes into
account the data dependencies between the different tasks. This means that we
are using the information about the direction of the parameters and, with the
actual data that's accessed by the tasks, we derive what the different data
dependencies are.
This is done at run time. It's important to understand that this has to be done at run time
because we are following the real data dependencies, the actual data. So here we
have different instances of T1, and the data that is written by this T1 is different from
that of this T1, okay?
We also apply [inaudible] renaming, for example, to reduce the dependencies --
renaming is like replicating data. So what we do is eliminate the false data
dependencies, and other optimizations.
Then when we have this task graph, we don't wait to have all the task graph. When we
have a part of this task graph, we start scheduling the different tasks on the parallel
platform. This can be a multicore, it can be a cluster, it can be [inaudible]. We apply
the same idea to the current parallel platforms. So we do the scheduling of the tasks. If
necessary, we will also program the different data transfers. We will try to exploit things
like data locality. We apply things like prescheduling or
prefetching of data in order to have the data there before the tasks really start, and other things.
So then we will also need, whenever a task has finished, that it is notified to the main
program, to the run time, in order to change the state, update the data dependence
graph, and continue the evolution. Of course, we cannot start a task until the
dependencies -- the data dependencies have been solved.
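[Illustrative sketch: the method and file names are made up, but the pattern shows how plain sequential code plus declared parameter directions lets the run time discover the true data dependencies -- the compute calls are independent and can run in parallel, while each merge waits for the compute that produces its input file.]

    public class SequenceDemo {
        private static final int FRAGMENTS = 4;

        public static void main(String[] args) {
            for (int i = 0; i < FRAGMENTS; i++) {
                // task: IN query_i, OUT partial_i  -> independent instances
                compute("query" + i + ".txt", "partial" + i + ".txt");
            }
            for (int i = 0; i < FRAGMENTS; i++) {
                // task: INOUT result, IN partial_i -> waits for the matching compute
                merge("result.txt", "partial" + i + ".txt");
            }
        }

        // Ordinary sequential implementations; the run time intercepts these
        // calls, builds the task graph, and schedules them on the platform.
        static void compute(String inFile, String outFile) { /* ... */ }

        static void merge(String accumulatorFile, String partialFile) { /* ... */ }
    }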
So we have different instances of this programming model. We started with a grid a few
years ago, like six or seven, and we have stable versions of grid superscalar that are
running on [inaudible] on our supercomputer in Barcelona. It's used for production runs
for applications that are not inherently parallel, like you can have workflows or
[inaudible] parallel applications that a lot of our life science users have, and other users.
COMPS superscalar is the evolution of these. We'll see what the future of COMPS
superscalar is. And we have another family that are more based on -- use more of the
compiler technology and are tailored for homogeneous multicores: SMPS
superscalar, or Cell superscalar or GPU superscalar, which are very specific for the Cell
or the GPUs, and [inaudible] like, for example, Nested superscalar is a hybrid approach
that combines two of these families, SMPS superscalar with Cell superscalar, having
two levels of tasks. So we have smaller tasks inside bigger tasks. The smaller tasks
run in the [inaudible] of the Cell and the bigger ones run on the SMP.
So I will explain now COMPS superscalar. COMPS superscalar is a reengineering of
grid superscalar that we did on the [inaudible] project in order to use the grid component
model that was designed and made to be implemented by this project. So we
used the Java programming language, so it's basically everything in Java for this version.
We used ProActive as the underlying middleware to build the run time as a componentized
application. And we used, for example, Java GAT or [inaudible] as the middleware for
job submission and file transfer. This case will be for the grid, for example.
So this is how it looks. For a COMPS superscalar application, it's important to note that,
for the code itself of the application, the objective is that we don't have to change it. It can
remain regular Java code without any need to add any call to any API or any specific
change.
The constraint that we put on the application is that the pieces of the application that
have to become tasks have to have a certain granularity, if we take into account that we
want to run it in a distributed infrastructure and we need [inaudible] to transfer the data
through the internet. So we need a given granularity.
Then what we need is that the programmer does the selection of the tasks, which is made with
an annotated [inaudible] Java interface. Here what we say, for example, in this case, is
that genRandom is one of the tasks of the application and it has an input parameter --
an output parameter, sorry -- f, which is a file. Okay? So this is what is used then.
And we can also give constraints to the task. For example, this will mean that this task
needs to run on a platform that has Linux as the operating system, and we can put all
types of constraints, not only on software; it can be on hardware or memory available or
on other things.
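[Hypothetical sketch of what such an annotated task-selection interface might look like; the annotation names (@Task, @In, @Out, @Constraint) are invented here for illustration and are not the exact COMPS superscalar API.]

    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;

    // Stand-in annotations, declared here only to keep the sketch self-contained.
    @Retention(RetentionPolicy.RUNTIME) @interface Task { }
    @Retention(RetentionPolicy.RUNTIME) @interface In { }
    @Retention(RetentionPolicy.RUNTIME) @interface Out { }
    @Retention(RetentionPolicy.RUNTIME) @interface Constraint { String operatingSystem() default ""; }

    // Task selection: each method is a task, each parameter declares its direction.
    public interface HmmerItf {

        @Constraint(operatingSystem = "Linux")   // run only on resources with Linux
        @Task
        void genRandom(@Out String fileName);    // f: an output parameter of type file

        @Task
        void hmmerSearch(@In String sequenceFile,
                         @In String databaseFile,
                         @Out String resultFile);
    }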
So at run time what happens is that our Java application is -- it's loaded, like custom
loaded, and using the annotated interface, by means of Javassist, we can intercept the
tasks that have been annotated in the Java interface. So when we intercept a task, we
know that this is one of the tasks of the application, and instead of [inaudible] to the
original code, it will insert a call into the COMPS superscalar run time and then it will --
I think there is an omission. Here it is better. So whenever we intercept one of the
tasks, the custom loader will put a call on the API on the run time, and this will be the
one that will add a node in the task graph for the [inaudible] dependencies. And
whenever this task is ready, it will be submitted to the scheduler in order to put -- to see which
resources we have available that can run this task. And, actually, at the end the job
manager will be the one that will do the [inaudible].
We have also a component that is responsible for all the file management of the data
that is necessary. This is used both by the task analyzer to detect the different data
dependencies and also at the execution time to know where the files are and if there is
a need for data movement or not.
So this is already also using [inaudible], for example, we can be using grids. And here I
was showing an example of a real application that we developed and that has been used
by MareNostrum users. It's a -- initially it's a sequential application, but it's very easy to
parallelize because basically what you have is a set of sequences and you want to do
all the queries against a database, so it's not difficult to parallelize.
In our case what we did is we went into the code. First we have a piece of the code that
will split the different sequences and databases, then we have a set of calls to
HMMER, and then we have a set of calls to merge the results of the [inaudible]
HMMERs.
So here is the Java annotated interface. You see here, for the different tasks that we have
selected, that you have the parameters and then we identify the type of the parameter
and the direction.
So at run time what we will have is that it will be a graph like this one, and then the
different tasks that can run concurrently will be scheduled and run in parallel.
This is a chart that shows a bit of the actual performance in terms of the speed-up.
Although we still would like to push the line of the COMPS superscalar version a bit higher,
it's good that -- for example, we have seen that it's scaling much better than the MPI
version that was available on MareNostrum.
The important thing is that this was then used by the EBI to run a real production run with
seven and a half million protein sequences, using a big database that they come
from. And it was used by these people without any problem and was very useful to
them.
So this was my first point. The second is we have also this other project at the BSC.
The EMOTIVE cloud is -- the idea is to develop middleware for a cloud. It's an open
source project, and the idea is that it can be used for our research but also for other
people's research.
The architecture of EMOTIVE is basically what you see in this picture. We have three
different type of components. We have data infrastructure, basically what the
[inaudible] like a file system, a distributed file system, like NFS or other similar ones. Then
we have the different components, basically a [inaudible] monitor and the resource
manager, that take care of all the virtual machine lifecycle management. I
mean, they take care of creating the VMs, monitoring the VMs, their destruction, and also the
data management, and we have right now also support for migration of VMs in case of
failure or in case that we -- if we are running a virtual machine and it happens that the
user requires more computation, more memory, whatever, and the actual physical
resource is not able to provide this, then the VM can be migrated to another physical
resource. And we also support checkpointing.
And there are right now three different schedulers implemented for this platform: SERA,
ERA, and GCS. So SERA I will explain a bit more.
SERA is basically composed of two different components, SRLM and ERS. Both of
them use agent technology and semantic technology.
Was it me?
>>: [inaudible].
Rosa Badia: [inaudible]. It wasn't me.
>>: [inaudible].
Rosa Badia: Well, it's still 20 minutes for the [inaudible]. Okay.
So now everybody is awake. Even me.
So the important thing here is, we have a user that wants to submit a job to EMOTIVE. It
will submit that [inaudible] to our SRLM. Then this user will have a set of requirements
on the execution of this task. So we use semantic information about the resources, together with
the requirements of the user, to see which are the resources that can run this task. And
then the ERA will be the one that will actually submit this job to the cloud. This is the
EMOTIVE cloud, so this will be the one responsible for asking the [inaudible] the resource
manager to start the virtual machine, to start the task, et cetera.
So what is the goal? The goal is now to try to move COMPS superscalar on top of EMOTIVE.
Initially this is very easy because COMPS superscalar initially, when you start, can run
on top of any platform. You just need the IPs that describe the resources that you need
to execute. So then if you have a virtual machine and you get the IP, you can put it into
the description that is read and it runs. So the first step is very easy. We just have to
set up virtual machines on top of EMOTIVE and then COMPS superscalar can run tasks on
top of these virtual machines. This is already running and it works nicely.
The next step will be, well, can we take advantage of the fact that EMOTIVE takes into
account the elasticity of the virtual machines and that can take into account the different
resource requirements of the tasks. Of course, we can. So what we want is that
COMPS superscalar, when it starts execution of the application, interacts with SERA,
the scheduler, and will request which type of virtual machines it needs. Not virtual
machines that have been established before, but it will say, well, this task is a
heavily computing task, so we need a bigger virtual machine; or maybe I see that now my
[inaudible] is very parallel, so I need more virtual machines, so on demand I would like
to ask for more virtual machines on my platform; and other things, or I will see that I
have a deadline or timetable to meet. So this is what we are working on now.
The next step. And this is our envisioned architecture, how we think that we want to evolve
COMPS superscalar. So the idea will be we want the different tasks that can run on the
cloud. This is what we already have seen. It's already working.
The other is that we also want the COMPS superscalar run time to run on the cloud,
and similarly to the task, when we said that you can have different number of virtual
machines running all depending on the requirements of the application, we would like
also to have more COMPS superscalar run times running or less depending on the
number of applications that are running in the cloud.
So for these, then, we'll have that instead of having one COMPS superscalar run time
per application, which is the case now, it will be that we have a server COMPS
superscalar run time. So a COMPS superscalar run time should be able to serve more
than one application at a time. But, of course, if we have a lot of them, then maybe we
would like to have more than one.
This will be offered through a web service container. This will be like the server side.
Then we'll have the application side. This initially was still here. Now we're sort of
having just COMPS superscalar applications running on top of this idea.
However, we've seen -- our experience has been that the idea will be, well, we
have core services that are simple services that are offered in the cloud, for example,
running, and then we have applications that compose these services and make them a
bigger application. But, also, we'll have that these COMPS superscalar applications can
be offered as a service.
So this is one thing I forgot to tell, by the way. The idea will be that the COMPS
superscalar run time now only has tasks. The work units are the tasks of the application. The idea is
that these tasks can also be services in this next evolution of the run time.
So what we foresee is that we'll need, for this, first to have a run time that's able to
compose web services. There is a lot of literature on this. Our objective is not to do
new things on this but to offer the different behavior, the dynamic behavior, of the
COMPS superscalar run time that is able to build the task graph dynamically, to exploit
this idea together with the composition of web services. And I think this will be different
from what has been done before.
The other thing that we need is we want a graphical interface to help the programmers
on all this. Right now, basically, the development is writing the Java application in a
regular environment. We would like to have tools to ease the deployment of the
applications and also tools to ease, in general, the development of the applications.
So, well, basically I went faster than I thought. To conclude a bit, COMPS superscalar,
it's a platform that enables programming of applications on a wealth of underlying
platforms. So we can have below a grid or we can have a cluster and hopefully we can
have also a cloud.
The idea is to evolve COMPS superscalar by means of using the SERA scheduler of
EMOTIVE, to be able to use these on top of federated clouds. As an example, I want to
mention that although we've seen EMOTIVE in an environment of the cloud, from
SERA we can also [inaudible] tasks not only to virtual machines or [inaudible] but also to
other types of clouds, like we demonstrated on the projector.
And, further, we want to evolve COMPS superscalar towards what will probably be a
superscalar that enables this composition of services by means of using this graphical
interface to help the development of applications, and also to evolve the run time to enable
all these new requirements on service composition and service invocation.
So an important thing is that this is open source. Right now we don't have COMPS
superscalar available on the web yet, but we will soon. We have grid superscalar
available and also EMOTIVE cloud is available. In the future evolution of this,
everything will be open source.
Thank you.
[applause].
Rosa Badia: Yes?
>>: Question. So when you're talking about deploying an application against some
back-end run time, your system would actually choose the [inaudible].
Rosa Badia: Yeah. Initially, but right now it's in the application, we have this interface
for the constraints, but the idea will be that we can expand this probably using SLAs.
>>: But this is just articulating [inaudible].
Rosa Badia: No, no [inaudible] this one. You have constraints also here, method
constraints. The method constraints establish the requirements on the task. Right now
this can be hardware or software, the requirement, but the idea can be that it can be
a service requirement. Also, [inaudible] I said already the task can have an SLA
associated. This SLA can then be translated from a high-level SLA to a
lower-level SLA, but then it's -- it's taken into account by the scheduler, SERA, which
takes into account the description, the semantic description, of the resources.
>>: Other questions? Okay. Let's thank the speaker. And that was the last talk.
[applause]