Judith Bishop: So good morning, everybody. I'm Judith Bishop, and I'm chairing this
session on Cloud Futures. We're delighted to introduce here Antonio Cisternino,
and he's going to be giving a paper for himself and his colleague, Mark O'Meara
[phonetic] on managing cloud infrastructure. They come from the University of Pisa,
and welcome to them.
[applause]
Antonio Cisternino: Thank you.
So today I'm going to give you a brief introduction to a system we're working on in Pisa
called Octopus. It is mostly about managing virtual machines. So it's not really about
cloud but more about how to manage the huge datacenters we heard about in the last
presentation today and also yesterday.
So virtualization is quickly becoming a key element in computing infrastructure for many
reasons. One of them is that you can make better use of your hardware: if you have one
server, you can partition your services into different systems. You can achieve better
usage of your CPU cycles, and so you're going to be more efficient with respect to
energy.
So Octopus was born from a collaboration with [inaudible], and the vision and the idea
is how to manage a cluster of virtual machines in an efficient way.
So what is cloud about? Cloud is about data; data is a really important aspect. It is
about services; you run services on the cloud. It is about heterogeneous systems,
especially if you run private clouds, because you should expect to have heterogeneous
hardware over time.
And reliability, which is important, because if people buy services from you as a service
provider, you must be reliable. It's about computing centers. So many of these things
are suitable for virtualization. And as Dan Reed told us this morning, virtualization is
widely available.
So we have very powerful computer systems that give us more horsepower than is often
needed, so that's fine.
And let's start with a demo. I know that this is pretty unusual, but it's also difficult to
define what Octopus is otherwise. So Octopus is something like this. This is a cluster
on our IT system [inaudible] in Pisa, and it is an HP blade system with eight blades, so
32 cores. And so we're able to run up to 32 virtual machines on top of that. Each node
has 8 gig of RAM. It's a fairly capable cluster.
So I log onto the system, let's say, using my account -- oops. And what I get is my
virtual machines. So the idea is that users get access to their own virtual machines and
they're able to manage them from the web.
So I can go here and say new. I want a Linux box with one core and one gig of RAM
called hello, and the machine name is [inaudible]. And the password is qwerty. I'm
going to remove this.
So what's going on? Now we are running on a distributed network of hypervisors
running Microsoft Hyper-V Server, and we are able to interact programmatically with
the Hyper-V infrastructure and allocate virtual machines dynamically on nodes.
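To make the programmatic interaction concrete, here is a minimal F# sketch -- not the
Octopus code itself -- that lists the virtual machines on one hypervisor node through
the standard WMI virtualization namespace of Hyper-V 2008; the host name is a
placeholder.

```fsharp
// Minimal sketch (not Octopus itself): list the VMs on one Hyper-V 2008
// node via the standard WMI virtualization namespace. "blade01" is a
// placeholder host name.
open System.Management

let scope = ManagementScope(@"\\blade01\root\virtualization")
scope.Connect()

let query =
    ObjectQuery("SELECT * FROM Msvm_ComputerSystem WHERE Caption = 'Virtual Machine'")
let searcher = new ManagementObjectSearcher(scope, query)

for vm in searcher.Get() |> Seq.cast<ManagementObject> do
    // ElementName is the friendly VM name; EnabledState 2 = running, 3 = off
    printfn "%s (state %O)" (string vm.["ElementName"]) vm.["EnabledState"]
```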
So here I had already created virtual machines, and here there is a third one, which is
the one I've just created. So under the hood here the operating system is booting. At
the end of the boot the image is configured so that it integrates with the Octopus
management database. So actually it provides all the data that you need for managing
the virtual machine.
And so every user in this way has access to their virtual machines and can suspend a
virtual machine, like I'm doing with this Win 7 box; and I can stop it, so unplug the
power; and I can migrate the virtual machine. So, actually, this migration button is more
for testing, because one of the features of the system is that the virtual machines are
running on a cloud: you don't really know where a virtual machine is running.
So actually when I click this button, which we use for testing, the virtual machine gets
live-migrated onto another node. And this is very important for several reasons, mostly
management reasons, because it allows system admins to move virtual machines
around and have access to the physical hardware without interrupting the services.
Moreover, you can also pack computations onto single nodes and shut down unused
nodes. So you can also be greener in your power management.
As you can see here, we were able to get this screenshot of the -- okay, this is the small
one. This is the screenshot of a running virtual machine. So actually these three dots
are the login of the [inaudible] server, as it is here for this one. And this virtual machine
is now up and running, so you can access it.
This is the screenshot. And if I go back and refresh [inaudible] machines, I can obtain
the IP address of my newly created virtual machine and I can log into it. So now I can
log in with my user name, [inaudible], and now I can type qwerty. Okay, qwerty. There
we go.
So this is a brand new virtual machine that has been created on the fly, and it's fairly
useful, because if you have to run a lot of computation, you can install images easily
with this. And at the end of the day, you can decide to turn it off -- oh, yes, I want to --
and delete the virtual machine, so you recover all the computing resources used by the
virtual machine.
So I'm going to unplug and turn off the machine and delete it. Okay.
So this is what Octopus is about. So the nice thing is that all the interface you saw,
which has been inspired by the Windows [inaudible] 7 series that is about to be
released, works in a standard browser, so you can also access and monitor your virtual
machines from a standard phone, and you have access to all the information you need.
Okay. So what is the architecture of the system? So the system has a number of
storage nodes here that can be single PCs or whatever. You can have many more of
them, not just one. So usually it's up to you to decide your own architecture. You can
use spare desktops or whatever, and then you have computing power which is provided
by, I don't know, 1U rack servers, desktops, or better infrastructure, blades.
And Octopus is simply software that coordinates resources. So actually all the software
we are using is production quality. I mean, it's Microsoft Hyper-V Server 2008, and
we've been able to use all the services you get from it.
Okay. So the software. The infrastructure has been realized using Windows Server
2008 Hyper-V, Windows Active Directory, and the Octopus services that we have
developed. And as for the guest operating systems, we have integrated them with the
system, which means mostly you have to do a last step at the end of the configuration
of a system to integrate it with the Octopus server, get your IP and everything. And
we've been able to run Linux, which is supported by Hyper-V; Win7, which has been
really [inaudible] because we're starting to run our administration offices on virtualized
boxes; and Windows Server 2008. And we've recently upgraded to Windows HPC
services, so we've been able to use the Windows HPC [inaudible] to share computation
on virtual machines using this.
So what is the structure of the system? So the system is mostly built on top of standard
interfaces, so these are robust: the WMI Hyper-V calls, which are available; shared
storage interfaces; and DHCP networking and DNS. So actually if you have the
Windows DNS you can use calls to configure nodes on the fly and have publicly
reachable virtual machines.
We are also using the tunnelling facilities to tunnel remote desktops and SSH facilities
into a private network. But, anyway, most of this is [inaudible] sites there or the
configuration.
Then there is this Hyper-F library, which is available on CodePlex and is a port of the
[inaudible] calls on top of F#. So actually everything has been implemented using the
F# language. And using this Hyper-F you're able to live-migrate. You can even do hard
checkpoints of running virtual machines. So if you have really long-running jobs, you
can hard-checkpoint them with [inaudible], which is around one
minute.
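As a rough illustration of the kind of call such a wrapper library makes (Hyper-F's actual
API is on CodePlex and may differ), saving the state of a running virtual machine
through WMI looks something like this sketch; the state code is taken from the v1
Hyper-V WMI documentation as I understand it.

```fsharp
// Hedged sketch of a "hard checkpoint": ask the hypervisor to save a
// running VM's state to disk. vm is an Msvm_ComputerSystem instance
// obtained as in the earlier listing; 32769 = saved state (an assumption
// based on the v1 Hyper-V WMI namespace, not the Hyper-F API itself).
let saveState (vm: ManagementObject) =
    let args = vm.GetMethodParameters("RequestStateChange")
    args.["RequestedState"] <- 32769us
    vm.InvokeMethod("RequestStateChange", args, null) |> ignore
```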
And then on top of that there is the Octopus system, which is the web-based interface
you've seen, and it tries to convey the idea that managing the resources should be
easy, not with a [inaudible] that assigns nodes manually and everything. So we really
try to do better.
So we relied upon a really nice feature of Hyper-V, which is differencing disks. So using
this approach, we've been able to pack many, many images into a single image,
because you can put one disk on top of another disk, a sort of copy-on-write. So a
newly created virtual machine requires only the few bits for the operating system to
start, and it creates the paging file and small differences on top of the base image,
which stays the
same for all.
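For instance, creating the per-machine copy-on-write layer on top of a shared base
image is a single WMI call. A sketch under the same assumptions as above, with
placeholder paths:

```fsharp
// Sketch: create a differencing VHD whose parent is the shared base
// image. Each new VM then writes only its own changes (paging file,
// small deltas); the base stays read-only and common to all.
let createDiffDisk (scope: ManagementScope) (basePath: string) (childPath: string) =
    use searcher =
        new ManagementObjectSearcher(
            scope, ObjectQuery("SELECT * FROM Msvm_ImageManagementService"))
    use svc = searcher.Get() |> Seq.cast<ManagementObject> |> Seq.head
    let args = svc.GetMethodParameters("CreateDifferencingVirtualHardDisk")
    args.["ParentPath"] <- basePath   // e.g. the Windows Server 2008 base image
    args.["Path"] <- childPath        // per-VM copy-on-write layer
    svc.InvokeMethod("CreateDifferencingVirtualHardDisk", args, null) |> ignore
```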
So actually a Windows Server 2008 instance costs us around 400 megabytes. And
since differencing disks can be layered, so you can build a differencing disk on top of
another differencing disk, we were also able to do snapshotting. So actually if you have
a virtual machine and you want to try something on that machine, you can do a
snapshot, have a new instance which is a photograph of the currently running machine,
do whatever you want, and backtrack if you're not satisfied with the changes made.
So there is a really old refrain that comes mostly from the Lisp community, which says
that memory management is too important to be left to programmers, versus memory
management is too important to be left to the system. So C programmers were against
automatic memory management by the system, as Lisp programmers were against
manual memory management.
At the end of the day, we can say that mostly now we have garbage collection, so
automatic memory management. And I think it is the same for virtual machine
management and system management, because as the numbers grow, humans tend to
be fault prone when they manage large numbers of objects. So we need systems that
take care of and schedule resources for us.
So management is important. And actually Octopus features three interfaces: the web
one you just saw; one for sysadmins, which will also be a web interface; and the last
one is F#, because since everything has been implemented in F#, you can use F#
Interactive, which allows you to interactively call F# functions to manage your virtual
machines in a sort of shell. And then you can write programs using F#.
So actually you can manage and schedule your virtual machines as much as you like.
It's up to you to decide the best way to do it.
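A hypothetical F# Interactive session, with invented function names since the real
Octopus API lives at octopus.codeplex.com, might look like this:

```fsharp
// Hypothetical F# Interactive session; function and property names are
// invented for illustration, not the real Octopus API.
#r "Octopus.dll"
open Octopus

let vms = Cluster.connect "octopus-head-node" |> Cluster.listVms

// suspend every VM belonging to one user, in one line of shell work
vms
|> List.filter (fun vm -> vm.Owner = "some-user")
|> List.iter Cluster.suspend
```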
So we're still working on Octopus. We've built the infrastructure, but the next step will
be to implement a policy manager. So actually we're looking for policies to manage the
virtual machines, such as: I want a user to be able to instantiate at most ten virtual
machines that run at most 15 days, and after 15 days they will be automatically erased,
so deleted, and you recover the resources; or, whenever a non-server virtual machine
goes idle, suspend the virtual machine so you can recover computing
cycles.
So virtual machine status, user-based policies, and sysadmin-based policies are the
inputs, and the actions you can take are shutting down machines, deleting machines,
and moving machines across nodes, for instance for packing computations to
be more power efficient.
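A sketch of how such policies could be written down; the types and thresholds here are
illustrative only, since the policy manager was still future work at the time of this talk:

```fsharp
open System

// Illustrative types and rules only; the real policy manager was still
// being designed at the time of this talk.
type Vm =
    { Name: string; Owner: string; Created: DateTime
      IdleSince: DateTime option; IsServer: bool }

type Action = Suspend of Vm | Delete of Vm

let evaluate (now: DateTime) (vms: Vm list) : Action list =
    [ for vm in vms do
        // rule 1: VMs older than 15 days are erased, reclaiming resources
        if (now - vm.Created).TotalDays > 15.0 then yield Delete vm
        // rule 2: an idle non-server VM is suspended to recover cycles
        match vm.IdleSince with
        | Some t when not vm.IsServer && (now - t).TotalHours > 1.0 ->
            yield Suspend vm
        | _ -> () ]
```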
So the scheduler will be the most critical part of this, and we are attempting to
implement an intelligent scheduler. We're trying to implement non-trivial policies and
move machines around, not just for fun but because machines are really expensive.
So we recently performed some green computing experiments. We did a nice job in
measuring how much energy a [inaudible] requires to run. And during this experiment
we ran into a pretty interesting fact. So actually these are graphs that show you that the
average power absorption -- so these are watts, this is the input of an algorithm --
changes if you are using one, two, three or four cores of a processor.
So actually it turns out that the waste of energy if you use just one core on a 4-core
machine is up to 10 percent. So actually it's very expensive having standard machines
idling. Until this wonderful parallel programming takes over, we still have to deal with a
lot of sequential programming, and virtual machines are a viable way to have many of
them running on a quad-core machine, up to four, and the benefit you're getting is that
you're using your power resources better.
Okay. So Octopus is mostly a virtual machine scheduler, which is -- sorry, go back --
built on a mostly production-quality base; it's just a scheduler. So virtual machines are
accessible through the standard Microsoft interfaces, so you can start using them, and
in the worst case you still have ordinary virtual machines if you simply give up trying to
use the system.
And it mostly eases the management of cloud computing resources, because actually,
since we got this system up and running, the number of virtual machines we created
has been incredible. I mean, I never dreamt of doing it manually, but actually when I do
development, I have a preconfigured image with Visual Studio 2010, and then I create a
differencing disk on top of that, do my development, and turn it off and create a freshly
installed machine whenever I need one. And we believe that it may contribute to
achieving better usage of computing resources.
So with this, I'm done. And the system is actually open source on
octopus.codeplex.com if you're interested.
Judith Bishop: So we have time for a couple of questions. Questions? Yes?
>>: Excuse me. How many virtual machines can you [inaudible]?
Antonio Cisternino: How many?
>>: How many virtual machines?
Antonio Cisternino: There is no upper bound, because actually we use [inaudible] code
that allows you to manage a Windows Server hypervisor remotely, so actually you can
instantiate as many host machines as you want, configure them into the scheduler, and
the scheduler simply does remote calls. So it's up to you to [inaudible] the system, but
there are no implicit bounds on the number.
>>: [inaudible].
Antonio Cisternino: Sorry?
>>: [inaudible].
Antonio Cisternino: No.
>>: [inaudible].
Antonio Cisternino: Yeah, I mean, we looked at systems like those. So actually the
original goal of Octopus was to build an experimentation setup to do smart scheduling
for power-consumption reasons. So we were interested in laying the groundwork of an
infrastructure that was programmable.
>>: [inaudible].
Antonio Cisternino: Yeah. So the idea is that actually with a few lines of F# code you
can say: for each virtual machine, do something, and so on.
>>: [inaudible].
Antonio Cisternino: Not in this way. We have more access to underlying systems.
Thank you.
[applause].
Judith Bishop: Okay. So we're going to continue this session and we're still in Italy, just
down the road from Pisa, we're going to the University of Bologna and here we have
Fabio Panzieri who's going to tell us about quality of service aware clouds. Thank you.
Fabio Panzieri: Thank you.
Good morning. I'm going to report on an exercise, an experimental exercise we're
carrying out in my department with a couple of colleagues of mine -- actually, two former
students, both of them, of mine -- on adding quality of service tools and software within
cloud computing environments.
This talk is organized this way. I will firstly motivate why we're doing this and then I'll try
to clarify what we mean by quality of service in cloud computing and in particular what is
the role of the service level agreements in this context and what is the earlier work on
which we base our current research.
Then I will illustrate the architecture we've proposed, and I will eventually tell you about
the experimental evaluation results we have come up with, which is what I'm mostly
interested in discussing with you.
And, finally, I will conclude this talk by highlighting some of the future developments
that we think are relevant in this particular context.
We are all familiar with the notion of cloud computing, particularly after the second day
of this meeting, and so I will not indulge in discussing the software as a service or
platform as a service or infrastructure as a service notions. What I think is relevant to
say is that, as cloud computing is essentially what is summarized in this line by Ian
Foster [inaudible] -- that services are delivered on demand to external customers over
the internet -- it's worth stressing that quality of service is becoming a crucial factor for
the success of cloud computing providers.
If the cloud computing environment does not deliver the expected quality of service,
then the reputation of the quality of service -- sorry, of the cloud computing provider --
can be tarnished, and then there can be, of course, financial losses.
The motivation essentially is that. By quality of service in a cloud computing
environment, we mean compliance with the service level agreements that an application
using a platform constructed out of cloud computing resources obtains from that
particular infrastructure.
So in this context we usually talk about response time, throughput, error rate, and
parameters such as these, but there are also non-functional requirements that can be
considered in assessing quality of service. And these include scalability or availability.
As far as this particular exercise is concerned, we addressed mostly response time as
the quality of service guarantee we wanted to evaluate. However, what I maintain again
is that, to the best of my knowledge, quality of service in cloud computing has not yet
been sufficiently investigated, although we have observed a sort of growing interest in
both the industrial research community and the academic research community on this
particular issue of [inaudible] provision.
So we come from the distributed systems community, and the basis of our work relates
to a recently terminated project which had to do with providing quality of service support
in a distributed computing environment constructed out of clustered application servers.
Essentially we tried to reuse the approach we had in that project within this new context
of cloud computing.
In addition, we are looking very carefully at the results that a project currently funded by
the European Community, called Reservoir, is producing. I shall briefly summarize
these two projects just to set the context.
The TAPAS objective was that of developing a family of middleware services that could
make Java 2 Enterprise Edition technology QoS aware, that is, capable of meeting SLA
requirements, service level agreements.
So to this end, what we did was essentially to extend one particular implementation of
Java 2 Enterprise Edition, which is called JBoss and is open source, where three
principal additional services -- a configuration service, a monitoring service, and a
load-balancing service -- are incorporated in the platform, and essentially they manage
the platform dynamically. As the load on the application hosted on the platform grows,
the configuration service enters into action and reconfigures the platform to cope with
the augmented load.
And, in contrast, when the load diminishes, then resources added to the platform are
released. So the aim of this particular architecture is to optimize the use of resources in
a distributed computing environment, the same sort of principle we wish to apply to
cloud computing. And in particular, we do that -- well, I'm sure you know this -- we do
that by adding those services to the architecture that is being proposed within the
context of the Reservoir project.
Reservoir is a project which is led by IBM and looks at providing support for cloud
federations. In addition, one of the further aims of this project is providing
interoperability and business service management.
The architecture proposed by Reservoir is that of a system structured in three
hierarchical levels of abstraction: a top level known as the service manager, which is
responsible for deploying the application on the basis of what they call a service
manifest, which is a new version of a service level agreement; and the service manager
is implemented on top of what they call a virtual execution environment manager, which
is responsible for coordinating the distributed virtual execution environment hosts,
which are basically the various resources deployed on the single nodes in a distributed
system, in a distributed environment.
At the moment, as far as I know, only the lowest level and the virtual execution
environment manager have been implemented. The project is still ongoing. I think it
will terminate in a couple of years. At least it's due to terminate in a couple of years.
Our approach is to extend that architecture to incorporate in the service manager those
services that we developed in the TAPAS project, that is, this middleware that looks
after configuring the platform, controlling the compliance with the service level
agreement established by the application hosted on the platform, and monitoring the
current execution of the application.
We wanted to evaluate this architecture, basically, and see whether we've got it right or
wrong, and so we examined our architecture in a scenario which is made out of a
pool -- I'm looking for a pointer, but I can't find it; no, there isn't. We're assessing the
architecture in a scenario in which we have a pool of available -- I'm assuming free, not
used -- virtual machines that can be instantiated and executed on demand.
Each virtual machine comes up with a fixed quantity of resources -- CPU, RAM,
storage -- and can execute scalable services on a pay-as-you-go accounting principle.
In our exercise, what we wanted to do is basically assess the cost of allocating
resources -- allocating virtual machines, basically -- in order to get some indication as to
what would be the better configuration and load-distribution policies we could deploy in
a context such as this one. And in particular, we would like to devise dynamic
configuration policies that do not violate the service level agreement.
Evaluating our architecture in this scenario has turned out to be quite difficult. On one
hand because of the actual complexity of implementing our architecture, which requires
some time and some investment that we didn't have available at the moment we went
into this exercise; and in addition, we didn't have available a cloud computing
infrastructure that we could use for our purposes. So we evaluated our architecture
through a simulation exercise.
And we implemented a prototype of the system. Using a request generator and a
response generator, we obtained some performance results that I'll show you in a
moment. So basically it is, as I said, an initial exercise that has provided us with these
results.
We assume that an application is deployed on the platform using some service level
agreement that specifies its own quality of service requirements, and in the negotiation
that occurs in practice, it is usually agreed that the service level agreement can be
violated for a certain percentage of time. We assume that that percentage is five
percent. So what is called SLA efficiency equal to 95 percent means that it is
acceptable that the SLA be violated for five percent of the time.
We also assumed that the allocation time of a virtual machine is two seconds. This is a
very short allocation time. It is reported in the literature -- there is this paper by
Sotomayor [inaudible] who reports that the allocation time can reach even 400
seconds.
But we were not currently interested in designing policies for bootstrapping virtual
machines quickly; rather, we wanted to see whether we could stick to an SLA as far as
the response time of the application goes. We assumed that the virtual machines
essentially sit on standby, and they can be configured into the platform on the fly.
So this is the first set of results that we obtained. We have a number of nodes which is
very limited; it goes from 1 to 3 only in this first exercise. We set the response time to
200 milliseconds. As you can see, the green line indicates the threshold: when the
response time goes over that threshold more than five percent of the time, then the
platform is reconfigured. And the threshold is below the actual response time
negotiated in the SLA, because we maintain a margin just to guarantee that the SLA is
not violated at all. Usually violating an SLA means penalties -- economic penalties --
for the provider.
So in order to make sure that the SLA is not violated, and at the same time in order to
prevent the provider from over-provisioning resources just to stick to the SLA, we use
this threshold, which is just below the actual negotiated response time.
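A minimal sketch of this reconfiguration rule, using the figures from the talk
(200 milliseconds negotiated, five percent tolerated violations) and an assumed
180-millisecond internal threshold:

```fsharp
// Sketch of the reconfiguration rule described above. 200 ms and 5% are
// the talk's figures; the 180 ms internal threshold is an assumed margin.
let negotiatedMs = 200.0
let thresholdMs = 180.0   // kept below the negotiated time as a safety margin
let tolerated = 0.05

/// Given a non-empty window of measured response times, grow or shrink the pool.
let reconfigure (samplesMs: float list) (nodes: int) =
    let violations = samplesMs |> List.filter (fun t -> t > thresholdMs)
    let rate = float violations.Length / float samplesMs.Length
    if rate > tolerated then nodes + 1                    // allocate one more VM
    elif violations.IsEmpty && nodes > 1 then nodes - 1   // load dropped: release one
    else nodes
```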
As you can see, we used a load growing up to 90 requests per second, with an SLA
limit of 100 requests per second, and the blue line that you see on this slide indicates
when the virtual machines are allocated and released according to the load that is put
on the application. This is the left-hand side graph.
On the right-hand side diagram you see the violation rate. And under these particular
circumstances you can see that the five percent SLA efficiency limit is maintained. The
violation rate is always below the five percent which is admitted.
Then we thought that the number of virtual machines that we were allocating is really
very small, so we augmented the number of machines available up to 13 and
maintained the same requirements as before, that is, a violation rate below five percent
and response time below 200 milliseconds.
Even in this case you can see that the violation rate never goes above the five percent,
which is as expected, and the resources are added to the platform and released when
necessary as the load varies.
Out of this exercise -- these are just very initial results, as I said -- we felt that they
appear to us quite encouraging. That simply says that the design approach appears to
be adequate.
However, there are a number of problems that remain open that I think are interesting to
investigate, and I'm quite happy I found in this workshop a sort of confirmation: some of
the things that you see in these slides were actually written before yesterday, and I
found out yesterday that they were confirmed by the presentations that I heard.
So one thing I wish to point out is that if we go for a dynamic configuration approach,
then when we are going to manage a large number of virtual machines, we may hit a
scalability problem -- not necessarily in the platform, but in collateral subsystems. If the
virtual machines share a database, the database itself might become a bottleneck if you
have a large number of these machines.
Of course, one can replicate the database and distribute the load across the replicas,
but then you hit the problem of maintaining coherence amongst the various replicas. So
I think this is a problem that deserves attention and further investigation.
In addition, there is this observation that we made and I guess we share: the virtual
machine allocation time can be very high and may lead to SLA violations. So this goes
to say simply that it's necessary to investigate virtual machine management and
allocation policies that can prevent violating possible service level agreements.
So we are planning to do further testing of our architecture using a real cloud as a test
bed. One of the candidates is OpenNebula, which is an open source cloud, but we
hope that maybe after this workshop even Microsoft Azure might become available for
some experiments.
One of the things we would like to look at is extending the range of quality of service
requirements we wish to consider, and in particular we would like to address
dependability requirements such as fault tolerance or security. I think that analytical
modeling of cloud computing environments, and consequently of our architecture, will
be extremely useful to understand where to invest in order to ameliorate and make
better progress on quality of service aware cloud computing environments.
And we would like to look at the issues of cloud federations, and in that particular
context we think that issues of trust and trust management will become particularly
relevant. And, as was pointed out yesterday as well in one of the invited keynote
presentations, we also think that the integration of cloud computing with mobile devices
and services is one of these very challenging scenarios that is worth investigating.
That ends my talk.
Judith Bishop: Well, thank you very much for that interesting talk, including a lot of
performance figures which I think are important for all of us to know about.
Questions?
Yes?
>>: There scaling up and down your VM applications, do you this for increments of one
machine? And do you think it would be useful to do it more aggressively or would
other -Fabio Panzieri: What do you mean by more aggressively?
>>: For example, if you get really large load spikes and you really need, let's say, to
increase the number of machines [inaudible] and a view of the load. Is that something
that you're looking into?
Fabio Panzieri: Yes. In fact, it is something that we did in the other project, as part of
the exercise we did before, when we implemented the configuration service. We did
augment the number of nodes that were brought into the platform in much larger
numbers than that.
>>: So if you look at it for [inaudible] offerings, you have a lot of different instance types
you can acquire, so are you also looking into those performance modeling or [inaudible].
Fabio Panzieri: This is what we would like to do if we can find the people that can
actually work on that.
>>: [inaudible].
Fabio Panzieri: I think it's one of the very interesting topics.
Judith Bishop: Other questions? Okay. Great. Well, we'll thank our speaker, then,
and --
[applause]
Judith Bishop: We're going on to the next talk, which is a change in the program. The
next speaker is actually Sarunas Girdzijauskas from --
Sarunas Girdzijauskas: It's a very complicated name, I know. Girdzijauskas.
Judith Bishop: There you go -- from Sweden. So if you want to switch around, now's
your chance to move.
Okay. Welcome to the third speaker in the session on systems and infrastructure. We
have Sarunas from the Swedish Institute of Computer Science in Stockholm, and he's
going to talk to us about cognitive publish/subscribe for heterogeneous clouds. Over to
you.
Sarunas Girdzijauskas: Thank you.
So my talk will be a little bit different from what we heard before, because it is about the
current paper that we're working on at our institute with my Ph.D. student Fatemeh
Rahimian, who did most of the work in simulating and experimenting with the system.
So I'll not be as broad, but we'll go a bit into the more specific problems of this cognitive
publish/subscribe system which we use for heterogeneous clouds.
Okay. So what is the future of the clouds? Although we saw these last two days that
the future is with Microsoft, Amazon, Google, we have slightly different thoughts.
Maybe part of the future of the clouds will belong to decentralized architectures. It's not
that surprising, because imagine how many good computing devices we have with us --
laptops, phones, iPads, whatever -- and, what's more important, they're getting better
and better connected with each other. And that brings a new range of possibilities
where we could come with our devices and ask this cloud for resources, but also
contribute our idle resources to the cloud.
So we think that in the future there will be this collection of connected devices forming
various microclouds which will be extremely heterogeneous, with very different
computational capacities, different bandwidths, and different costs between them, and
we'll have to take that into account.
So one of the main building blocks for such a decentralized architecture is a
publish/subscribe service, and this is a very broad concept: you can imagine that users
would like to subscribe to their favorite TV channels for IPTV, or maybe scientists would
like to subscribe to certain Large Hadron Collider data. It's a very, very broad concept,
but it has to work.
And the whole system, if you want to embed it on top of this heterogeneous cloud, has
to adapt to the topology, adapt to the existing connectivity, to use the cheapest paths.
And not only that, it has to be cognitive and understand what the user patterns are. Of
course, people who watch TV will probably have different patterns than scientists who
are subscribing to some scientific data.
So our focus is on publish/subscribe systems which have a very large number of nodes
and a very large number of topics to which you can subscribe, and which reside in
heterogeneous environments. And basically we propose our solution for those cases
where central solutions will not scale.
Of course, there are many tradeoffs to consider, and you can say, why not simply make
an overlay for every topic? If the user wants to subscribe to a certain channel, we make
an overlay for every topic the user subscribes to.
However, in this case we might not be very scalable, because with a growing number of
interests per user, you would need unlimited bandwidth or an unlimited neighbor set,
and in this case we might end up scaling badly.
The other extreme is to simply flood, or do some kind of flooding of, the events or the
data through the system, but then we have the problem that there will be many relay
nodes involved which will not be able, or will not want, to cooperate. Because
remember that the amounts of data grow very fast, and if you have to relay some
gigabytes of videos which are not [inaudible], especially if you might need to pay your
provider or something, you might reconsider that.
So we have to take into account these issues, together with the dissemination delay --
how fast you can get the notification when the publisher issues it -- and, of course, what
is the cost to get it. And that's, I guess, very important for the internet providers,
because so far -- I guess here as well -- you pay a flat rate for the use of the internet,
but the internet providers have actual costs associated with each of their networks. So
this also has to be taken into account.
So we will try to make a cognitive publish/subscribe system which has a fixed
bandwidth, a fixed node degree; we will not let it grow indefinitely. However, we would
like to scale to any number of nodes or any number of topics, and basically to let the
system, in a decentralized fashion without any global control, discover the underlying
topology, find the cheapest paths in terms of bandwidth and cost, and minimize the
number of these relay nodes using subscription correlation patterns, because it is
known that users do tend to have correlated interests.
If you look at it from a bigger perspective, the work that we've done is this: if we see
these heterogeneous clouds as a physical network with specific properties, we try to
build on top of it a cognitive overlay, which takes into account all these properties that
we want, and then, using the connectivity of this overlay, we build efficient
dissemination structures for each topic. Since we use only the links of this overlay, and
this overlay is scalable by definition, all our dissemination structures will be scalable as
well.
So how do we do it? We employ a very nice gossiping technique for building these
overlays. For those who don't know, gossip is just a very lightweight, robust, and
scalable mechanism where peers talk to each other -- talk only to their neighbors in the
local vicinity -- and exchange information about the world. Usually they exchange the
views of their neighbors, and then this propagation spreads and the peers can start
forming different structures.
So with this type of mechanism, we build our overlay. And it's maybe a bit more
complicated than that, but in a nutshell it works like this. A node starts with a simple
view of the neighbors -- it has some arbitrary neighbor set -- and it meets some other
node. They exchange these views, they merge them together, and then this merged
view is ordered at each peer by its own preferences.
And these preferences are the most important thing here, because we can use some
kind of ranking function. If we want to cluster peers with similar interests, a peer might
rank the peers in a way that prefers peers with the most similar interests, or maybe
peers which have the cheapest connection to each other.
And if we repeat this -- I forgot to say: once we have this merged view, every peer cuts
a chunk of the new set down to the limited neighbor set that every peer assigns to itself;
basically, that's how we scale. And repeating this process, in a very short number of
rounds -- usually logarithmic in the population size -- we can converge to very nice
structures where we have clusters of these peers, of these nodes, which are similar to
each other. And this way, when they're similar, we can
disseminate any events in these clusters very efficiently and cheaply.
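In sketch form, one such exchange might look like this; the types and the fixed view
size are illustrative, not the paper's actual design:

```fsharp
// One gossip exchange, in sketch form: merge the two views, order the
// result by a node-local preference, keep only the best c entries.
type NodeId = int

let gossipStep (rank: NodeId -> float) (c: int) (myView: NodeId list) (peerView: NodeId list) =
    myView @ peerView
    |> List.distinct
    |> List.sortByDescending rank
    |> List.truncate c   // the bounded neighbor set keeps the degree fixed
```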
So just to illustrate: say we have such a network and there are, say, peers interested in
the red topic. We apply this gossiping in the background, and as a ranking function we
rank the peers with a similarity metric where we take, for example, the ratio between
the intersection and union of the subscription sets of two peers. In that way, if the peers
have identical subscriptions, they will have similarity 1; if there's no subscription in
common, they will have zero; if they have some common interests, it will be something
between 0 and 1. And if every peer does this and keeps for itself only those neighbors
with the highest similarity metric, they will eventually converge to a structure where, for
example, the red nodes form a cluster, the white nodes form a cluster, and, maybe with
some peers in between, the green nodes form a cluster. It might be that some won't
form a cluster because they are not similar enough, but we do our best with the local
knowledge to
cluster them.
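The similarity metric just described, written as a ranking function that could plug into the
gossip step sketched above:

```fsharp
// |A ∩ B| / |A ∪ B| over the two peers' subscription sets, as described:
// identical -> 1.0, nothing in common -> 0.0, otherwise in between.
let similarity (a: Set<string>) (b: Set<string>) =
    let union = Set.union a b
    if union.IsEmpty then 0.0   // neither peer subscribes to anything
    else float (Set.intersect a b).Count / float union.Count
```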
Not only that, we can also take into account the link cost. So, for example, if this link is
very expensive, or starts to be very expensive, this background gossiping mechanism
can rewire it, and the green topic will point to the other peer, which is maybe cheaper.
We can also, with the same mechanism, cluster peers not only by link cost and interest
similarity, but also prefer clustering those topics which have a very high publication rate
and are very popular, because we would like to be more efficient for the topics which
are expensive and very popular; those which are not that popular maybe you can deal
with somehow differently.
And let's not forget: we always keep the degree of the graph, basically the bandwidth
for every peer, limited. So every peer can decide for itself, and we don't need to grow
linearly with the number of topics that we're interested in.
Now, there are problems with that. Because we have this limited node degree, it is
inevitable that some clusters will be disjoint. They will be connected through some other
peers, but just because of this limit, they will be disjoint. And if we want to publish
events there, we will have to ask some other nodes for relay traffic, but we have to do it
very carefully and involve as few of them as
possible.
Now, how to do it? Basically, at this point most of the similar approaches stop, because
either they cluster and then say that, well, you know, because of the correlation we
might expect that our bandwidth will not grow very high -- but they don't give any
guarantees -- or they just involve many peers in between.
Well, we do it a bit differently. Having this overlay graph, in which there is a nice
clustering phenomenon, we embed it into an identifier space, and in that way we make
navigation in this graph possible, where any peer can find another peer just by greedy
routing. We basically build a navigable small-world network, for those who are
interested; it's based on the work of Jon Kleinberg, who first proposed it.
What's even nicer is that we do it with the same gossiping mechanism. We don't
employ another new technique, we just change the -- oh, sorry, I'll talk about this a bit
later -- but basically, for embedding the structure, we just apply a slightly different
ranking function in our gossiping technique.
And once we have it, when we know how to navigate from any point to another point,
then it's pretty -- well, I'll show you, it will be pretty easy to connect these topics and to
publish on them efficiently.
So how do we build this navigable structure with the gossiping? We assign to every
peer a random number, say some ID from some identifier space -- it can be one-
dimensional or several-dimensional -- and adapt the ranking function in a way that it not
only chooses the neighbors with similar interests or cheap cost, as I showed before, but
also, for one or two links, prefers neighbors which have a very similar ID to its own.
And when you do this recursively, after several rounds everyone is connected to the
closest neighbors in terms of ID and, in a sense, makes a ring. And it's not a very nice
ring here, it's a bit twisted, but in general it is a ring, and a ring allows us to navigate
from any node to another node. So we're for sure approaching the target.
There are some other things to consider. In order to be more efficient, we can put in
these small-world-style fingers; I won't go into details but just say that with this type of
technique we can assure that navigation in the network will be
polylogarithmic in the size of the population.
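A sketch of greedy routing on such a ring, with the neighbor table abstracted as a
function; the identifiers are illustrative:

```fsharp
// Greedy routing sketch: each node forwards towards whichever neighbor
// is closest to the target in the circular identifier space.
let distance idSpace a b =
    let d = abs (a - b)
    min d (idSpace - d)   // distance on the identifier ring

let rec routeTo idSpace (neighbors: int -> int list) target current =
    if current = target then current
    else
        let next = neighbors current |> List.minBy (distance idSpace target)
        if distance idSpace target next >= distance idSpace target current
        then current   // no neighbor gets closer: this node hosts the target
        else routeTo idSpace neighbors target next
```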
So basically we can build, with this gossiping technique, an efficient navigation
structure. And once we have all these, now three types of links together -- the friend
links, let's say, with which you cluster because of similarity; the ring links for navigation;
and the long-range links for efficiently reaching the neighbors -- then it becomes pretty
easy. We just say: let's assign for every topic some rendezvous node. And usually it's
easy: you have a topic name, and from it you can get some ID from this identifier space
that we use, and then greedy-route from every cluster to this rendezvous node, and
along the route we ask every node that we traverse to be involved in publishing this
topic.
We don't create any new links. These are the same links, and the node degree doesn't
change, but just in that way we can connect these components and we basically gain --
we retain this clusterization phenomenon in the system while being able
to navigate and connect them.
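In sketch form, the rendezvous rule might be as simple as this; a real deployment would
use a stable hash rather than .NET's GetHashCode:

```fsharp
// Every node hashes the topic name to the same point of the identifier
// space and greedy-routes there; that is where the clusters meet.
// A stable hash (e.g. SHA-1 of the name) would replace GetHashCode.
let rendezvousId idSpace (topic: string) =
    (topic.GetHashCode() % idSpace + idSpace) % idSpace   // non-negative id

// e.g. routeTo idSpace neighbors (rendezvousId idSpace "red") myId
```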
Yes.
>>: [inaudible].
Sarunas Girdzijauskas: No, actually it gets through this node.
>>: [inaudible].
Sarunas Girdzijauskas: I don't know whether it's a very -- it's just for presentation
purposes, but let's say you route from No. 3, and No. 3 knows that 5 is the rendezvous
point. It would ask: which is the closest neighbor of mine -- 35, 28, 20, or, maybe here,
I guess, some small number -- and probably the small number is closest to 5. And
when it comes here, then it will ask who is closest to 5, and it will say, oh, I already
have 5, and I come here. And every node does the same. So this 4, for example, also
tries to get to 5, and it sees that 8 is the closest, goes here, and reaches the 5. That's
the nice thing about the navigable structure which you have.
Because whenever you embed the graph into the identifier space, on top of your initial
connectivity you can exploit this greedy routing, and greedy routing is basically
minimizing the distance to the target. And since you know the target, you can reach it.
>>:
Sarunas Girdzijauskas: So let's say that a topic is disconnected and that topic has some
name. So if you devise a rule that every peer hashes the topic name and gets as the
hash value some kind of number -- let's say in this case 5 -- and everyone does it, and
everyone can do it independently, then everyone knows independently that for the red
topic, if you want to find it, you have to go to 5, and then at 5
they all meet.
There are some particularities I don't want to go into in more detail -- you can have
several routes from every cluster in order to be more robust, or how we find it. There
are details. But in this short talk I cannot -- if you want, we can talk offline, but basically
we do account for all of it.
So in this case all the topics become connected. And now we can simply flood any
event within its topic, and then we are assured that we will involve only the nodes that
are interested and a few other nodes which it is inevitable to involve in order to connect
the disjoint components.
Since this is ongoing work, we don't have the full results so far, but if you're interested, I
have some graphs, and the first experiments are really promising. We were working
with many synthetic data sets, which actually required a little bit of work on how to
synthesize the user subscription patterns and which method to use. We also used -- we
[inaudible] a Twitter data set and used the graph there of which people follow which
others, and from that we induced what kind of subscription correlation is embedded
there.
For churn we used Skype churn data that is available on the internet. And our
experiments showed that, in some cases, compared to existing approaches like Scribe
or Bayeux -- those that do have a limited node degree but don't take into account the
underlying network and the underlying similarity of the peers -- we have even up to a
10-fold reduction of relay traffic. We involve a much smaller number of nodes. And if
you think that this would translate to some gigabytes of data that you have to transfer, it
can be a very big saving.
So to wrap up: we are working on this large-scale pub/sub for heterogeneous
environments, and the main idea is to make this cognitive overlay where we form
clusters of similar nodes in a decentralized fashion using gossiping, and by doing that,
this overlay eventually converges to the most efficient paths that exist in the network.
So it could be very nice for the internet providers, basically, if they deploy this system: if
you can somehow measure the cost of every link, they would converge to the least
expensive paths to disseminate the data.
And, of course, since this gossiping is always running in the background, whenever
anything changes -- if there is an environment change -- we can always adapt and
make new connectivity. And we showed that the convergence is very fast and it is
pretty robust to churn and failures.
So with that, I finish, and I thank you very much. I'll be happy to take some questions.
>>: Well, thank you very much for that most unusual talk.
And are there questions? Yes, two there. You first.
>>: I'm a little confused about your use of Gossiping. If you look into the routing
literature [inaudible] usually Gossiping doesn't currently deliver it for information
because [inaudible] how do you deal with that?
Sarunas Girdzijauskas: So, actually, that's why we built this structured overlay on top of
gossiping, in that sense. With gossiping, we actually build a structure. And even if
you --
>>: How do you build a structure if you are not under control? You have a distributed
system.
Sarunas Girdzijauskas: You mean that there can be peers which maliciously drop the
packets?
>>: Yes.
Sarunas Girdzijauskas: Well, okay, at the moment we don't deal with that. So, I mean,
I think this is pretty orthogonal research which has to deal with -- I mean, there's a lot of
research in the peer-to-peer community on how to isolate, how to overcome the peers
which decide not to conform to the existing rules. So in this respect we assume that the
nodes will play according to the rules. If you want to implement it in real life, of course,
you would need to take care of all these issues.
>>: I have a second question. What's your greedy [phonetic] function? Do you have
some greedy routing implemented -Sarunas Girdzijauskas: So once you have the identifier space, when every node has
a ->>: Distance [inaudible].
Sarunas Girdzijauskas: Yeah, exactly. So in that case you minimize the distance
between the identifiers, in a sense.
Judith Bishop: There was a question here?
>>: Yeah. So [inaudible] is that a static structure or is it a dynamic structure [inaudible].
Sarunas Girdzijauskas: Exactly. If nothing changes in the system -- if the costs are the
same, if the peers don't change their interest patterns or interests -- then eventually it
will converge and will stay the same. But somebody at any moment in time can say,
okay, I'm not interested in this topic anymore, I am interested there; or there's new
connectivity, a provider installs a new fiber and then it's cheaper somewhere; then the
structure will change, adapt, and again converge to some state. Of course, in reality,
everything changes all the time, so you always adapt. That's why it's adaptive.
>>: Okay. Well, I think we must stop there because our last speaker is ready and
poised.