>> Jim Larus: It's my pleasure today to introduce Krzy Ostrowski, I'm not even
going to try the first name, who is graduating from Cornell this year, one of Ken
Birman's students, and has done some very interesting work on programming
abstractions, programming models for large scale distributed systems. And he'll
tell us about this work today.
>> Krzysztof Ostrowski: Thank you. I'm very grateful to be here and to present to
you, and I hope that you enjoy my work.
Okay. So let me start by introducing the concept of a paradigm-agnostic Web.
This has served as a practical motivation for my research. Today's Web
technologies are strongly rooted in the client-server paradigm. We assume that
content and services always reside on servers in the data centers. The clients don't
participate in delivering the service or storing the content, and they mostly don't
talk to each other; they only talk to the server.
In a paradigm-agnostic Web, we would have much more architectural flexibility.
So, for example, the content could be stored only on the clients. Those currently
accessing it can form a replicated state machine. Whenever one of them
makes a change, that would be disseminated in a peer-to-peer fashion using
reliable multi-cast or a similar technique.
New clients could synchronize with existing ones. So in the extreme case, the
clients might not ever actually need to talk to the server anymore.
Then, of course, in practice we would be most interested in the hybrid techniques
where you have elements of both peer-to-peer and client-server kinds of interactions, where
the server could be used to discover the clients, to self-organize, to persist the
content when the last client is about to depart, and so on.
Now, to show you that this idea makes practical sense, I developed a platform.
Here's a little demo of the platform in action. On the left you see a user interactively
composing a shared document. All the things he is dragging in are just links to live
Web content. On the right, the user opened a second replica of it. As you see,
whatever we do on one replica is instantly propagated to the other one. You can
have multiple such replicas. Here both are on the same machine, but they can be in
different places over the network and synchronized across the network using
distributed protocols.
So this is a little bit reminiscent of Google Wave or Chrome OS in the sense that
all the data you see exists in the network, and the main difference is that this is
a paradigm-agnostic document. So any part of it could use any type of
distributed protocol as a way of sharing state, including peer-to-peer techniques.
Here's another example. This time I have a three-dimensional shared document,
and again everything that the user is dragging is connected to a live data source.
For example, those buildings are connected to multi-cast streams that carry color
values. And those are used to paint the buildings' roofs. And the planes are
connected to streams of coordinates that could come, say, from GPS devices
on real airplanes, so you have a little augmented reality system.
Those multi-cast streams are themselves draggable. For example, in a
second you'll see the user open a plane, extract a multi-cast stream out of it,
and drag and drop it into the window to connect it to the camera, so you can have all
sorts of possibilities like this.
Now, in practice, the choice of the distributed protocol will usually depend on the
type of content. So, for example, the list of objects in a shared
document could be stored in a Web service, but then the contents of this little
chat window could be stored using replication techniques. Likewise, the list of objects
in the space could be stored using replication, but then the coordinates of
a plane would be delivered using a gossip protocol. In a paradigm-agnostic
application, each element could use whatever communication and storage
architecture best matches the workload patterns, the quality of service
requirements or the characteristics of the environment over which this particular
component is replicated.
Now, there are two types of benefits and challenges in pursuing this idea. On
one hand, by incorporating peer-to-peer techniques into existing applications, we
can potentially improve scalability, latency. We might even lower the resource
footprint of the datacenter because the clients could do some work for us.
Then, of course, the challenge is to maintain predictable quality of service, and
security in a peer-to-peer setting can be harder.
On the other hand, if we managed to somehow seamlessly incorporate all sorts
of distributed protocols into the modern Web development tool chain, then we
can reap the productivity benefits that these modern tools provide in building
arbitrary types of distributed systems. Not only those that are based on the client
server paradigm.
And so what I'm personally most interested in and what I'd like to focus on in this
talk is how we can generalize the existing Web development tools, software
engineering practices, languages, to help us manage complexity in this more
general distributed system setting.
I think that the main source of complexity in distributed systems is what I call an
endpoint-centric perspective. Today, we generally describe the behavior of
systems in terms of what happens at an individual node, a single communication
endpoint, which we often model as a finite automaton in terms of low-level
primitives such as receive a message, send a message, transition state, fire a
timeout and so on.
If we program at this level, most systems are very hard to express even if you
have a cutting-edge programming language like Overlog, as you see in this example.
For Paxos you would need 40 rules like the one shown at the bottom. And
that's true even for a version of the protocol that's described in a research
paper and, of course, in the real world we have much more complexity to worry
about. The Internet is heterogeneous. Consider this example set of eight users
and two servers. And suppose they're all involved in either accessing or
replicating and storing the content of that little chat window.
And suppose they want to coordinate their [indiscernible] using reliable
multi-cast. Now perhaps these four clients are on a campus network, so they can
locally take advantage of IP multi-cast. These three guys are on a wireless
network, so they could ideally use forward error correction, and those guys on the
right would be in a data center, so maybe they have some super fast enterprise
message bus available. And across the three domains a peer-to-peer overlay
would be the most efficient communication architecture.
Now, it's clear that in order to make the best use of resources in real systems, we
should be able to build those systems in a modular fashion by combining
protocols, different kinds of protocols deployed in different regions into a single
unified substrate.
And doing this in the endpoint-centric perspective is more difficult, because
when you look at this picture, it is evident that there are some hierarchical
relationships, and those become obscured once we translate everything into C++.
So the question becomes, how should we structure our language to make it
easier to work with such relationships?
But if you look at this picture, it is clear that these three regions in the middle,
blue and green, serve as components.
Every region recursively delegates some work to the regions contained in it.
The protocol that runs in each region is like the code that implements that
region. So you can think of it almost like recursive function calls or a recursive
data structure, except these regions are not objects or functions in a formal sense
in a particular language.
Intuitively, it's also clear that we could replace the protocol in a region, for example,
to take advantage of IP multi-cast in the green region, without necessarily changing
the semantics of the entire system.
And that would be intuitively a little bit like replacing a member field in a data
structure with some other object that implements the same interface in a slightly
different but equivalent manner.
So we would like to be able to take advantage of those intuitions. And in the next
section I'm going to describe a programming model that was inspired by this
>>: What do you mean by recursion there?
>> Krzysztof Ostrowski: In the sense that a region has some work to do,
for example, recover from losses. And if you look at this structure, a
region contained in it has essentially the same work to do. In a sense, the
same type of work is being done, and that can continue recursively.
In my model, an object is formally defined as a set of software components
distributed across the network. The components that make up an object are
called replicas. So here we have an example with three users accessing a
shared document, and two objects. Replicas of the application object are instances
of the user interface that displays the document on the screen, and the
replicas of the multi-cast object are instances of a reliable multi-cast protocol used
to disseminate the updates between the clients.
And pairs of replicas located on the same machine communicate via method calls
and call-backs, which we can model abstractly as events.
By the way, an object could have infinitely many replicas. Formally it contains all
the replicas that exist right now or will exist in the future. Now, let me pull the two
objects to the sides.
Now, all the events they exchange are shown as green arrows in the middle.
Formally the set of all events of the same type flowing between a pair of objects
is what I call a distributed data flow or flow for short.
So, for example, the send flow will be the set of all multi-cast requests generated
by the application: the set of all those green arrows, all the data.
And just like objects, such flows could span over infinite time and infinitely many
locations since formally flow contains all the events that are occurring right now,
occurred in the past or will occur sometime later.
Now, with these definitions we can represent the architecture of our system in a
way that resembles a Petri net. This new flow you see at the bottom is the set of
all call-backs that carry the received updates from the multi-cast layer into the
application.
And now we can think of the multi-cast object as a sort of function that
transforms one input argument, the send flow, into one output argument, the
receive flow. Almost like in Java, except that whereas in Java we would
transform integers, Booleans and other primitive values, here we're transforming
entire distributed flows. Here's a possible signature for the multi-cast object
in my language. It resembles a function definition with the input flows listed in
parentheses and the output flows following a colon.
The name of a flow is always preceded by tab, just like you can see. The type of
the flow determines the types of values that are carried in each event. So here
we assumed that the updates would be represented as strings. So that's the
The type also contains optional behavioral constraints. So, for example, receive
was declared to be consistent and fresh.
Those behavioral constraints tell us something about the spatial and temporal
patterns of events in the flow. And to give you a taste of what this may mean,
formally we model each event as a quadruple: the location and time where it
occurs, a version k and a value v. For example, in the receive flow the version might be
the sequence number of the received update and the value would be the payload,
the content.
With this formalism, now we can define different behavioral constraints by relating
pairs of events that occur in different times and places.
For example, I say that a flow is consistent if for any pair of events, no matter
where or when they occur, if they have the same version they also have the same
value.
It makes sense that the receive flow would be consistent since you wouldn't want
two completely different updates delivered with the same sequence number.
In general, the consistent flows in my model will usually carry information about
various decisions that have been somehow made in a coordinated manner in the
protocol. A special kind of consistent flow is a monotonic flow where a larger
version implies a larger value. As you will see later, monotonic flows carry
information about decisions that have been somehow remembered by the
system. Persistently.
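As a rough illustration of these definitions (not the notation used in the talk or the paper), events can be modeled as location/time/version/value records and the two constraints checked over a finite sample of a flow. The names Event, is_consistent and is_monotonic are hypothetical, values are assumed to be sets ordered by inclusion, and "larger" is read non-strictly.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Event:
    location: str      # node where the event occurs
    time: float        # when it occurs
    version: int       # k, e.g. a sequence number
    value: frozenset   # v, here a set of update identifiers


def is_consistent(flow):
    """Same version implies same value, regardless of where or when."""
    return all(a.value == b.value
               for a, b in combinations(flow, 2)
               if a.version == b.version)


def is_monotonic(flow):
    """Consistent, and a larger version implies a larger (superset) value."""
    return is_consistent(flow) and all(
        a.value <= b.value
        for a in flow for b in flow
        if a.version < b.version)


# A small finite sample of a flow (illustrative data only).
flow = [
    Event("alice", 1.0, 1, frozenset({1})),
    Event("bob", 1.5, 1, frozenset({1})),
    Event("bob", 2.0, 2, frozenset({1, 2})),
]
```

A real flow may span unbounded time and locations, so a check like this can only examine a finite prefix of it.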
I'll explain what that means later. And more of such constraints are defined in the
paper. If time permits, I will show you how we can use this to drive compiler
Are there any questions at this point?
>>: What is fresh?
>> Krzysztof Ostrowski: Fresh. It basically means that nobody is lagging behind.
If somewhere in the network an event with a certain version
appeared, then on every correct node at least that or a higher version will also
appear.
Now, meanwhile, let me explain how we can model larger systems using this
approach. So here's an extended version of our previous example. Suppose
that the multi-cast protocol now has a modular structure. First it disseminates updates
using some unreliable method. That's taken care of by component U. Then
there's loss recovery, that's component C. I can model the two as two new
objects, violet and green. Likewise, suppose that the replicas of the loss recovery
object self-organize using membership information that's obtained from some
cloud service, and we can identify the protocol for contacting that service and
model it as this new orange object in the middle, and so on.
And you end up with a sort of extended diagram in which different objects span
across different sets of locations.
And from now I'm going to always represent flows as those yellow sandwiches
here. Now, if you look at this diagram, you see that flows in this model serve as
typed contracts that decouple objects from one another, very much the same
way as typed interfaces decouple Web services in an ordinary workflow.
Now this diagram may be useful as a way of modeling systems but it's not an
executable program. What I'm going to do now is I can zoom in on portions of
this diagram and recursively decompose each object into smaller parts, to the
moment where everything in the diagram is expressed in terms of a handful of
basic building blocks.
Let me zoom in on loss recovery as an example. Now, my contract says that
there's a single input flow carrying into the object information about what updates
have been received at the different nodes across the system. And my job is to
generate forwarding decisions and to decide when updates are stable everywhere.
I'm going to model each update. I'm sorry. I'm going to model each event as
carrying a set value, set of integers, where each integer represents a sequence
number of one update.
Here, suppose those two events appear in the received flow.
The one at the replica of loss recovery on machine Bob means that Bob has just the updates
from one to two and from four to seven, and the one at the replica on machine Alice means that Alice
has all the updates from one to three and eight. Now, by the way, I'm going to
always assume, here and in all parts of the model, that whenever a flow is carrying a
type of information, every event carries complete information.
So here every event will contain the identifiers of all the updates that have
been received, even if they've already been reported once.
Doing this little trick makes it much easier for me to reason about progress in the
system. I can always express progress as monotonicity, as I'll explain later. And
this is something that can be very easily taken care of by the compiler, so the
user doesn't have to bother about it.
Okay. So we have those set values in the object now on different replicas. One
thing, as I mentioned, that loss recovery has to do is tell us when updates are
stable across the system so we can do the cleanup.
Having that information, we can just calculate a set intersection. The result
would be one and two, and that's what should appear as events in the stable
flow.
So this is very simple. And it's actually one of the building blocks in the model.
Aggregation can have different flavors; the particular kind used here is
described by the blue keywords at the bottom. Now I'll explain those in a minute.
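The Bob and Alice example can be worked through in a few lines of Python. This is only a local, single-process sketch of the arithmetic, not the distributed aggregation protocol itself; the names received and aggregate_intersection are made up for illustration.

```python
from functools import reduce


def aggregate_intersection(values):
    """Intersect all reported sets; assumes at least one report."""
    return reduce(lambda a, b: a & b, values)


# What each replica reports as received (from the example in the talk).
received = {
    "bob": {1, 2, 4, 5, 6, 7},   # updates 1-2 and 4-7
    "alice": {1, 2, 3, 8},       # updates 1-3 and 8
}

stable = aggregate_intersection(received.values())
print(sorted(stable))  # prints [1, 2]
```

Only updates one and two are held by everyone, so only those can safely be cleaned up.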
Meanwhile, let me just give you some intuition how we can model other parts of
this object. So suppose that we aggregate events in the received flow using a set
union operator. Well, that will produce the set of all updates that are present
anywhere in the system, because we'll add up all the different sets of identifiers that
everybody has reported. And now by subtracting the values in the received flow from those,
we can calculate the flow of information about which updates are missing on
which nodes. You see how in this fashion we can successively build up, by
applying such operations, the forwarding requests that can be carried out of the
And now if you compare this representation of a distributed protocol with the
code in C++ or even rules in Overlog, clearly this is at a much higher level. We can
clearly see what kinds of information are involved in the protocol and how the
protocol makes decisions. Those relationships are presented without
cluttering them with various low-level details, communication details and other
such. At the same time this is a constructive representation, because we have a
very small number of those building blocks and we know how to implement them.
And this is also something we can formally reason about.
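The union and subtraction steps described above can be sketched the same way. Again this is a local illustration under assumed names (received, present, missing), reusing the Bob and Alice sets from the earlier example; it says nothing about how the values are actually communicated.

```python
# What each replica reports as received (same illustrative data as before).
received = {
    "bob": {1, 2, 4, 5, 6, 7},
    "alice": {1, 2, 3, 8},
}

# Aggregating with set union gives every update present anywhere.
present = set().union(*received.values())

# Subtracting each node's own set gives what that node is missing.
missing = {node: present - have for node, have in received.items()}

print(sorted(present))            # prints [1, 2, 3, 4, 5, 6, 7, 8]
print(sorted(missing["bob"]))     # prints [3, 8]
print(sorted(missing["alice"]))   # prints [4, 5, 6, 7]
```

Each line corresponds to one box in the object flow diagram: an aggregation, followed by a purely local set difference.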
I'm not going to present too many formulas in this talk, but most of it is described
in the paper. Aggregation is a transformation on flows where for every event in
the output flow beta, shown on the right here, we have some finite set of events in the
input flow alpha that happened earlier, such that the value in this event
is an aggregate of the values in those events. So that's important. Now the
different kinds of aggregations, the different physical protocols or different formal
flavors of aggregation, will of course differ in the way we select those input
events.
For the kinds of aggregation that we care about, we can characterize this
selection in a slightly more explicit way. And without going too much into detail,
we can characterize aggregation by two families of functions: memberships and
selectors.
Intuitively, for every event in the output flow, memberships first tell us what are
the locations that participated in the aggregation from which this event was generated,
what nodes have communicated to generate that event.
Then for each of those locations the selectors tell us what was the version of the
event that this particular location has contributed. And since in this model
a location and a version determine the value uniquely, that's all the information I
need to completely write out all the values in the output flow in terms of
the values in the input flow, as shown by the formula at the top.
Details aren't that important. But what I want to point out is now having such
formal characterization, I can distinguish different flavors of aggregation by
placing various constraints on the memberships and selectors.
So, for example, I'll say an aggregation is in-order if the selectors are monotonic.
Intuitively, that basically means that whenever I aggregate I'm using all the latest
information I've got. Or a complete aggregation will be one in which once a
nonfaulty node makes it into the membership, it will be in every membership
with a higher version. That means nobody is left out without good reason, and so on.
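One way to picture memberships and selectors is as lookup tables keyed by output version. The sketch below is an assumption-laden illustration, not the formula from the slide: alpha maps (location, version) to an input value, membership[k] is the set of locations that took part in output version k, and selector[k][x] is the input version that location x contributed. All of these names are hypothetical.

```python
from functools import reduce
from operator import and_


def aggregate(alpha, membership, selector, k, op=and_):
    """Write out the value of output version k from the input values."""
    inputs = [alpha[(x, selector[k][x])] for x in sorted(membership[k])]
    return reduce(op, inputs)


def is_in_order(membership, selector):
    """Selectors are monotonic: no location ever contributes an older
    input version to a newer output version."""
    ks = sorted(membership)
    return all(
        selector[lo][x] <= selector[hi][x]
        for i, lo in enumerate(ks)
        for hi in ks[i + 1:]
        for x in membership[lo] & membership[hi]
    )


# Illustrative input flow: location "a" reports twice, "b" once.
alpha = {
    ("a", 1): {1}, ("a", 2): {1, 2},
    ("b", 1): {1, 2, 3},
}
membership = {1: {"a", "b"}, 2: {"a", "b"}}
selector = {1: {"a": 1, "b": 1}, 2: {"a": 2, "b": 1}}

print(aggregate(alpha, membership, selector, 1))  # prints {1}
print(aggregate(alpha, membership, selector, 2))  # prints {1, 2}
print(is_in_order(membership, selector))          # prints True
```

Different flavors of aggregation then amount to different predicates, like is_in_order, placed on these two tables.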
Now, with all those formal definitions, I can now reason about the object flow
diagram, so about the system's behavior, abstractly, without having to worry
about the specifics of the particular protocol. I only need to know what flavor of
aggregation a given protocol implements to know how it transforms my flows.
And theorems like the one shown on the bottom, of which there are many more in the paper,
basically become rules of type checking.
For example, here if I know that the input flow is of a certain type, I think this is
weakly monotonic, and the aggregation is guarded, coordinated and so on, then I
know that the output flow is strongly monotonic. I can apply such a theorem as a
type checking rule for every edge in my object flow diagram. By doing so for all
the edges, I can convince myself that the system has the sort of behavior I
wanted it to have.
>>: I'd like to ask you a question, so you could apply this kind of rule to the
abstract diagrams that you had previously. But can you infer these properties
from [indiscernible] implement or could you check that it implemented these
properties correctly?
>> Krzysztof Ostrowski: Presumably you could in the future do that, infer the
properties from the code. I don't know how to do it. But I envision -- it might
require some assistance. Maybe annotations in the language or special
language features to infer those.
But I believe that's possible.
Okay. Now aggregation is a basic building block in many systems. For example,
in map [indiscernible] data frameworks like DryadLINQ and so on. The kind of
aggregation that we need here to construct distributed protocols as a building
block is a little different.
In particular, one especially important kind of aggregation that's used to express
almost anything has a notion of memory. And let me explain what it means using
an example. So suppose we have three replicas of an aggregation object
consuming those values shown on the bottom, calculating a set
intersection and producing the value at the top, transforming the flow received into the flow
stable as I described before.
Now suppose the machine on which the value is calculated suddenly crashes.
Meanwhile some other node arrives and joins the system, participates in the very
next aggregation.
And suppose that this node is totally new. It hasn't caught up with everyone else,
hasn't received any update. So the value that appears locally in the received flow,
reported by the other layer in the system, is an empty set. No updates.
And that, of course, causes the result of the aggregation to also be empty. And
then we have a problem, because previously we reported that items two and
three are stable. Maybe some other part of the system already acted upon that
information. And now we changed our mind.
But something stable should remain stable forever. So this is
In order to use an aggregation object in our system, the object needs to
somehow remember what it has reported in the past and make sure that
whatever it reports in the future is consistent with the past reports.
Now, if you recall, every event in this model has a version. And if a flow is
monotonic, a higher version implies a larger value. So that would be set
inclusion in this case. So if I know that the result of the newer aggregation after
the crash is tagged with a higher version number and if I know that my protocol
produces a monotonic flow, then I know that this problem will not occur.
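The crash scenario above can be replayed in a small single-process sketch. It assumes one simple way of adding memory, namely taking the union of each new raw result with everything reported so far; the class name RememberingStable is hypothetical, and the sketch does not address how that memory would survive the crash of the node holding it, which is the part the actual protocols have to solve.

```python
class RememberingStable:
    """Wraps a raw aggregation result so that reports never shrink."""

    def __init__(self):
        self.version = 0
        self.reported = frozenset()

    def report(self, raw_result):
        # Each report gets a higher version and a value that contains
        # everything reported before, so the output flow is monotonic.
        self.version += 1
        self.reported = self.reported | frozenset(raw_result)
        return self.version, self.reported


stable = RememberingStable()

# Before the crash: the intersection says updates 2 and 3 are stable.
v1, r1 = stable.report({2, 3})

# After the crash a brand-new node joins with an empty set, so the raw
# intersection is empty, but the remembered report is still honored.
v2, r2 = stable.report(set())

print(v1, sorted(r1))  # prints 1 [2, 3]
print(v2, sorted(r2))  # prints 2 [2, 3]
```

The higher version still carries a value that includes the earlier one, which is exactly the set-inclusion form of monotonicity described above.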
And the theorems like the ones I've shown you in one of the previous slides
actually can help us to prove that certain flows are monotonic. And help us
understand exactly how we need to construct our aggregation protocols to
produce the desired behavior.
And this particular kind of aggregation, the monotonic aggregation is very easy to
implement if you have access to a membership service, which was the case in
our previous example, if you remember.
Now, one aspect of monotonic aggregation that's quite unique that I want to
elaborate on is the possibility of implementing it in a scalable manner on a
heterogeneous system without needing a single global membership service.
Consider this example group of ten nodes spread across these three
administrative domains. I'm going to align them at the top there. And in the
middle I'm going to draw the components that run on them. So every vertical
segment of the screen will be the set of components running on one of the
Now, suppose that on a campus there is an aggregation object. Every node on
the campus network has a replica of it and those replicas self organize, using a
local membership service that's deployed to manage the campus.
I'm going to assume that on every administrative domain bigger or smaller there's
one such membership service deployed to manage it and that there's some
mechanism in place that allows a node to automatically find what is the locally
deployed service and register with it.
Now, consider the part of the received flow confined to the campus. This is shown
here. This flow is transformed by the campus membership -- I'm sorry, by the
campus aggregation object to produce that flow at the top, information about
which bits are stable across the campus.
So we have one input flow, one output flow, one object, and no matter how many times the
nodes on the campus crash and recover we still have a single output flow
with a single property: monotonic.
Now, suppose the same happens in the other two domains, at the airport and in
the cloud. So in each domain nodes will locally self-organize using a local
membership service. This one and this one.
And form a single aggregation object which will transform some part of the received
flow to produce a single output flow of information about what is stable across
that domain.
Let's say we have three stable flows. And now further suppose that in every
domain there is a selected leader node, and that leader might actually change
dynamically, it doesn't really matter. The selected leader passes this local
information further to a higher level in our infrastructure. So we're going to have
such a hierarchical implementation.
These three leaders then use membership service deployed for a wide area
network to self-organize into a single yellow object, aggregation object of the
same flavor. By the way, here each of those four objects can now be
implemented in a completely different manner using whatever mechanisms are
the most appropriate for that particular domain.
The only thing I need to assume is that all these aggregations are of the same flavor.
This is the part where we can introduce heterogeneity into the architecture. I'm
going to refer to every such heterogeneous aggregation hierarchy as an aggregation
network.
So now the input flow for the yellow object is the union of the output flows from all
the domains below it. And the output flow is information about what is stable
across the entire system.
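For the intersection-based stability example, the two-level structure can be sketched as follows. The domain names come from the talk, but the node names and the sets they hold are invented for illustration, and the sketch ignores failures, leaders and membership entirely; it only shows that aggregating per domain and then across domains gives the same answer as one flat aggregation.

```python
from functools import reduce


def intersect(values):
    """Intersect all sets in an iterable; assumes at least one set."""
    return reduce(lambda a, b: a & b, values)


# Ten nodes in three domains (illustrative data only).
domains = {
    "campus": {"n1": {1, 2, 3, 4}, "n2": {1, 2, 3},
               "n3": {1, 2, 3, 5}, "n4": {1, 2, 3}},
    "airport": {"n5": {1, 2, 4}, "n6": {1, 2, 3}, "n7": {1, 2}},
    "cloud": {"n8": {1, 2, 3, 4, 5}, "n9": {1, 2, 3, 4}, "n10": {1, 2, 5}},
}

# Level 1: each domain aggregates locally into its own stable value.
local_stable = {d: intersect(nodes.values()) for d, nodes in domains.items()}

# Level 2: the leaders' values are aggregated by the top-level object.
global_stable = intersect(local_stable.values())

# Reference: a single flat aggregation over all ten nodes.
flat = intersect(v for nodes in domains.values() for v in nodes.values())

print({d: sorted(s) for d, s in local_stable.items()})
print(sorted(global_stable), sorted(flat))  # prints [1, 2] [1, 2]
```

Because intersection is associative and commutative, each level could compute its part with a completely different mechanism and the result at the top would be unchanged.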
Now using our approach, I can model this entire dynamically reconfiguring
hierarchy using a single object flow diagram as I've explained before and every
input flow is the union of the output flows of the aggregation below it.
And this diagram is very simple. We have a very small system with only three
domains now. In general, of course, nodes might fail, reboot, new nodes might
arrive. Those new nodes might reside in different administrative domains. So
entire new domains might be introduced to this hierarchy, new levels in the
hierarchy might be created and so on.
We can still model such a dynamic, arbitrarily nested hierarchy as an object
flow diagram, except this diagram may be infinite. It may be infinitely tall and
every element in it can have infinitely many children.
And those infinitely many children come from the fact that if new nodes keep
arriving from new domains, they get appended to the hierarchy as part of the
infrastructure. That's where the infinite children come from.
Fortunately I have proved that any such hierarchy constructed out of monotonic
aggregations, potentially different protocols, each of which implements monotonic
aggregation, in fact behaves as if it were a single global system-wide monotonic
aggregation.
That's good news because that means that whenever I need to use monotonic
aggregation as a building block in my system, as was the case in, for example,
loss recovery protocol, you've seen earlier, I can always implement the building
block in a hierarchical manner using the approach I just outlined as a hierarchical
aggregation network.
And it turns out that monotonic aggregation, the same building block you've seen,
can be used to express many commonly used protocols such as distributed
locking, leader election, various forms of agreement. It seems that we should
also be able to model the classical consensus protocol in this fashion, although we
have to work out some details. So potentially for a very wide range of systems
we can implement them in a scalable heterogeneous manner by following this
>>: Seems like if you had a cloud, let's say didn't want to be involved in all the
peer-to-peer transfers, all the high bandwidth streams, but it was just sitting there
to do just ordering, basically, caches and maybe to decide what is the -- wouldn't
that make this problem a lot simpler, because it would determine the order. It
would determine what is stable.
>> Krzysztof Ostrowski: Right. You could use that. You could use the cloud only to
help you manage membership or you could use the cloud for ordering. Of course,
then, the implementation of each aggregation object would be much simpler.
>>: Why wouldn't you want to do that?
>> Krzysztof Ostrowski: Potentially sometimes you may want to do work at the
client side, right? Because maybe, I don't know, some group of clients is on a
network where they can really very efficiently talk to each other, and it is just faster
for you to organize this communication in a peer-to-peer fashion, coordination in
a peer-to-peer fashion.
>>: So can you give me an example of a case where you couldn't wait for the
server to time [indiscernible] as --
>> Krzysztof Ostrowski: Well, one classical application where people use these sorts of
protocols is, for example, Wall Street.
So in those applications you might want to do that. There are also many other
scenarios. Actually, our funding comes, part of it, from the Air Force. There
are a lot of scenarios in which most of the time you just don't have uplink
connectivity to the datacenter, and they just want to run their collaboration. In
fact, they want to have Web applications without having the Web, right?
And these are the types of applications where you're forced to use peer-to-peer
approaches and act just like this.
Okay. Now just to conclude this section, let me show you a little bit of a
performance result. This is a simulation of a system built using the methods that
I described earlier from the model. It's a simulation, but it's realistic, so we have
actual point-to-point message exchanges, nodes form token rings, tokens
circulating around the ring do aggregation and so on. They all independently fail and
reboot. After rebooting they have a clean state, have to reinitialize and so on. The
system size varies from 62 to 32,000 nodes. The mean time to reboot is shown
on the X axis, I'm sorry, the mean time to failure. The mean time to reboot is five
And we simulate a simple barrier synchronization protocol where we want nodes
to enter subsequent phases of processing in a coordinated fashion so no two
nodes are more than one phase apart.
That is expressed as a single monotonic aggregation, and that aggregation is
then implemented as an aggregation network in a hierarchical manner with lots of
services managing the different regions and different layers.
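One plausible reading of "expressed as a single monotonic aggregation" is that the aggregated value is the minimum phase number over all nodes, which can only grow because phases only increase. The sketch below assumes that reading; try_advance and the schedule are invented for illustration, and the minimum is computed directly rather than through an aggregation network.

```python
def try_advance(phase, node):
    """Let a node enter its next phase only if nobody is behind it."""
    if phase[node] == min(phase.values()):
        phase[node] += 1
        return True
    return False


phase = {"a": 0, "b": 0, "c": 0}
# An arbitrary order in which nodes attempt to move on.
schedule = ["a", "a", "b", "a", "c", "a", "b", "b", "c"]

max_gap = 0
for node in schedule:
    try_advance(phase, node)
    gap = max(phase.values()) - min(phase.values())
    max_gap = max(max_gap, gap)

print(phase)    # prints {'a': 2, 'b': 2, 'c': 2}
print(max_gap)  # prints 1
```

In a distributed setting a node would see a possibly stale aggregate, which can only be lower than the true minimum, so acting on it delays nodes but never lets them drift more than one phase apart.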
And each aggregation object in the hierarchy aggregates ten times per second,
but then of course it takes time for the information to surface to the root.
So we measure the round-trip time: how long it takes for the system overall
to enter the subsequent phases of processing. And that gives us an estimate of
round-trip time on the Y axis.
One thing that we notice is the latency within which the system works is
logarithmic. You see every time we grow by a factor of eight, latency grows by
the same amount. This is consistent with what we would expect theoretically.
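A back-of-the-envelope way to see why latency would grow logarithmically: if information has to climb one level of the hierarchy per aggregation step, latency tracks the depth of the tree. The fanout of eight and the power-of-eight system sizes below are assumptions chosen to match the "factor of eight" remark, not parameters reported in the talk.

```python
def levels(n, fanout=8):
    """Depth of a balanced hierarchy with the given fanout covering n nodes."""
    depth, capacity = 0, 1
    while capacity < n:
        capacity *= fanout
        depth += 1
    return depth


for n in (64, 512, 4096, 32768):
    print(n, levels(n))
# prints:
# 64 2
# 512 3
# 4096 4
# 32768 5
```

Each eightfold increase in size adds exactly one level, so a constant amount of extra delay per step of growth.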
The second thing we notice is that this structure is quite resilient to churn. Even at a churn rate
where half the system is reconfiguring at a given time, we can work with a 20
percent performance penalty. And that's because every aggregation object in the
hierarchy basically handles its own failures independently. So when a single
node fails, it affects only a small part of the system and doesn't destabilize the entire network,
as would be the case if you used a global membership service. Across this 32,000
node system, all these failures would cause global membership changes. You
could batch them, but by the time you start to batch them, it would increase the
response time and you wouldn't handle failures in a timely fashion and wouldn't be
able to do any work. There are more results in the paper, but I don't think I have
that much time.
>>: So this is what -- [indiscernible] aggregation.
>> Krzytof Ostrwoski: Right, this is a balanced tree. In reality you'd have to
work hard to build a balanced tree. That's a part that I didn't evaluate in this paper.
Okay. So here's an example of a partial implementation of the loss recovery
object you've seen before. And you will recognize the object signature at the
top; I've explained it before. Now, for the body, one way to think of it is
essentially as a textual representation of this object flow diagram you've seen
before, shown below here.
Now, for now please disregard the grayed-out statement and just focus on those
three highlighted assignments. Every occurrence of an arithmetic operator, or
anything that looks like a function call in general, is an in-place declaration of
an embedded object. You have three such occurrences, three embedded objects, two
of which are aggregations.
We can declare internal flows like we would declare local variables.
By the way, the order of the statements doesn't matter in this program because
these are not one-off statements that run to completion. There's nothing
sequential about this program. Basically, these are all just declarations, so
they are implicitly concurrent with respect to one another.
And every time we pass some arguments to an operator or consume the result,
we declare a flow dependency. So we declare those black arrows in the
Now, notice that we don't need to specify the type of aggregation. Certain things
are assumed. Most of these properties are taken for granted unless you
specifically say that you don't want them. One of them is implemented by the
grayed out statement. I'll talk about it in a minute.
And finally notice that we never talk about membership here. It's implied. In
order to implement certain types of operators, for example monotonic aggregation
like the one used here, you either need to use a global membership service to
self-organize, or you have to construct a scalable aggregation network like I
showed you before, and then you use lots of membership services for that.
And anyway, there's no need to burden the programmer with any of this. The
program doesn't really care whether you have some membership in the system
as long as all the flows are transformed in the way you want them to.
So we consider this an implementation detail that's taken care of by the compiler
or even at the time of deployment. Now, just very briefly, what it means to
translate such a program, to compile such a program, is basically to generate an
object graph like the one on the right, where the root component is the replica of
the object you're defining. Whenever you have some embedded object, you create a
replica of it; since these are building blocks, the code for them is just
pregenerated for you.
And the structure of the program determines how events are going to flow around
this infrastructure within each replica. I'm not going to go into detail, but
basically you can think of production rule systems like those based on the
[indiscernible] algorithm, where you have graphs of elements and every time one of
them gets updated, that causes other elements to get updated accordingly.
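The idea that statements are declarations of flow dependencies, and that an update to one element triggers updates to the elements that depend on it, can be sketched in Python as follows. The `FlowGraph` class and its method names are hypothetical and are not the platform's API; the sketch also assumes the dependency graph is acyclic.

```python
from typing import Callable, Dict, List, Tuple


class FlowGraph:
    """Toy dataflow graph: a rule recomputes its target when an input changes."""

    def __init__(self) -> None:
        self.values: Dict[str, int] = {}
        self.rules: Dict[str, Tuple[Callable[..., int], List[str]]] = {}

    def declare(self, target: str, fn: Callable[..., int], inputs: List[str]) -> None:
        # A declaration only records a dependency; nothing runs here.
        self.rules[target] = (fn, list(inputs))

    def update(self, name: str, value: int) -> None:
        self.values[name] = value
        self._propagate(name)

    def _propagate(self, changed: str) -> None:
        for target, (fn, inputs) in self.rules.items():
            if changed not in inputs or any(i not in self.values for i in inputs):
                continue
            new = fn(*(self.values[i] for i in inputs))
            if self.values.get(target) != new:
                self.values[target] = new
                self._propagate(target)  # assumes the graph is acyclic


def build(order: List[str]) -> Dict[str, int]:
    """Declare the same two rules in the given order, then feed the same events."""
    decls = {
        "total": (lambda a, b: a + b, ["a", "b"]),
        "scaled": (lambda t: 2 * t, ["total"]),
    }
    g = FlowGraph()
    for name in order:
        fn, inputs = decls[name]
        g.declare(name, fn, inputs)
    g.update("a", 1)
    g.update("b", 2)
    g.update("a", 5)
    return g.values
```

Calling `build(["total", "scaled"])` and `build(["scaled", "total"])` yields the same values, which mirrors the point that the order of declarations does not matter.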
Okay. Now back to our language. Just like in any high level language we have a
conditional statement, the where clause. The argument here in parentheses is a
boolean flow. It can be any type of boolean flow; this one is represented by the
highlighted expression. And the way it works is that whenever a true event appears
in the boolean flow at my replica, locally the body of the where clause gets
activated; whenever a false event appears or when there was no event, the body gets
So in a typical language, control -- conditional statements tell us which chunk of
the code gets executed. It's a binary yes or no. Here the conditional statement
tells us which subset, which subset of replicas is running that code. So in this
case which subset of replicas is doing aggregation.
To deactivate the body of code basically means to suspend all the local flow of
events. In case of aggregation that means that the local replica will not submit its
own local values to the aggregation; it will abstain from the aggregation. So the
meaning of this entire statement here is that the only replicas that will vote over
which updates are stable are replicas that first know the latest value in the stable
flow and have all the updates that are considered stable.
Which makes sense since if some node doesn't have the updates that the
system considers to be stable, then it can't consider itself to be part of the
system, right? It has to abstain from election, from aggregation.
So conditional statements like this can be used to model all sorts of join
conditions that you have to put the new members through to ensure that they're
synchronized with everyone else.
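A small Python sketch of a guarded aggregation may make this concrete. It is not the talk's language: the `Replica` fields, the `participates` guard, and the assumption that every replica learns the aggregated result after each round are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    received: int                 # highest update number held without gaps
    known_stable: Optional[int]   # latest value seen in the stable flow, if any


def participates(r: Replica) -> bool:
    """Guard: vote only if the stable value is known and already covered."""
    return r.known_stable is not None and r.received >= r.known_stable


def stable_round(replicas: List[Replica]) -> Optional[int]:
    """Min-aggregate over the replicas whose guard is true; others abstain."""
    votes = [r.received for r in replicas if participates(r)]
    if not votes:
        return None
    stable = min(votes)
    for r in replicas:  # assumption: the result is disseminated to everyone
        r.known_stable = stable
    return stable


if __name__ == "__main__":
    group = [Replica(10, 5), Replica(7, 5), Replica(0, None)]
    print(stable_round(group))   # the newcomer abstains, so it cannot drag the result down
    group[2].received = 8        # the newcomer catches up
    print(stable_round(group))   # now it votes without regressing the stable value
```

Without the guard, the newcomer's `received = 0` would pull the minimum below a value the rest of the system already considers stable.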
Okay. For the end let me show you one complete example. Very simple
distributed locking protocol implemented using leader election.
We have a single input flow, a boolean flow, wants. A true value on a given
location in that flow means that the particular node wishes to acquire the lock,
and a false value means the node no longer needs the lock.
Now, when an event with a true value appears at the replica this causes this
body to be locally activated, which means that the replica of stable elect on that
node will get activated, which means that we will participate in aggregation, I'll
show you in a second.
So once we communicate our intent to acquire the lock, our local unique identifier,
this is just a built-in flow, gets submitted into that replica. And once we get
elected as a leader, we remain elected as leader. Again I'm going to explain it in
a second.
So once we find out that the leader identifier is the same as ours, we can respond
to the application that yes, we've got the lock now, and we know that we will
continue to be the lock holder until the moment where we just give it up ourselves.
And here's the implementation of the scaleable election protocol. So we have
single input flow carrying the identifiers of the candidates for election. We elect a
leader by just picking the minimum identifier of the candidates and report it back.
Now, to ensure that the result of election is stable, as I said we need to, I'm going
to request that if my -- if I know what is -- I can only participate in election if I
know who is the current leader and my own identifier is not smaller than the one
of the leader.
If my identifier was smaller and I participated in the election then I would
overthrow the leader, I would become the leader. I would violate this condition if
I wanted. So this join condition ensures the new nodes coming to the system
will not cause the change of the elected leader until the moment when the leader
gives up.
And so this is actually a correct protocol. The only problem is that it can lead to
starvation but that can be fixed. I'm not going to go into detail today.
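The locking protocol can be approximated by a round-based Python sketch. Several things here are assumptions rather than part of the talk: the `elect` and `run` names, the synchronous rounds in which every node sees the previous result, and the rule that any node may participate when no leader is currently known.

```python
from typing import Dict, List, Optional


def elect(wants: Dict[int, bool], current: Optional[int]) -> Optional[int]:
    """One election round: the minimum identifier among eligible candidates.

    Join condition: a node takes part only if it wants the lock and its
    identifier is not smaller than the current leader's. Treating "no leader"
    as open to everyone is an assumption made for this sketch.
    """
    candidates = [
        ident for ident, wanted in wants.items()
        if wanted and (current is None or ident >= current)
    ]
    return min(candidates) if candidates else None


def run(events: List[Dict[int, bool]]) -> List[Optional[int]]:
    """Apply a sequence of changes to the wants flow; record the lock holder."""
    wants: Dict[int, bool] = {}
    leader: Optional[int] = None
    history: List[Optional[int]] = []
    for change in events:
        wants.update(change)
        leader = elect(wants, leader)
        history.append(leader)
    return history


if __name__ == "__main__":
    print(run([
        {5: True},            # node 5 asks for the lock
        {3: True},            # node 3 asks too, but must not overthrow 5
        {8: True, 5: False},  # 5 releases while 8 is waiting
        {8: False},           # 8 releases, nobody eligible in this round
        {},                   # with no leader, 3 finally gets its turn
    ]))
```

In this run node 3 waits while 5 and then 8 hold the lock, which illustrates the starvation issue mentioned in the talk: a steady supply of larger identifiers could keep a smaller one out indefinitely.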
I'm going to skip this slide. Okay. To conclude this talk, let me say a few words
about other work I've done. Essentially everything I did grew in one way or
another from my early work on scalable reliable multicast here.
I've built a system, mainly in C#, that can deliver data at close to network speeds
and scale to hundreds of nodes with almost no performance penalty. Actually, the
penalty that's observed is due to garbage collection overheads that I can control.
But the main challenge in building this system was due to the fact that with
hundreds of nodes, all of them basically at the maximum capacity, CPU-, I/O- and
network-wise, it's very easy for any little event anywhere in the system to
destabilize the entire system, because these events tend to cause cascading problems.
Reactive recovery techniques usually cause more timeouts and those timeouts
cause even more timeouts and so on.
And so most of the time I spent profiling and profiling carefully trying to
understand the source of these instabilities and preventing them.
Another major line of work was enabling paradigm-agnostic Web applications, and
that includes the demo you've seen earlier. And that was mostly pitched to the
Web services crowd.
And finally I developed a concurrency abstraction for multicore platforms, over
there on the top; it combines the benefits of producing a model. In a nutshell the
idea is that a single component in my system can automatically decompose itself
into multiple replicas, then those replicas can handle method calls in parallel,
and as necessary it can shrink itself back into a single replica and become an
ordinary reactor. All this is happening in a way that's completely transparent to the user
calling into the object. And it's almost transparent to the programmer. The
programmer has to implement some method to split a replica in two and merge
two replicas back, but all the coordination, all the synchronization and scheduling
so on is taken care of by the platform.
And so an object implemented this way can actually continue to go through such
phases, where it behaves like an ordinary reactor and phases where it gets
replicated; it can shrink and grow, in a model very similar to MapReduce.
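The split and merge hooks described above might look roughly like the following Python sketch. Everything here is hypothetical: the `TallyReplica` class, its methods, and the `process` driver stand in for the platform, and Python threads are used only to show the structure, not to demonstrate real multicore speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List


class TallyReplica:
    """A toy object whose state can be split across replicas and merged back."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, key: str) -> None:
        self.counts[key] += 1

    def split(self) -> "TallyReplica":
        # Programmer-supplied: spawn an empty sibling; this replica keeps its state.
        return TallyReplica()

    def merge(self, other: "TallyReplica") -> None:
        # Programmer-supplied: fold the sibling's state back in (counts are added).
        self.counts.update(other.counts)


def process(calls: List[str], workers: int = 2) -> TallyReplica:
    """Stand-in for the platform: grow, handle calls in parallel, shrink."""
    root = TallyReplica()
    replicas = [root] + [root.split() for _ in range(workers - 1)]
    chunks = [calls[i::workers] for i in range(workers)]

    def run(replica: TallyReplica, chunk: List[str]) -> None:
        for key in chunk:
            replica.record(key)  # each replica is touched by one thread only

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(run, replicas, chunks))
    for sibling in replicas[1:]:
        root.merge(sibling)
    return root
```

The caller of `process` sees a single object before and after; the replication in between is hidden, which is the transparency property the talk describes.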
Okay. And, well, in this talk I decided to focus on distributed programming
concepts because I felt that would be the most relevant for you. But I'd be happy
to talk about any of these other projects in case you're interested off line. And
that's all I have today. Thank you.
>> Jim Larus: Any questions? Okay. Let's thank Chris.
>> Krzytof Ostrwoski: Thank you.