>> Jim Larus: It's my pleasure today to introduce Krzysztof Ostrowski, I'm not even going to try the first name, who is graduating from Cornell this year, one of Ken Birman's students, and has done some very interesting work on programming abstractions, programming models for large-scale distributed systems. And he'll tell us about this work today. >> Krzysztof Ostrowski: Thank you. I'm very grateful to be here and to present to you, and I hope that you enjoy my work. Okay. So let me start by introducing the concept of a paradigm-agnostic Web. This has served as a practical motivation for my research. Today's Web technologies are strongly rooted in the client-server paradigm. We assume that content and services always reside on servers in data centers. The clients don't participate in delivering the service or storing the content, and they mostly don't talk to each other, only to the server. In a paradigm-agnostic Web, we would have much more architectural flexibility. So, for example, the content could be stored only on the clients. The clients currently involved in accessing it could form a replicated state machine. Whenever one of them makes a change, that change would be disseminated in a peer-to-peer fashion using reliable multicast or a similar technique. New clients could synchronize with existing ones. So in the extreme case, the clients might never actually need to talk to the server at all. Then, of course, in practice we would be most interested in hybrid techniques, where you have elements of peer-to-peer and client-server interactions: the server could be used to discover the clients, to help them self-organize, to persist the content when the last client is about to depart, and so on. Now, to show you that this idea makes practical sense, I developed a platform. Here's a little demo of the platform in action. On the left you see a user interactively composing a shared document. All the elements being dragged in are just links to live Web content. On the right, the user opened a second replica of it. As you see, whatever we do on one replica is instantly propagated to the other one. You can have multiple such replicas; here they're both on the same machine, but they can be in different places over the network and synchronized across the network using distributed protocols. So this is a little bit reminiscent of Google Wave or Chrome OS, in the sense that all the data you see exists in the network, and the main difference is that this is a paradigm-agnostic document. So any part of it could use any type of distributed protocol as a way of sharing state, including peer-to-peer techniques. Here's another example. This time I have a three-dimensional shared document, and again everything that the user is dragging is connected to a live data source. For example, those buildings are connected to multicast streams that carry color values, and those are used to paint the buildings' roofs. And the planes are connected to streams of coordinates that could come, say, from GPS devices on real airplanes, so you would have a little augmented reality system. Those multicast streams are themselves draggable. For example, in a second you'll see the user open a plane, extract the multicast stream out of it, and drag and drop it onto a window to connect it to a camera, so you can have all sorts of combinations like this. Now, in practice, the choice of the distributed protocol will usually depend on the type of content. So, for example, the list of objects in a shared document could be stored in a Web service.
But then the contents of this little chat window could be stored using replication techniques, and the list of objects in the space could likewise be stored using replication, but the coordinates of an individual plane would be delivered using a gossip protocol. In a paradigm-agnostic application, an element could use whatever communication and storage architecture best matches its workload patterns, its quality of service requirements, or the characteristics of the environment over which this particular component is replicated. Now, there are two types of benefits, and challenges, in pursuing this idea. On one hand, by incorporating peer-to-peer techniques into existing applications, we can potentially improve scalability and latency. We might even lower the resource footprint of the data center, because the clients could do some of the work for us. Then, of course, the challenge is that maintaining predictable quality of service and security in a peer-to-peer setting can be harder. On the other hand, if we managed to somehow seamlessly incorporate all sorts of distributed protocols into the modern Web development tool chain, then we could reap the productivity benefits that these modern tools provide when building arbitrary types of distributed systems, not only those based on the client-server paradigm. And so what I'm personally most interested in, and what I'd like to focus on in this talk, is how we can generalize the existing Web development tools, software engineering practices, and languages to help us manage complexity in this more general distributed systems setting. I think that the main source of complexity in distributed systems is what I call the endpoint-centric perspective. Today, we generally describe the behavior of systems in terms of what happens at an individual node, a single communication endpoint, which we often model as a finite automaton, in terms of low-level primitives such as receive a message, send a message, transition state, fire a timeout, and so on. If we program at this level, most systems are very hard to express, even if you have a cutting-edge programming language like Overlog, as you see in this example: for Paxos you would need about 40 rules like the one shown at the bottom. And that's true even for the version of the protocol described in the research paper; of course, in the real world we have much more complexity to worry about. The Internet is heterogeneous. Consider this example set of eight users and two servers, and suppose they're all involved in either accessing or replicating and storing the content of that little chat window, and suppose they want to coordinate their [indiscernible] using reliable multicast. Now, perhaps these four clients are on a campus network, so they can locally take advantage of IP multicast. These three guys are on a wireless network, so they could ideally use forward error correction, and those guys on the right could be in a data center, so maybe they have some super-fast enterprise message bus available. And across the three domains, a peer-to-peer overlay would be the most efficient communication architecture. Now, it's clear that in order to make the best use of resources in real systems, we should be able to build those systems in a modular fashion by combining protocols, different kinds of protocols deployed in different regions, into a single unified substrate.
And doing this in the endpoint-centric perspective is more difficult, because when you look at this picture, it is quite evident that there are some hierarchical relationships, and those become obscured once we translate everything into C++. So the question becomes: how should we structure our language to make it easier to work with such relationships? If you look at this picture, it is clear that these regions in the middle, the blue and the green, serve as components. Every region recursively delegates some work to the regions contained in it. The protocol that runs in each region is like the code that implements that region. So you can think of it almost like recursive function calls or a recursive data structure, except that these regions are not objects or functions in a formal sense, in a particular language. Intuitively, it's also clear that we could replace the protocol inside a region -- for example, take advantage of IP multicast in the green region -- without necessarily changing the semantics of the entire system. And that would intuitively be a little bit like replacing a member field in a data structure with some other object that implements the same interface in a slightly different but equivalent manner. So we would like to be able to take advantage of those intuitions, and in the next section I'm going to describe a programming model that was inspired by this idea. >>: What do you mean by recursion there? >> Krzysztof Ostrowski: In the sense that each region has some work to do -- for example, recover from losses -- and if you look at this structure, the region contained in it has essentially the same work to do. In a sense, the same type of work is being done, and that can continue recursively. In my model, an object is formally defined as a sort of software component distributed across the network. The components that make up an object are called replicas. So here we have an example with three users accessing a shared document, and two objects. The replicas of the application object are instances of the user interface that displays the document on the screen, and the replicas of the multicast object are instances of a reliable multicast protocol used to disseminate the updates between the clients. Pairs of replicas located on the same machine communicate by method calls and callbacks, which we can model abstractly as events. By the way, an object could have infinitely many replicas: formally, it contains all the replicas that exist right now or will exist in the future. Now, let me pull the two objects to the sides. All the events they exchange are shown as green arrows in the middle. Formally, the set of all events of the same type flowing between a pair of objects is what I call a distributed data flow, or flow for short. So, for example, the send flow would be the set of all multicast requests generated by the application -- the set of all those green arrows, all that data. And just like objects, such flows can span infinite time and infinitely many locations, since formally a flow contains all the events that are occurring right now, occurred in the past, or will occur sometime later. Now, with these definitions we can represent the architecture of our system in a way that resembles a Petri net. This new flow you see at the bottom is the set of all callbacks that carry the received updates from the multicast layer into the application.
And now we can think of the multicast object as a sort of function that transforms one input argument, the send flow, into one output argument, the receive flow. Almost like a function in Java, except that where in Java we would transform integers, booleans, and other primitive values, here we're transforming entire distributed flows. Here's a possible signature for the multicast object in my language. It resembles a function definition, with the input flows listed in parentheses and the output flows following a colon. The name of a flow is always preceded by a tab, as you can see. The type of the flow determines the types of values that are carried in each event. So here we assumed that the updates would be represented as strings; that's the type. The type also contains optional behavioral constraints: so, for example, receive was declared to be consistent and fresh. Those behavioral constraints tell us something about the spatial and temporal patterns of events in the flow. And to give you a taste of what this may mean: formally, we model each event as a quadruple -- the location and time at which it occurs, a version k, and a value v. For example, in the receive flow, the version might be the sequence number of the received update and the value would be the payload, the content. With this formalism, we can now define different behavioral constraints by relating pairs of events that occur at different times and places. For example, I say that a flow is consistent if, for any pair of events, no matter where or when they occur, having the same version implies having the same value. It makes sense that the receive flow should be consistent, since you wouldn't want two completely different updates delivered with the same sequence number. In general, the consistent flows in my model will usually carry information about various decisions that have somehow been made in a coordinated manner in the protocol. A special kind of consistent flow is a monotonic flow, where a larger version implies a larger value. As you will see later, monotonic flows carry information about decisions that have somehow been remembered by the system, persistently. I'll explain what that means later. More such constraints are defined in the paper. If time permits, I will show you how we can use this to drive compiler optimizations. Are there any questions at this point? >>: What is fresh? >> Krzysztof Ostrowski: Fresh. It basically means that nobody is lagging behind: if somewhere in the network an event with a certain version appeared, then on every correct node at least that version, or a higher one, will also appear. Now, meanwhile, let me explain how we can model larger systems using this approach. So here's an extended version of our previous example. Suppose that the multicast protocol now has a modular structure. First it disseminates updates using some unreliable method; that's taken care of by component U. Then there's loss recovery; that's component C. I can model the two as two new objects. Likewise, suppose that the replicas of the loss recovery object self-organize using membership information obtained from some cloud service; we can identify the protocol for contacting that service and model it as this new orange object in the middle, and so on. You end up with a sort of extended diagram in which different objects span different sets of locations. And from now on I'm going to always represent flows as those yellow sandwiches here.
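Before moving on, to make the event and flow definitions above a bit more concrete, here is a minimal sketch in Python (my own illustration, not the actual platform code; the names `Event`, `is_consistent`, and `is_monotonic` are invented for this example). It treats a flow as a set of quadruples and checks the consistency and monotonicity constraints roughly as just defined; a freshness check would additionally need to know which nodes are correct, so it is not shown.

```python
from dataclasses import dataclass
from typing import Any, FrozenSet

@dataclass(frozen=True)
class Event:
    location: str   # where the event occurs
    time: float     # when it occurs
    version: int    # e.g. the sequence number of an update
    value: Any      # the payload; must be hashable here (a string, a frozenset, ...)

# A flow is a set of events spread over space and time.
Flow = FrozenSet[Event]

def is_consistent(flow: Flow) -> bool:
    """Same version implies same value, no matter where or when."""
    by_version = {}
    for e in flow:
        if e.version in by_version and by_version[e.version] != e.value:
            return False
        by_version[e.version] = e.value
    return True

def is_monotonic(flow: Flow) -> bool:
    """Consistent, and a larger version implies a larger value
    (numeric order for numbers, set inclusion for frozensets)."""
    if not is_consistent(flow):
        return False
    events = sorted(flow, key=lambda e: e.version)
    return all(a.value <= b.value for a, b in zip(events, events[1:]))
```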
Now, if you look at this diagram, you see that flows in this model serve as typed contracts that decouple objects from one another, very much the same way that typed interfaces decouple Web services in an ordinary workflow. Now, this diagram may be useful as a way of modeling systems, but it's not an executable program. What I can do now is zoom in on portions of this diagram and recursively decompose each object into smaller parts, to the point where everything in the diagram is expressed in terms of a handful of basic building blocks. Let me zoom in on loss recovery as an example. Now, the contract says that there's a single input flow carrying into the object information about which updates have been received at the different nodes across the system, and my job is to generate a flow of decisions about when updates are stable everywhere. I'm going to model each update -- I'm sorry, I'm going to model each event as carrying a set value, a set of integers, where each integer represents the sequence number of one update. So suppose those two events appear in this input flow: at the replica of loss recovery on machine Bob, the event means that Bob has the updates from one to two and from four to seven, and at the replica on machine Alice, that Alice has the updates from one to three, and eight. Now, by the way, I'm going to assume here, and in all parts of the model, that whenever a flow carries this type of information, every event carries complete information. So here, every event will contain the identifiers of all the updates that have been received, even if they've already been reported once. This little trick makes it much easier to reason about progress in the system: I can always express progress as monotonicity, as I'll explain later. And this is something that can very easily be taken care of by the compiler, so the user doesn't have to bother with it. Okay. So we have those set values in the object now, on the different replicas. One thing loss recovery has to do, as I mentioned, is tell us when updates are stable across the system, so we can do cleanup. Having that information, we can just calculate a set intersection. The result would be one and two, and that's what should appear as events in the stable flow. So this is very simple, and it's actually one of the building blocks in the model: an aggregation. Aggregations can have different flavors; the particular kind used here is described by the blue keywords at the bottom, and I'll explain those in a minute. Meanwhile, let me just give you some intuition for how we can model other parts of this object. Suppose that we aggregate the events in the input flow using the union operator. That will produce the set of all updates that are present anywhere in the system, because we add up all the different sets of identifiers that everybody has reported. And now, by subtracting each node's locally reported values from that union, we can calculate a flow of information about which updates are missing on which nodes. You see how, in this fashion, by successively applying aggregations and other transformations, we can build up the flow of recovery requests that is carried out of the object.
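Just to illustrate the arithmetic of the loss recovery example above, here is a tiny sketch in Python (my own illustration; the real building blocks operate on distributed flows rather than on in-memory dictionaries, and the node names are made up):

```python
from functools import reduce

# Each replica reports the complete set of sequence numbers it has received,
# e.g. Bob has updates 1-2 and 4-7, Alice has updates 1-3 and 8.
reports = {
    "bob":   {1, 2, 4, 5, 6, 7},
    "alice": {1, 2, 3, 8},
}

# Aggregation with the intersection operator: updates stable everywhere.
stable = reduce(set.__and__, reports.values())           # {1, 2}

# Aggregation with the union operator: updates present anywhere in the system.
present_anywhere = reduce(set.__or__, reports.values())  # {1, 2, 3, 4, 5, 6, 7, 8}

# Subtracting each node's own report from that union gives, per node,
# the set of missing updates, i.e. the recovery requests.
missing = {node: present_anywhere - have for node, have in reports.items()}
# missing == {"bob": {3, 8}, "alice": {4, 5, 6, 7}}
```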
And now, if you compare this representation of a distributed protocol with the code in C++, or even with the rules in Overlog, clearly this is much higher level. We can clearly see what kinds of information are involved in the protocol and how the protocol makes decisions; those relationships are presented without being cluttered by various low-level communication details and such. At the same time this is a constructive representation, because we have a very small number of those building blocks and we know how to implement them. And this is also something we can formally reason about. I'm not going to present too many formulas in this talk, but most of it is described in the paper. Aggregation is a transformation on flows where, for every event in the output flow beta, shown on the right here, we have some finite set of events in the input flow alpha that happened earlier, such that the value carried in this output event is an aggregate of the values in those input events. So that's the important part. Now, the different kinds of aggregations -- the different physical protocols, or the different formal flavors of aggregation -- will of course differ in the way we select those input events. For the kinds of aggregation that we care about, we can characterize this selection in a slightly more explicit way. Without going too much into detail, we can characterize an aggregation by two families of functions: memberships and selectors. Intuitively, for every event in the output flow, the memberships tell us which locations participated in the aggregation from which this event was generated -- which nodes had to be contacted to generate that event. Then, for each of those locations, the selectors tell us what was the version of the event that this particular location contributed. And since in this model a location and a version determine a value uniquely, that's all the information I need to completely write out the values in the output flow in terms of the values in the input flow, as shown by the formula at the top. The details aren't that important. But what I want to point out is that, having such a formal characterization, I can now distinguish different flavors of aggregation by placing various constraints on the memberships and selectors. So, for example, I'll say an aggregation is in-order if the selectors are monotonic; intuitively, that basically means that whenever I aggregate, I'm using all the latest information I've got. Or a complete aggregation will be one in which, once a non-faulty node makes it into the membership, it will be in every membership with a higher version -- meaning nobody is left out without a good reason -- and so on. Now, with all those formal definitions, I can reason about the object flow diagram -- about the system's behavior -- abstractly, without having to worry about the specifics of a particular protocol. I only need to know what flavor of aggregation a given protocol implements to know how it transforms my flows. And theorems like the one shown at the bottom, which are mostly in the paper, basically become type-checking rules. For example, here, if I know that the input flow is of a certain type -- I think this one is weakly monotonic -- and the aggregation is guarded, coordinated and so on, then I know that the output flow is strongly monotonic. I can apply such a theorem as a type-checking rule for every edge in my object flow diagram, and by doing so for all the edges, I can convince myself that the system has the sort of behavior I wanted it to have.
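To make the memberships-and-selectors characterization slightly more concrete, here is one way it might be written out (this is my own reconstruction from the description above, not necessarily the exact notation of the formula on the slide or in the paper). Writing $\beta$ for the output flow, $\alpha$ for the input flow, $M_k$ for the membership behind the output event with version $k$, $S_k(m)$ for the version selected at location $m$, and $\oplus$ for the aggregation operator (say, set intersection):

```latex
\beta_{\mathrm{value}}(k) \;=\; \bigoplus_{m \,\in\, M_k} \alpha_{\mathrm{value}}\bigl(m,\, S_k(m)\bigr)
```

Under this reading, an in-order aggregation is one in which each selector $S_k$ is monotonic in $k$, and a complete aggregation is one in which a non-faulty location that appears in $M_k$ also appears in every $M_{k'}$ with $k' > k$.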
>>: I'd like to ask you a question. You could apply this kind of rule to the abstract diagrams that you had previously. But can you infer these properties from [indiscernible] implement, or could you check that it implemented these properties correctly? >> Krzysztof Ostrowski: Presumably you could do that in the future -- derive the properties from the code. I don't know how to do it yet. But I envision it might require some assistance, maybe annotations in the language or special language features, to infer those. But I believe that's possible. Okay. Now, aggregation is a basic building block in many systems -- for example, in map [indiscernible] data frameworks, like DryadLINQ and so on -- but the kind of aggregation that we need here, as a building block for constructing distributed protocols, is a little different. In particular, one especially important kind of aggregation, which can be used to express almost anything, has a notion of memory. Let me explain what that means using an example. Suppose we have three replicas of an aggregation object consuming the values shown at the bottom, calculating a set intersection, and producing the values at the top -- transforming the input flow into the stable flow, as I described before. Now suppose the machine on which the value is calculated suddenly crashes. Meanwhile some other node arrives, joins the system, and participates in the very next aggregation. And suppose that this node is totally new: it hasn't caught up with everyone else, hasn't received any update. So the value that appears locally in its input flow, reported by the layer below, is an empty set. No updates. And that, of course, causes the result of the aggregation to also be empty. And then we have a problem, because previously we reported that items two and three are stable. Maybe some other part of the system already acted upon that information, and now we've changed our mind. But something reported stable should remain stable forever, so this is unacceptable. In order to use an aggregation object in our system, the object needs to somehow remember what it has reported in the past and make sure that whatever it reports in the future is consistent with those past reports. Now, if you recall, every event in this model has a version, and if a flow is monotonic, a higher version implies a larger value -- that would be set inclusion in this case. So if I know that the result of the newer aggregation, after the crash, is tagged with a higher version number, and if I know that my protocol produces a monotonic flow, then I know that this problem will not occur. And theorems like the ones I've shown you on one of the previous slides can actually help us prove that certain flows are monotonic, and help us understand exactly how we need to construct our aggregation protocols to produce the desired behavior. And this particular kind of aggregation, the monotonic aggregation, is very easy to implement if you have access to a membership service, which was the case in our previous example, if you remember. Now, one aspect of monotonic aggregation that's quite unique, and that I want to elaborate on, is the possibility of implementing it in a scalable manner on a heterogeneous system, without needing a single global membership service. Consider this example: a group of ten nodes spread across three administrative domains. I'm going to align them at the top there, and in the middle I'm going to draw the components that run on them.
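Before getting to that hierarchical setup, here is a minimal sketch of the memory behavior just described (illustrative Python; I'm assuming versions simply increase by one per aggregation round, and the class name `MonotonicIntersection` is invented). The object merges every new intersection with everything it has reported before, so its output stays monotonic even when a freshly joined, empty replica takes part in the next round:

```python
class MonotonicIntersection:
    """Illustrative stability aggregation that never retracts a decision."""

    def __init__(self):
        self.version = 0
        self.remembered = set()   # everything ever reported as stable

    def aggregate(self, reports):
        """reports: dict mapping node name -> set of received update ids."""
        round_result = set.intersection(*reports.values()) if reports else set()
        # Merge with past reports so a higher version always carries a superset.
        self.remembered |= round_result
        self.version += 1
        return self.version, set(self.remembered)


agg = MonotonicIntersection()
agg.aggregate({"a": {1, 2, 3}, "b": {1, 2}})     # -> (1, {1, 2})
# A brand-new node "c" joins with an empty report; a naive intersection
# would now drop back to the empty set, but the remembered value does not.
agg.aggregate({"a": {1, 2, 3}, "c": set()})      # -> (2, {1, 2})
```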
So every vertical segment of the screen shows the components running on one of the nodes. Now, suppose that on the campus there is an aggregation object. Every node on the campus network has a replica of it, and those replicas self-organize using a local membership service that's deployed to manage the campus. I'm going to assume that in every administrative domain, bigger or smaller, there's one such membership service deployed to manage it, and that there's some mechanism in place that allows a node to automatically find the locally deployed service and register with it. Now, consider the part of the input flow confined to the campus, shown here. This flow is transformed by the campus membership -- I'm sorry, by the campus aggregation object -- to produce that flow at the top: information about which updates are stable across the campus. So we have one input flow, one output flow, one object, and no matter how many times the nodes on the campus crash and recover, we still have a single output flow with a single property: monotonic. Now, suppose the same happens in the other two domains, at the airport and in the cloud. In each domain, nodes will locally self-organize using a local membership service -- this one and this one -- and form a single aggregation object, which will transform its part of the input flow to produce a single output flow of information about what is stable across that domain. So now we have three stable flows. And further suppose that in every domain a leader node is selected -- and that leader might actually change dynamically, it doesn't really matter. The selected leader passes this local information further up, to a higher level of our infrastructure, so we're going to have this kind of hierarchical implementation. These three leaders then use a membership service deployed for the wide area network to self-organize into a single yellow object: an aggregation object of the same flavor. By the way, each of those four objects can now be implemented in a completely different manner, using whatever mechanisms are the most appropriate for that particular domain; the only thing I need to assume is that all these aggregations are of the same flavor. This is where we can introduce heterogeneity into the architecture. I'm going to refer to such a heterogeneous hierarchy of aggregations as an aggregation network. So now the input flow for the yellow object is the union of the output flows from all the domains below it, and its output flow is information about what is stable across the entire system. Using our approach, I can model this entire dynamically reconfiguring hierarchy with a single object flow diagram, as I've explained before, where every input flow is the union of the output flows of the aggregations below it. And this diagram is very simple because we have a very small system with only three domains. In general, of course, nodes might fail and reboot, new nodes might arrive, and those new nodes might reside in different administrative domains, so entire new domains might be introduced to this hierarchy, new levels in the hierarchy might be created, and so on. We can still model such a dynamic, arbitrarily nested hierarchy as an object flow diagram, except that this diagram may be infinite: it may be infinitely tall, and every element in it can have infinitely many children. Those infinitely many children come from the fact that if new nodes keep arriving from new domains, they keep getting appended to the hierarchy -- that's where the infinite children come from.
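A toy version of such an aggregation network might look as follows (illustrative Python; real domains would run different protocols, reconfigure dynamically, and tag their outputs with versions). Each domain intersects its local reports, a leader forwards the domain's output upward, and the top-level object intersects the domain outputs; because every level applies the same flavor of aggregation, the hierarchy yields the same answer as one flat, global aggregation:

```python
def domain_aggregate(reports):
    """One domain's aggregation object: intersect the reported sets."""
    return set.intersection(*reports) if reports else set()

# Three administrative domains, each with its own membership service and
# its own protocol, all implementing the same flavor of aggregation.
campus  = [{1, 2, 3}, {1, 2, 4}, {1, 2, 3, 4}, {1, 2}]
airport = [{1, 2, 3}, {1, 3}, {1, 2, 3}]
cloud   = [{1, 2, 3, 4}, {1, 2, 3}, {1, 2}]

# Each domain's leader forwards the local result to the wide-area level.
domain_outputs = [domain_aggregate(d) for d in (campus, airport, cloud)]

# The top-level object consumes those flows and reports what is stable
# across the entire system.
global_stable = domain_aggregate(domain_outputs)          # {1}

# Same answer as a single flat, global aggregation over all ten reports.
assert global_stable == domain_aggregate(campus + airport + cloud)
```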
Fortunately, I have proved that any such hierarchy constructed out of monotonic aggregations -- potentially different protocols, each of which implements a monotonic aggregation -- behaves, in effect, as if it were a single global monotonic aggregation. That's good news, because it means that whenever I need to use monotonic aggregation as a building block in my system, as was the case, for example, in the loss recovery protocol you've seen earlier, I can always implement that building block in a hierarchical manner, using the approach I just outlined, as a hierarchical aggregation network. And it turns out that monotonic aggregation, the same building block you've seen, can be used to express many commonly used protocols, such as distributed locking, leader election, and various forms of agreement. It seems that we should also be able to model classical consensus protocols in this fashion, although we have to work out some details. So potentially, for a very wide range of systems, we can implement them in a scalable, heterogeneous manner by following this approach. >>: Seems like if you had a cloud that, let's say, didn't want to be involved in all the peer-to-peer transfers, all the high-bandwidth streams, but was just sitting there to do the ordering, basically, and maybe to decide what is the -- wouldn't that make this problem a lot simpler? Because it would determine the order; it would determine what is stable. >> Krzysztof Ostrowski: Right. You could use that. You could use the cloud only to help you manage membership, or you could use the cloud for ordering. Of course, then, the implementation of each aggregation object would be much simpler. >>: Why wouldn't you want to do that? >> Krzysztof Ostrowski: Potentially, sometimes you may want to do the work on the client side, right? Because maybe, I don't know, some group of clients is on a network where they can talk to each other very efficiently, and it is just faster for you to organize this communication, this coordination, in a peer-to-peer fashion. >>: So can you give me an example of a case where you couldn't wait for the server to time [indiscernible] as -- >> Krzysztof Ostrowski: Well, one classical application where people use these kinds of protocols is, for example, Wall Street. So in those applications you might want to do that. There are also many other scenarios. Actually, part of our funding comes from the Air Force, and there are a lot of scenarios in which, most of the time, you just don't have uplink connectivity to the data center, and you still want to run your collaboration. In fact, they want to have Web applications without having the Web, right? And those are the types of applications where you're forced to use peer-to-peer approaches and act just like this. Okay. Now, just to conclude this section, let me show you a little bit of a performance result. This is a simulation of a system built, using the methods I described earlier, from the model. It's a simulation, but it's realistic: we have actual point-to-point message exchanges, nodes form token rings, tokens circulating around the rings do the aggregation, and so on. Nodes all independently fail and reboot; after rebooting they have clean state and have to reinitialize, and so on. The system size varies from 62 to 32,000 nodes. The mean time to reboot is shown on the X axis -- I'm sorry, the mean time to fail. The mean time to reboot is five seconds.
And we simulate a simple barrier synchronization protocol, where we want nodes to enter subsequent phases of processing in a coordinated fashion, so that no two nodes are more than one phase apart. That is expressed as a single monotonic aggregation, and that aggregation is then implemented as an aggregation network, in a hierarchical manner, with local membership services managing the different regions and the different layers. Each aggregation object in the hierarchy aggregates ten times per second, but then of course it takes time for the information to surface to the root. So we measure how long it takes for the system overall to enter the subsequent phases of processing, and that gives us an estimate of the round-trip time on the Y axis. One thing that we notice is that the latency with which the system works grows logarithmically: you see that every time we grow the system by a factor of eight, the latency grows by the same amount. This is consistent with what we would expect theoretically. The second thing we notice is that this structure is quite resilient to churn: at a rate where half the system is being reconfigured at any given time, we can still work with a 20 percent performance penalty. And that's because every aggregation object in the hierarchy basically handles its own failures independently. So when a single node fails, it affects only a small subsystem and doesn't destabilize the entire network, as would be the case if you used a global membership service: across a 32,000-node system, all these failures would cause global membership changes. You could batch them, but by the time you start batching them, you'd increase the response time, you wouldn't handle failures in a timely fashion, and you wouldn't be able to do any work. There are more results in the paper, but I don't think I have that much time. >>: So this is what -- [indiscernible] aggregation. >> Krzysztof Ostrowski: Right, this is a balanced tree. In reality you'd have to work harder to build a balanced tree; that's a part that I didn't evaluate in this paper. Okay. So here's an example of a partial implementation of the loss recovery object you've seen before. You will recognize the object signature at the top; I've explained it before. Now, for the body, one way to think of it is essentially as a textual representation of the object flow diagram you've seen before, shown below here. For now, please disregard the grayed-out statement and just focus on those three highlighted assignments. Every occurrence of an arithmetic operator -- anything that looks like a function call, in general -- is an in-place declaration of an embedded object. You have three such occurrences, three embedded objects, two of which are aggregations. We can declare internal flows just like we would declare local variables. By the way, the order of the statements doesn't matter in this program, because this is not a sequence of statements that runs to completion; there's nothing sequential about this program. Basically, these are all just declarations, implicitly concurrent with respect to one another. And every time we pass some arguments to an operator or consume its result, we declare a flow dependency -- we declare those black arrows in the diagram. Now, notice that we don't need to specify the flavor of aggregation. Certain things are assumed: most of these properties are taken for granted unless you specifically say that you don't want them. One of them is implemented by the grayed-out statement; I'll talk about it in a minute. And finally, notice that we never talk about membership here.
It's implied: in order to implement certain types of operators -- for example, a monotonic aggregation like the one used here -- you either need to use a global membership service to self-organize, or you have to construct a scalable aggregation network like the one I showed you before, and then you use many local membership services for that. In any case, there's no need to burden the programmer with any of this. The programmer doesn't really care whether there's a membership service somewhere in the system, as long as all the flows are transformed the way you want them to be. So we consider this an implementation detail that's taken care of by the compiler, or even at deployment time. Now, just very briefly, what it means to translate, to compile, such a program is basically to generate an object graph like the one on the right, where the root component is a replica of the object you're defining. Whenever you have an embedded object, you create a replica of it; since these are building blocks, the code for them is pregenerated for you. And the structure of the program determines how events are going to fly around this infrastructure within each replica. I'm not going to go into detail, but basically you can think of production rule systems, like those based on the [indiscernible] algorithm, where you have graphs of elements, and every time one of them gets updated, that causes other elements to get updated accordingly. Okay, now back to our language. Just like in any high-level language, we have a conditional statement, the where clause. The argument here in parentheses is a boolean flow -- any kind of boolean flow; this one is represented by the highlighted expression. And the way it works is that whenever a true event appears in the boolean flow at my replica, locally the body of the where clause gets activated; whenever a false event appears, or when there was no event, the body gets deactivated. So in a typical language, conditional statements tell us which chunk of the code gets executed -- it's a binary yes or no. Here, a conditional statement tells us which subset of replicas is running that code; in this case, which subset of replicas is doing the aggregation. To deactivate the body of the code basically means to suspend all the local flows of events. In the case of aggregation, that means that the local replica will not submit its own local values to the aggregation -- it will abstain from the aggregation. So the meaning of this entire statement here is that the only replicas that will vote on which updates are stable are the replicas that, first, know the latest value in the stable flow and, second, have all the updates that are considered stable. Which makes sense, since if some node doesn't have the updates that the system considers to be stable, then it can't consider itself to be part of the system, right? It has to abstain from the election, from the aggregation. So conditional statements like this can be used to model all sorts of join conditions that you have to put new members through to ensure that they're synchronized with everyone else. Okay. To finish, let me show you one complete example: a very simple distributed locking protocol implemented using leader election. We have a single input flow, a boolean flow. A true value at a given location in that flow means that the particular node wishes to acquire the lock, and a false value means the node no longer needs the lock.
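Before continuing with the locking example, here is a rough sketch of the where-clause semantics just described (illustrative Python; `caught_up`, `guarded_stability_round`, and the replica records are invented names, and the real system is event-driven rather than evaluated round by round). A replica whose guard is false simply abstains, so only replicas that have caught up get to vote on stability:

```python
def caught_up(replica, latest_stable):
    """Join condition: a replica may vote only if it already holds every
    update that the system currently considers stable."""
    return latest_stable <= replica["received"]

def guarded_stability_round(replicas, latest_stable):
    # Only activated replicas (guard == True) contribute to the aggregation;
    # the others abstain, as if their local flows were suspended.
    votes = [r["received"] for r in replicas if caught_up(r, latest_stable)]
    return set.intersection(*votes) if votes else latest_stable

replicas = [
    {"name": "alice", "received": {1, 2, 3}},
    {"name": "bob",   "received": {1, 2, 4}},
    {"name": "carol", "received": set()},    # just joined, not caught up yet
]

# Carol abstains because she lacks the updates already declared stable,
# so her empty report cannot retract the earlier decision.
print(guarded_stability_round(replicas, latest_stable={1, 2}))   # {1, 2}
```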
Now, when an event with a true value appears at a replica, this causes the body to be locally activated, which means that the replica of the election object on that node gets activated, which means that we will participate in the aggregation -- I'll show you that in a second. So once we communicate our intent to acquire the lock, our local unique identifier -- this is just a built-in flow -- gets submitted into that replica. And once we get elected as the leader, we remain elected as the leader; again, I'm going to explain that in a second. So once we find out that the leader's identifier is the same as ours, we can respond to the application that yes, we've got the lock now, and we know that we will continue to be the lock holder until the moment we give it up ourselves. And here's the implementation of the scalable election protocol. We have a single input flow carrying the identifiers of the candidates for election; we elect a leader by just picking the minimum identifier among the candidates and reporting it back. Now, to ensure that the result of the election is stable, as I said we need it to be, I'm going to require that I can only participate in the election if I know who the current leader is and my own identifier is not smaller than the leader's. If my identifier were smaller and I participated in the election, then I would overthrow the leader -- I would become the leader -- and I would violate this condition. So this join condition ensures that new nodes coming into the system will not cause a change of the elected leader until the moment the leader gives up. And so this is actually a correct protocol; the only problem is that it can lead to starvation, but that can be fixed. I'm not going to go into the details today; I'm going to skip this slide. Okay. To conclude this talk, let me say a few words about other work I've done. Essentially everything I did grew in one way or another out of my early work on scalable reliable multicast. I've built a system in C# that can deliver data close to network speeds and scale to hundreds of nodes with almost no performance penalty; the penalty that is observed is due to garbage collection overheads that I can't control. But the main challenge in building this system was that with hundreds of nodes, all of them basically at maximum capacity -- CPU-, I/O- and network-wise -- it's very easy for any little event anywhere in the system to destabilize the entire system, because these events tend to cause cascading problems. Reactive recovery techniques usually cause more timeouts, and those timeouts cause even more timeouts, and so on. So most of my time was spent profiling, and profiling carefully, trying to understand the sources of these instabilities and to prevent them. Another major line of work was enabling paradigm-agnostic Web applications, and that includes the demo you've seen earlier. That work was mostly aimed at the Web services crowd. And finally, I developed a concurrency abstraction for multicore platforms -- over there at the top -- that combines the benefits of two programming models. In a nutshell, the idea is that a single component in my system can automatically decompose itself into multiple replicas, those replicas can handle method calls in parallel, and, as necessary, it can shrink itself back into a single replica and become an ordinary reactor. All this happens in a way that's completely transparent to the user calling into the object, and it's almost transparent to the programmer.
The programmer has to implement a method to split a replica in two and a method to merge two replicas back, but all the coordination, synchronization, scheduling and so on is taken care of by the platform. And so an object implemented this way can go through phases where it behaves like an ordinary reactor and phases where it gets replicated; it can shrink and grow, in a model very similar to MapReduce. Okay. And, well, in this talk I decided to focus on the distributed programming concepts because I felt that would be the most relevant for you, but I'd be happy to talk about any of these other projects in case you're interested, offline. And that's all I have today. Thank you. >> Jim Larus: Any questions? Okay. Let's thank Chris. >> Krzysztof Ostrowski: Thank you. [applause]