>>: So it is my great pleasure to introduce to you Nalini Belaramani, who is interviewing here from U.T. Austin. Her advisor is Mike Dahlin. She's done work in a number of different areas in distributed systems. Her thesis work is concentrated on distributed storage systems, which is what she'll be telling us about today. >> Nalini Belaramani: Hi. Thank you for the introduction. I'm Nalini Belaramani, and I'm here to talk about my work in storage systems. So data plays a very important role in our lives. We have it almost everywhere: in our laptops, in our mobile devices, on servers, on enterprise servers. The importance of data cannot be -- I was just talking to J.D. about, you know, backups and stuff. I just lost my data recently, and it's a pain when you lose data. So it becomes important for data storage systems to give us certain guarantees. We want our data to be durable, so if we store data there, it should be accessible at any time. We want it to be available, so when I want to use it, I can use it, and we want good performance so we can get to it quickly. A common technique which has been used recently is replication. You make copies of data and store them on different machines, and that helps durability, because there's an extra copy, so if one fails, you can access the other one. And it also helps performance, because you can access the copy closest to you. Now, replication is not a new solution. And I'm not saying I did this. It's been done for years. In fact, we've had so many systems over the past years using replication as a technique for data storage systems. Now, this is only a subset of the systems which have been published. And there are a lot more out there. The thing to notice is that even though we spent 20 years on this, new systems are still being developed. And the question to ask is why? I mean, haven't we beaten this to the grave?
The reason is, every storage system needs to make these fundamental trade-offs between consistency, availability and partition resilience. You cannot have all three properties. It's been proven. You can only have two in your system. So each of these existing systems, depending on their workloads or their goals, fits into a different point in these trade-offs. And so when you actually have new technologies or new workloads, they require new trade-offs, and so what you do is you build new systems. Now, the problem is building a new system takes time. You spend months and years to build a system. You spend a lot of your effort on reengineering the wheel. You're building from scratch how data's stored, how to maintain bookkeeping, how to transfer updates. And, you know, instead of being spent on the higher level goals, your effort's being spent on re-implementing all of these little things again. Why don't we modify an existing system? The problem is, in many existing systems, the implementation is very much tied to the original design and its trade-offs. So if they -- yes? >>: I mean, Coda has [inaudible]. So really it's been modified for functionality. >> Nalini Belaramani: Yes. >>: For ten years, yes, it's been. So is that really true they're not modifying for new functionality? >> Nalini Belaramani: No, I think it has been true. But the thing is every modification takes such a long time. >>: That's why it's a Ph.D. thesis. >> Nalini Belaramani: That's why it's a Ph.D. thesis, exactly. It takes ages to modify them, because if you want to add a new feature, you have to modify a whole bunch of code, and, you know, it makes your life difficult. Why can't we just make this process easier, like is there a way in which I can build a system easily? And this is what I want to do, you know.
So when we were thinking about this problem, we thought, how about we use a microkernel approach? We have the basic mechanisms. I mean, basic mechanisms are original, consistency bookkeeping. We have a good set of mechanisms. Every system can be developed by implementing different policies over these mechanisms. So that would make your life easier. We have this approach. I went to Professor Dahlin and told him, hey, I think I know how to build systems easily. He's like, all right. Why don't you build ten systems for me? Otherwise, you don't graduate. And I'm like oh, shit. [laughter] >> Nalini Belaramani: I knew what should go in the mechanisms, but I had no idea how to specify policy. Like how do we actually build these systems on that? I'm like how do I -- like, uh-uh. And I thought I wouldn't graduate, I thought I'd be 45 years old and still working and I'd see my children graduate. Yeah, let's not get there. >>: Mike would have a beard. >> Nalini Belaramani: Yes, by then he would still have a beard. So I went back and re-looked at all the systems and I figured out a way in which we can specify policy easily. That is what this talk is about, how to make policy specification of systems easy. And thankfully, it worked. And instead of ten, I gave him twelve, and these systems fit into different parts of the design space, and they cover different consistency semantics and different ways updates are propagated, and the cherry on the ice cream is that we did all of them in less than 100 lines of code. Now, you're probably wondering how we did this. Don't worry. The next 20, 40 minutes you're going to find out all about this. So how do you evaluate the merit of this work? Good research has always been categorized as being either haiku or judo. Haiku is a Japanese form of poetry, which emphasizes elegance in its small form.
Having compact and small primitives which make it easier to develop systems. And judo is, you know, a Japanese martial arts form. So for a lay person, it's very difficult to appreciate haiku, because you may not be able to appreciate the elegance. But if you meet a judo master in a bar, you very quickly know he's a master once you get into a fight with him. So this work, again, has both aspects of haiku and judo. We have identified the small set of primitives which distill all these abstractions of distributed computing systems. It may be difficult to appreciate at first go, but we took these primitives and actually implemented 12 diverse systems, and that's where our judo comes in. So in the next 40 minutes or so, I'm going to first talk about the two key ideas which went into making the policy specification easy. And then I'm going to talk about our policy infrastructure, our framework, over which you can build the systems. And the last part is how we actually went about building them, or the evaluation of the systems we built on those. So the first key idea is the microkernel approach. We have these mechanisms, they do grunt work for you, and the policy is where the design is. And so the next big question is, how can we make policy specification easy? If you look at the multitudes of systems, and if you take a step back and look beyond their technical details or their optimizations, you realize that they are actually trying to answer three major questions. One is, where is the data stored? Second, how are updates or information propagated among these nodes, and the third is what are the consistency or durability requirements you want. So the first two questions actually boil down to routing. It's just how data moves around these nodes. The third one is just blocking.
Basically, you block access to data until these semantics are guaranteed. So let's have a look at routing. So we took a couple of systems to compare. So Bayou, for example, sets up peer-to-peer connections to transfer updates. Chain replication moves updates through a chain. Coda is a client server system and so most of the information transfer occurs along client server lines. TierStore uses a hierarchical tree-based system to propagate updates. So essentially, that's what you're doing, just setting when, to which node, what data and how data flows through the system. It's just routing. The second thing is blocking. So if you look at consistency, what consistency tries to do is answer the question, which version of data can I access. If I'm going to access only valid data and my local version is either old or it's invalid, all I have to do is just block until I can get valid data. Some systems have a stronger durability requirement in which, for example, in a client server system, you want to make sure that a write is on a server before it's considered complete. So when I issue the write, before I return, I need to block until I know for sure that it's on the server. And just having blocking and routing is actually sufficient to do a lot of policy specification. Yes? >>: So when an operation is blocked on something, does that convey some kind of priority to the system about the relative importance of that operation or, say, that write being made persistent as opposed to some other write, or is that not a relevant aspect of the systems that you're trying to capture? Is it not something that they did? Something like if I had multiple operations, which one -- >> Nalini Belaramani: So from what I understand your question, rephrasing, is if for those systems, if they did a write, they differentiated between writes which were blocked and which couldn't be blocked? Is that what you're trying to ask?
>>: [inaudible] trying to infer the safety guarantees, or is this a guarantee being provided by whether they block or not? >>: I guess I was just asking, if you have multiple operations, is there ever a need to convey some kind of priority between the operations, which ones you would like to complete sooner? Did any of these 12 systems that you've reimplemented do that in their original implementation, is that important, and is that something that's going to be captured in the implementation? >> Nalini Belaramani: I can't remember. We have a scheme for implementing priorities. You can tell it this operation needs to be prioritized over the other operation. But I can't remember right now if they did that. I know our implementations didn't. And so I'm guessing they didn't either, because we tried to follow their implementation as closely as we could. But there is worth in prioritizing. So as we said, you know, if I do have an operation which is blocked, I need to count on routing to actually make sure that I get the data so that I can unblock. So in some sense, blocking guarantees the safety of the system and makes sure that you're not doing anything unsafe, whereas routing helps to enforce the liveness of the system, where you're making sure the system unblocks and makes progress. So system development basically can be reduced to this: we separate mechanism, which does the grunt work, from policy, and the policy in our infrastructure is divided into routing and blocking. And separating the policy gives us a very clean structured approach to designing our system. And it gives you a separation of concerns. And because they're so different, we can actually have special programmatic abstractions for each one of them so we can make implementation easier. Yes? >>: So this approach seems very elegant, very clean. I'm wondering if you're losing anything in the translation.
So you say, oh, in 100 lines of code or less, we can use this framework and get all these different semantics for the file systems. >> Nalini Belaramani: Yes. >>: I'm wondering, is it really that easy, or are there certain aspects of the file system that you're not exactly emulating? I'm just wondering if at some point in the talk you'll say something like, okay, well, I was able to do, say, 50 lines of code, but there was some weird feature that I couldn't really implement in my framework because as it turns out, there's some -- >> Nalini Belaramani: Well, we haven't actually come across that feature yet. It would be great to come across a feature like that so we could actually make changes and figure out where this fails. >>: So in other words you're saying for all of the 12 systems you looked at, you were able to completely emulate the semantics of that system? >> Nalini Belaramani: Yes. So I'm going to talk about our policy architecture, which actually takes these approaches and realizes them and gives you a clean API over which you can easily develop policy. So the first part of the system is, of course -- so we have a layer of mechanisms and the policy API, that's what it provides you with. And what programmers or system designers do is specify routing policy and blocking policy over this API. >>: I have a question here. So you said you have [unintelligible] qualified systems. >> Nalini Belaramani: Um-hmm. >>: And you've met all the specifications, is what you're saying? >> Nalini Belaramani: Yes. >>: How do you know that? Because most people say, you know, I can't figure out. [laughter] >> Nalini Belaramani: I understand. >>: [unintelligible]. >> Nalini Belaramani: So we tried to look at the basic techniques. They have callbacks, leases. We try to stick to what they did. I mean, you have to study each of these systems to figure out exactly what they're doing and how to map it into our world.
We may not have done the smaller optimizations which they've used, but from what we evaluated, I mean, the consistency semantics they guarantee, how they propagate updates among their nodes and the availability of data, we've matched them as much as we can. I can go into the details of the implementations later for one of them if you want offline. >>: One quick question. How about replication policies, things like how many replicas are stored, where you store them. Is this part of routing? >> Nalini Belaramani: That's part of routing. We give you a general framework. If you want five replicas, you store that. If you want ten replicas, you specify it in routing. So the point is how do we actually specify routing. How can we make it easy for you to say I want ten replicas or I want three replicas? >>: Let me just follow up on questions. There's really a verification issue here. >> Nalini Belaramani: Yes. >>: How do you validate your implementation against these other file systems to be sure -- it might look the same from 10,000 feet, but did you actually get it right? And the prioritization issue aside, possibly some others, how do you know? When you write simulators, are you off in la-la land, or are you spot on? It's really hard to know. >> Nalini Belaramani: So we tried to do a comparison, sort of, of their model and our model. But how do you exactly know? We could probably only tell that if we model check their implementation and our implementation. I don't know any implementation which has been model checked. Our implementation, we're working on -- one of the next things to do is actually model check the implementation, being able to -- >>: I mean, were you able to reproduce results in some papers? >> Nalini Belaramani: We were. >>: Similar, error bounds? >> Nalini Belaramani: We saw -- >>: I'm just -- >> Nalini Belaramani: We were able to get the general trends of their papers.
We redid some of the experiments, and we can't match their exact numbers, but the trends we saw, the overheads, were in the same ballpark. >>: Okay, great, thank you. >> Nalini Belaramani: All right. So for the layer of mechanisms, we actually used PRACTI -- oh, I forgot to mention that, sorry, for this part, my work basically focused on the policy API and how to specify routing and policy. And the mechanisms were joint work with a couple of colleagues. So for the mechanisms, what you need is a flexible mechanism layer which can help you build all the various systems. And so we used PRACTI for that. PRACTI, you know, exposes new design points in the middle because it supports a three, you know, supports a lot of the design space. And it takes care of all the bookkeeping, the storage, local read/write operations for you and so you don't really need to worry about that. For this talk, we're not going to go into the details of PRACTI, because it's very technical and could take too much time. But one thing you need to know is PRACTI gives us a great information exchange primitive, which is called subscriptions. What a subscription does is allow one node to exchange updates with another node, and you can choose what subset of data you're interested in, and whether you want to transfer the whole updates or you just want notifications. So in the PRACTI world, they call them bodies and invalidations. The thing to note is when you actually have a subscription established between two nodes, any update which the source node sees, it will forward to the destination node. So it's a way of getting notifications. So the next part of the talk is policy. So using these mechanisms and these basic primitives PRACTI gives us, how can we specify policy? So as I mentioned, routing is just making sure data propagates.
And what that boils down to is you're establishing subscriptions between different nodes, according to your system design. Now, the key thing for us to do next is to have a clean and simple API which makes it easier to set up subscriptions, and the second thing is we've actually used a domain-specific language to give you a concise routing policy. Yes? >>: Is the development of the policy engine in tandem with PRACTI, or was PRACTI ever modified in the process of developing the policy language specification? >>: So was it that you had these mechanisms and found them suitable to implement all your policies, or as you're implementing policies, did it require changes to the mechanisms? >> Nalini Belaramani: For some of the systems, we required changes to the policy -- I mean to the mechanisms. We started with one or two systems. And we realized there were some parts of PRACTI which weren't sufficient, and we added those mechanisms. As we went along, towards the last few systems we implemented, the PRACTI mechanisms didn't need any changes. >>: How large is PRACTI? >> Nalini Belaramani: 30,000 lines of code. All right. So the clean and simple API we have is actually just nine upcalls from the mechanisms and seven downcalls. Now, what you're trying to do for routing is basically set up subscriptions based on the information you get from the mechanisms. For example, if a local read was blocked, or if a connection was established, or if I got a certain message, and based on that, you tell the mechanisms to establish subscriptions or remove them or just send a single update. And with this set of, like, 20 methods, we were able to implement all the various systems. Yes? >>: [inaudible]. >> Nalini Belaramani: Sorry, come again? >>: Did you [inaudible] weakly connected operation in Coda? >> Nalini Belaramani: Yes. Yes, we did.
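As a rough illustration of the subscription primitive described above, the following Python sketch models a node that forwards every update it sees to its subscribers, either as a full body or as an invalidation only. It is a toy model under stated assumptions, not PRACTI's actual interface: the class `Node`, the methods `add_subscription`, `see_body` and `see_invalidation`, and the use of a path prefix to name "the subset of data you're interested in" are all hypothetical names chosen for this example.

```python
class Node:
    """Toy node; every name here is hypothetical, not the PRACTI API."""

    def __init__(self, name):
        self.name = name
        self.store = {}      # object id -> body held locally
        self.invalid = set() # object ids known to be stale locally
        self.subs = []       # outgoing subscriptions: (dest, prefix, bodies)

    def add_subscription(self, dest, prefix, bodies):
        """Forward future updates under `prefix` to `dest`.

        bodies=True sends the full update; bodies=False sends only an
        invalidation (a notification that the object changed).
        """
        self.subs.append((dest, prefix, bodies))

    def write(self, obj_id, body):
        """A local write is just one more update this node sees."""
        self.see_body(obj_id, body)

    def see_body(self, obj_id, body):
        if self.store.get(obj_id) == body and obj_id not in self.invalid:
            return  # nothing new; this also stops forwarding loops
        self.store[obj_id] = body
        self.invalid.discard(obj_id)
        self._forward(obj_id, body)

    def see_invalidation(self, obj_id):
        if obj_id in self.invalid:
            return
        self.invalid.add(obj_id)
        self._forward(obj_id, None)

    def _forward(self, obj_id, body):
        # Any update the source sees goes to every matching subscriber.
        for dest, prefix, bodies in self.subs:
            if obj_id.startswith(prefix):
                if bodies and body is not None:
                    dest.see_body(obj_id, body)
                else:
                    dest.see_invalidation(obj_id)


server, laptop, phone = Node("server"), Node("laptop"), Node("phone")
server.add_subscription(laptop, "/photos/", bodies=True)   # full updates
server.add_subscription(phone, "/photos/", bodies=False)   # notifications only
server.write("/photos/a.jpg", b"v1")
server.write("/docs/x", b"d")  # outside both subscriptions
```

In this sketch the laptop ends up with the photo body, the phone only learns that the photo changed, and neither hears about `/docs/x`, which is the "choose what subset of data" part of the primitive.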
And so to specify routing, it basically boils down to: you get these events, you call the actions. So it's an event based development platform and it's routing. So we realized, why don't we just use a language which was actually meant for routing. So there's a language called Overlog, which was developed by Berkeley, Intel Berkeley, which is used to set up network overlays. And we took that language, and we modified it a little bit to fit our needs, and its main abstractions are tables and rules and events. So whenever an event occurs, it fires a set of rules which might in turn call actions, I mean, call the downcalls, or fire other rules. The advantage of using a network overlay language is you can actually have policy events based on network connections. So say if I detected a peer, I should establish a subscription to it, to get updates. And so it makes it easy to make policy based on network events or just local mechanism events. Just to give you a taste of what this language looks like, let's say we have this simple rule we want to do. A local read is blocked, I establish subscriptions from the server to me. So I can get the data, and I can unblock. Now, this is what Overlog looks like. You should read it from left to right. It takes some getting used to, but eventually it makes it easy to write. So you might get this triggering event, which would tell you that, all right, this operation's blocked. The at sign, @C, is the location. So this just tells you that this is at the client, or it's a placeholder for your local node. It has tables so you can do a table lookup. And you can set up conditions and assignments and, you know, based on that you can generate another resulting event. And in this case, the resulting event ends up being one of our downcalls, and it will call the system to set up the subscription. And that's taken care of by the mechanisms.
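The slide with the Overlog rule is not part of the transcript, so the sketch below restates the same rule in Python rather than guessing at Overlog syntax. It assumes a hypothetical mechanism layer (`Mechanisms`) that raises a `read_blocked` upcall and offers an `add_subscription` downcall, plus a lookup table `SERVER_OF` standing in for the table lookup mentioned above. None of these names come from PADS or PRACTI, and a real read would block rather than return immediately.

```python
class Mechanisms:
    """Toy mechanism layer; all names are hypothetical stand-ins."""

    def __init__(self, node_id, network):
        self.node_id = node_id
        self.network = network   # node id -> Mechanisms, shared by all nodes
        self.store = {}          # object id -> valid local body
        self.rules = {}          # event name -> list of policy rules
        self.subscribers = []    # (destination node id, object id)

    def on(self, event, rule):
        """Register a routing rule to run when `event` fires."""
        self.rules.setdefault(event, []).append(rule)

    def fire(self, event, **args):
        """Upcall: hand an event from the mechanisms to the policy."""
        for rule in self.rules.get(event, []):
            rule(self, **args)

    def add_subscription(self, source_id, obj_id):
        """Downcall: ask `source_id` to send us updates to `obj_id`."""
        self.network[source_id]._subscribe(self.node_id, obj_id)

    def _subscribe(self, dest_id, obj_id):
        self.subscribers.append((dest_id, obj_id))
        if obj_id in self.store:  # catch the new subscriber up
            self.network[dest_id].store[obj_id] = self.store[obj_id]

    def write(self, obj_id, body):
        self.store[obj_id] = body
        for dest_id, sub_obj in self.subscribers:
            if sub_obj == obj_id:
                self.network[dest_id].store[obj_id] = body

    def read(self, obj_id):
        if obj_id not in self.store:                  # blocking condition
            self.fire("read_blocked", obj_id=obj_id)  # routing must fix it
        return self.store.get(obj_id)  # None means still blocked


SERVER_OF = {"client": "server"}  # table: which server each node uses


def subscribe_on_blocked_read(node, obj_id):
    """Rule: a local read is blocked, so subscribe from the server to me."""
    server_id = SERVER_OF.get(node.node_id)
    if server_id is not None:
        node.add_subscription(server_id, obj_id)


network = {}
server = Mechanisms("server", network)
client = Mechanisms("client", network)
network.update({"server": server, "client": client})

client.on("read_blocked", subscribe_on_blocked_read)
server.write("/a", "v1")
first = client.read("/a")   # triggers the rule, then finds the data
server.write("/a", "v2")    # the subscription keeps forwarding
second = client.read("/a")
```

The split mirrors the talk: `read` only decides whether access is safe, and the registered rule is what makes the blocked read eventually succeed.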
So, you know, yes, it takes a little getting used to, but it actually leads to very concise code which follows the pseudocode very closely. So next comes blocking. Now, as we mentioned, you block access to data as long as needed to guarantee safety, and what is the best place to block, other than, you know, you just block before and after you access data so that you make sure that before you do the access or after you do the access, the semantics are guaranteed. So PADS gives us five blocking points which are at the access interface. And the system designer's job is to specify when they can unblock or when they should go through. So for that, we actually give you five predefined -- sorry, four built-in conditions you can use to specify these blocking points. And there's one extensible condition, which is the L message, for if you want to communicate with the routing layer. So for example, going down to the bottom one, when I do a local write and I want to make sure it's at the server, you know, I can block after I've done the write and wait for the routing to take care of detecting that it's reached the server and the server sends an acknowledgment back to me, and that can help me unblock this write, and it will return. Or say if I only want to read valid data, then I just block a read until my local data is valid. And with these conditions, we've actually been able to implement a wide range of consistency semantics. >>: Is the server also built using this language? >> Nalini Belaramani: Yes. >>: Can you specify how metadata flows, or only data? >> Nalini Belaramani: You can specify metadata. So in the subscriptions we actually have, we separated it into metadata and data. So the invalidations are metadata and the updates are data. Sorry, data is updates. >>: Actually metadata files. For example, can you control -- the file is not just data, but also comes with permission bits and things like that.
So does that also [inaudible]? >> Nalini Belaramani: So we're actually limited by our mechanisms. In our mechanisms, data is stored as a single object. So you can always have an object to file mapping, in which you can have one object storing metadata and the other object storing the actual data of the file. So you can set up routing subscriptions based on metadata as well. You can choose what data set you want to propagate through on a subscription. >>: Okay. So where is that mapping done? Like if you wanted to get, let's say, POSIX semantics, sounds like you'd have to set up a subscription for metadata and a subscription for data, is what you're saying. Is that done in this language, or is that done -- >> Nalini Belaramani: What that would entail is you would have a library function which converts [unintelligible] to the object IDs. How it's routed is done in this language. >>: Okay. Have you pulled a transfer all the way through? Maybe that's getting to the end of the talk. I'm wondering how real this is. In other words, were you able to build and complete some [unintelligible] compliant file system that you could do things like run Macon [phonetic] and things like that. >> Nalini Belaramani: We have an FS interface or an object interface and we were able to do reads and writes for some of them. We did an Andrew benchmark on some systems but I don't have the data with me right now. >>: Okay. >> Nalini Belaramani: So basically, you know, if we have sound mechanisms and we use a declarative way of specifying policy, we should be able to specify storage systems very quickly, right? And proof of this is this is just the code of an actual system we implemented. It's TierStore. Since we implemented it on PADS, we call it P-TierStore. I don't expect you to be able to read these, because it's all in Overlog, and it's too small.
But the point is you can have an actual system running on a single page with all the recovery mechanisms, handling failures and, you know, just getting things right. And we've actually deployed it on certain machines and have it run, and it's just a single page, you know. So it makes development easy. It makes it easy to do code review. It makes it easy to change features if you want to, and that's where our power lies. So we go on, how do we evaluate an approach? Yes, I can tell you it makes things easier. I can build one system. Look, I built this in two weeks. But there really is no quantitative way of saying this method is easier than the other method. So we base our evaluation on experience. Instead of building just one system, we built 12. We picked them from different parts of the design space, and I must say that these are not, you know, bug-compatible versions. They're not exactly the same, but we've captured most of the consistency, availability, performance trade-offs they've made. And so you know, let's have a look at what we've done. >>: Can I jump in with another question? >> Nalini Belaramani: Yes. >>: If I'm jumping to the end of your talk, please defer me. >> Nalini Belaramani: Yes. >>: So you've got a very nice meta language here for specifying higher level semantics in your file system. Did you think about going one level up in abstraction? And building in some sense a conflict graph that could encapsulate the space of some file systems. Some of these different file systems are going to have features or primitives or policies that are not compatible. Others you may be able to mix and match. Seems like you could create a graph of the space of legal file systems of which these would be a point and automatically synthesize new file systems and maybe even dynamically adapt to those. >> Nalini Belaramani: That would be a great thing to have, but I don't think we have that.
I also think there might be, as trade-offs change, new technology. So before we would have mostly connected infrastructure and then move on to delay tolerant networks. If you have just preexisting techniques, they may not be sufficient for new environments. So, you know, just generating, if you have a really intelligent way, if we could use the AI team to help us do that, that would be great. But we're not there yet. >>: Have you discussed it internally? >> Nalini Belaramani: No, we haven't. >>: Okay. >> Nalini Belaramani: Because for us, it's just making it -- I guess that's the next step. We have discussed having an adaptive file system which adapts to your workloads and your network, but we haven't discussed having something generate a file system. So the 12 systems we built have been picked from various parts of the design space, and the ones with the asterisks are basically where we took our implementation of an existing system and we modified it for new workloads, and I'll get into a few of them a little later. And each of these systems actually covers a lot of features. We've implemented basic replication techniques including leases, cooperative caching, callbacks, and we've tried to cover as many of the common techniques as we can to show that this framework actually is flexible. Was it easy to build? Because we had a concise domain specific language, we could write each of these systems in less than 100 routing rules. And since consistency was just blocking in which you specify conditions, each of them was done in less than five or six conditions. And -- yes? >>: What are the asterisks? >> Nalini Belaramani: The asterisks are where we took an existing system, Bayou, for example, and we modified it to suit a different environment. I'm going to talk about Bayou and Coda in the next slide, if you just bear with me a minute.
So what the asterisk means is basically were, were the systems easy -- once -- yes? >>: Can you comment why some of them have like 70 rules, some of them have like nine or six. What are the main differences? I'm not familiar with -- to see the [inaudible] range of systems. >> Nalini Belaramani: The range of systems is huge. So let's compare Bayou and Pangaea, for example. Bayou is just nine rules. So the rules are not only for the data routing, specifying how data moves, but they are also the network management rules. So in Bayou, what it does, it keeps track of what peers you have, and it randomly connects to a peer to exchange data. What Pangaea does is it has a notion of gold nodes. It's a completely connected graph. So for each object, you have a set of three gold nodes, which are always primitive. From there you make graphs to other non-gold nodes, or bronze nodes, they call them. If one gold node fails, you reestablish another gold node. So maintaining that graph is the complexity. So once you have that graph, propagating updates is simple. You just set up subscriptions. So the 75 lines come from maintaining that graph. >>: Are all these systems open source? How do you know, like, the original system? >> Nalini Belaramani: The original lines of code? >>: So they are all open source? I'm saying how do you understand, to say, like, this Pangaea -- >> Nalini Belaramani: We looked at the papers and we read their papers, and most of them were built from the papers. >>: I see. >> Nalini Belaramani: Yes. All right. So now coming to some of the systems, like, for example, Bayou, its basic underlying mechanism assumes full replication, because it's a server-to-server exchange. Server-based protocol. Servers talking to other servers.
Now, if I wanted to use that same protocol to sync up with my laptop, it will just send me the whole volume of data instead of just one directory, the subset of data which I care about. So it can't really work in cases of partial replication. So what we did is we added small device support in which, you know, when a small device is exchanging updates with a server, it can just specify the data it cares about. And it fits into a different part of the design space. And that took just a change of four rules. Coda, on the other hand, is a client server system which restricts the communication for clients so that a client can only talk to a server. And it can't take advantage of nearby peers to retrieve data. You know, when we wanted to add cooperative caching to it, we just had to change 13 rules. Sorry, add 13 rules. Basically keeping some network tracking of whether peers are available or not. And when I need data, get it from a peer. And with that, we were able to sort of adapt Coda to support a new workload or -- yes? >>: So when Coda was designed, the designers made some assumptions about how pieces are put together to make the system safe. In particular, one of the reasons you go to the server is to get a lease to make sure you're actually allowed to -- >> Nalini Belaramani: To access. >>: To read that data. How do you know when you add these additional rules that you aren't violating some global constraint? Is it a property of the declarative rule set that you can't make things any less correct by adding rules, or do you have to reason carefully about the whole system every time you [inaudible] rule set? >> Nalini Belaramani: No, actually, the safety is like once you have your blocking policy right, you have the conditions right, no matter what you do to the rules, it will not let you access unsafe data. >>: How do you get your lease? >> Nalini Belaramani: In this case, what we do is we actually separate metadata from data.
So you can get the lease from the server, but the actual big chunk of data, you can get it from the client. Yes? >>: [inaudible] how servers users language on Coda, when the callback rate hits the server, it invalidates all the callbacks to all their clients because you've invaded that right. You've written the Coda server as well, with -- >> Nalani Belaramani: Yes. >>: Are there separate policies distinct to the servers? >> Nalani Belaramani: And we include that in our count. >>: Okay. >> Nalani Belaramani: So we did, actually, a head-to-head comparison of Coda with our implementation. We did different workloads: reads and writes for small objects and large objects. A cold read corresponds to a read when a client does a read and the object's not locally available, and so to get the object it has to contact the server. A hot read is when the object's locally available. A connected write is when you do a write and I'm connected to the server and the server needs to break callbacks with the other clients. And a disconnected write is when I'm disconnected from the server. For most of them, we are pretty close. But for some of them we aren't, and we have some theories, we have some modeling about why we are so bad. But in general, we could say our prototype's performance is decent for prototyping purposes and for testing out new ideas, and if it works for you, you could still use it. But if you need better performance, we may need to fine tune it a little more. >>: Question? >> Nalani Belaramani: Yes. >>: Like [unintelligible] I have a generic [unintelligible] policy, then the general question in my mind is if I write to a system like GFS, right, I really care about performance, I really care about scale, there must be a lot of [unintelligible] for making GFS work. >> Nalani Belaramani: Yes.
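The four benchmark cases defined in the answer above (cold read, hot read, connected write, disconnected write) can be illustrated with a small Python sketch. This is not the PADS prototype or Coda itself; `Server`, `Client`, and the `pending` log of disconnected writes are hypothetical names and simplifications chosen only to make the four cases concrete.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Toy server (hypothetical): stores objects and tracks callbacks."""
    objects: dict = field(default_factory=dict)
    callbacks: dict = field(default_factory=dict)  # object id -> client names

    def fetch(self, client, oid):
        # Remember that this client caches the object.
        self.callbacks.setdefault(oid, set()).add(client.name)
        return self.objects[oid]

    def store(self, writer, oid, value, clients):
        self.objects[oid] = value
        # Break callbacks: drop the cached copy at every other client.
        for name in self.callbacks.get(oid, set()) - {writer.name}:
            clients[name].cache.pop(oid, None)
        self.callbacks[oid] = {writer.name}


@dataclass
class Client:
    """Toy client (hypothetical) with a local cache."""
    name: str
    connected: bool = True
    cache: dict = field(default_factory=dict)
    pending: list = field(default_factory=list)  # assumption: local write log

    def read(self, server, oid):
        if oid in self.cache:
            return "hot", self.cache[oid]      # hot read: served locally
        if not self.connected:
            raise ConnectionError("object not cached and server unreachable")
        value = server.fetch(self, oid)        # cold read: contact the server
        self.cache[oid] = value
        return "cold", value

    def write(self, server, oid, value, clients):
        self.cache[oid] = value
        if self.connected:
            server.store(self, oid, value, clients)  # connected write
            return "connected"
        self.pending.append((oid, value))            # disconnected write
        return "disconnected"
```

In this sketch a connected write by one client empties the other clients' cached copies, so their next read of that object is cold again, which is the cost the connected-write benchmark is described as measuring.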
>>: In your design, do you believe or do you think you can support, saying you can allow people to write a GFS on their system and the performance would be acceptable by [unintelligible]? >> Nalani Belaramani: So -- >>: Are there any fundamental reasons you don't think you'd be there? Just a different purpose? See what I'm saying? >> Nalani Belaramani: I understand. So there is, I would say our prototype has -- so it is a Java-based prototype. We haven't spent the engineering effort to make it completely optimized. And I'm sure if we did, we could make -- if the mechanism layer was very optimized, we could come up to the performance of GFS. But the advantage of actually having this prototype is not the raw performance, but being able to pick the right design for your system. So the fact that you can modify the rules or implement rules easily, you can try out one design and, you know, put the prototype in there, the performance isn't that bad for the meantime and for trying it out. See if that design you had, if that really works or not for your system. And if it doesn't, you can actually tweak the design, refine it until you get to the final design you want, and then you spend your optimizing effort. >>: Once you have the perfect policy -- >> Nalani Belaramani: Policy. >>: And you're happy with the policy, and then if you decide to throw away your beautiful abstractions and just blast down to as close to the bare metal as you could, I think that's -- >> Nalani Belaramani: That would be -- >>: What you're asking, how much of a tax are you paying to maintain your high level APIs. Because in some systems, I would imagine it could be almost zero tax. In other systems, it's actually really significant. >> Nalani Belaramani: So if we were to optimize the mechanisms to the bare metal, the API doesn't put any tax over it. >>: Okay. Tax-free? >> Nalani Belaramani: Um-hmm. Comes free.
>>: You look at ways of actually compiling down your abstractions once you -- you can play around with it, and you find the right way of routing and everything, is there a way to then go to the bare metal, just get rid of your abstraction and all these boundaries of modularity and actually get better performance? >> Nalani Belaramani: So you would -- currently, when you write in R/OverLog, we actually translate it into Java. And the mechanisms are in Java. You would -- I do believe that these boundaries are -- it's just a simple boundary on how you should establish subscriptions. And if that's a class in bare metal, that's what we're calling. So I think that, if you tried to get rid of the boundaries, I don't think there are many boundaries to get rid of in this case. >>: [unintelligible] the performance is because of prototype, because of using Java or because of modularity. >> Nalani Belaramani: I think it's because of the prototype and the Java, not because of the modularity. >>: Yes? Actually, finish your slides. I'll ask my question later. >> Nalani Belaramani: All right. So this basically summarizes my work. We've tried to distill system design into small primitives with which you can build systems, and we have a policy architecture with a clean API where you can test out different policies. And using that policy architecture, we've actually built 12 diverse systems, and that's what my work is. Going to the future directions next, if that's okay with -- >>: Before we leave this, you mentioned that you had to add a couple of mechanisms as you went along. Could you briefly tell us? >> Nalani Belaramani: So one of the mechanisms we had to add was being able to store routing information persistently. So for example, in Pangaea, every object stores the location of the gold nodes.
You know, so when you get a directory object, you know where the gold node of the file is, so that you can contact the gold node. And initially, we thought routing can be completely separate from data. But we realized that that's not the case. So we had to put a link towards it so you could actually store the locations along with the file object. Like basically, when I do a read of an object, it sends -- using routing to read locally stored objects so that I can get the configuration information or location to route data. Yes? >>: One question I have is other mechanisms that you have in the layer that are kind of specifically used only by one or very few systems -- like basically, like one idea I had is of course I could go ahead and implement all these 12 systems, call them mechanisms and then specify a policy language that just, you know -- >> Nalani Belaramani: Just picks one or two, pick one, pick two. >>: The question is how much reuse is there between the systems? >> Nalani Belaramani: There's a lot of reuse between the systems. >>: Framing it another way, are there any mechanisms that are used by say only one of those systems, or a few? >> Nalani Belaramani: I don't think so, actually. Because our basic mechanism is you transfer updates, and that's just one primitive, and that's used by everybody if you boil it down. Yeah, there isn't anything just used by one. And having that one policy defeats our purpose of helping you build a new system. That's what we want to do. >>: There's a fine line here, how much you push down. >> Nalani Belaramani: Everything, yeah. >>: You know, you can always make your policy specification simple if you just push everything down. >> Nalani Belaramani: No, we tried to push down everything we could, but not everything. It's flexible enough to build new systems. >>: [inaudible] how difficult was it to develop the underlying mechanism? You can imagine there being this huge cross-product of policies everyone can specify.
It would be a nightmare if you add a new system that traces a different code path. Was it fairly easy to debug because there was so much sharing amongst the higher level primitives? >> Nalani Belaramani: We actually, the mechanisms were -- they're actually a pretty complicated bunch, so they were implemented separately from the policies. So, in fact, we actually implemented the mechanisms first before we came up with the policy architecture. The story was we had the mechanisms, and we didn't know how to build on them. So it wasn't made with a specific policy in mind. It was -- so debugging the mechanisms was independent of whatever system we wanted to build. So it wasn't -- so any changes we made, again, were just restricted to the mechanism layer. So we -- yeah. >>: Is there anything about security policies, [unintelligible] and encryption? >> Nalani Belaramani: Not right now. That hasn't been the focus of my work. But there's a colleague of mine who is looking into security and malicious nodes. >>: [inaudible] just the basic security policy of most file systems takes up a large chunk of the code, so if you thought about -- >> Nalani Belaramani: Our current implementation doesn't support that. And I can imagine extending that out in maybe another declarative way. I haven't given it much thought, but I can imagine that. Yes? >>: I think [unintelligible] you provide mechanism. If I use your mechanism, I can just play with my policy and attach it. Actually, I was following up with James' question. So basically, that makes the debugging really hard. If I'm a user of your mechanism, I'm coming up with some policies, how can I tell whether it's my policy's bug or the mechanism's bug? And if I have to debug the system, that means I pretty much have to understand your mechanism, which takes away the advantage of this coding abstraction there, not protocol. You see what I'm saying? I just want to hear your comment. How do you look at it? Because you want your systems to be used.
>> Nalani Belaramani: Our system is perfect! Has no bugs! Sorry, no. That is a good question. What we -- well, because a lot of the mechanisms have a lot of reuse, and we've tried to go down all the code paths, and we've tried to make sure that obvious bugs are not there. There may be some subtle bugs which we might have missed which are specific to your system. >>: [inaudible]. >> Nalani Belaramani: Any language. You can say that against an operating system too. You know, it has bugs. But what I can say, we are trying to -- the next step was maybe having a verification or a model checking system for your implementation and also an all fire run time so if there are bugs, you can figure out if it's in the policy or in our code base. >>: This is different -- I guess the real question is who is your targeted user? Who do you want to -- >> Nalani Belaramani: Researchers mostly, I think, for right now, because it's our prototype. But the thing is not necessarily who uses this prototype. It's also about how you think about designing a system, and that's general. You know, as of now if I had to build a system from scratch, I would think about it in terms of policy, in terms of blocking. I wouldn't go down and think about, all right, what am I supposed to send through each other. We offered you a way to approach this problem of building a new system, which I think is general. Yes? >>: [unintelligible] systems you look at different policies. So obvious or not obvious, is there a set of things you came up with, these are the right things to do, and that could help alleviate some of the problems, what is the right way of doing it, are there bugs in there that we were not able to see. Is there some story around that? >> Nalani Belaramani: The story was -- >>: I would have liked to see -- I'm not asking as hard a question as that at the [unintelligible] level. It's great we're developing systems. So which of the three directions of the four do we go with?
Did we find bugs in the existing systems, did we improve the performance of the existing systems, or did we make it fundamentally easy to build new systems? Trying to answer. >> Nalani Belaramani: So the thing is, we don't have the code of the existing systems. What we're doing is implementing our version of the system over the policy. The lessons learned, I can tell you a little, like, anecdote about how the system came about. And how we actually -- I mean, the final API you see is really simple, but that wasn't the case at first. So when we first started, we actually had the mechanism. PRACTI was already ready, and it had 50 or 60 method calls, and you just call those to implement a system. And it was very difficult to build with it. Then we realized that, all right, maybe one part of it is routing. And so we took OverLog and we plugged it in. And we realized that we could do a bunch of stuff, but we couldn't do consistency yet. And then we added consistency, the blocking, to it. And we thought they'd be separate and we realized, wait, that doesn't work. They need to communicate with each other for some cases of consistency. And also, again, for the Pangaea case, we had to add a way for the routing to talk to the local storage. So we started, you know, incrementally adding all these different features and we cut down on the API size. We realized we don't need all this API, all the phony stuff. Let's just cut it down. And eventually, we came to this neat little abstraction. The difficult part in building each of these systems was actually having to look at their, as you mentioned, looking at their systems and no one understands what they're doing and converting them into subscriptions or converting them into our primitives. You had a question? >>: Yeah, since we're telling anecdotes -- >> Nalani Belaramani: Yeah, sorry. >>: So I'm going to make a claim. I'm just curious to see if you believe my claim.
I'm going to claim that in most complex systems out there, production systems, research systems, that kind of, you know, 10% of the code, 10% of the effort is put into implementing basically 90% of the function. And the other 90% of the code is used to deal with 10% of the [unintelligible] case, weird things. Do you believe that statement? >> Nalani Belaramani: For everything? For production systems? >>: Yeah. >> Nalani Belaramani: In general, I believe that too. >>: This work's only the 10% part of building systems? >> Nalani Belaramani: Maybe, yes. But it's an important 10%. It's, you know, the problem is if you don't have that -- if that 10% is not right, your 90% is wasted. >>: Absolutely. >> Nalani Belaramani: So you know -- >>: I really like your answer for what is the target audience. Research. I think this is great for research, absolutely agree with that. >> Nalani Belaramani: Um-hmm. Yes? >>: Can you take a subset of the protocol space to prove that your mechanisms are fully general in a subset? >> Nalani Belaramani: Take a subset of the protocol, you mean for the synchronization protocol and prove that -- >>: I mean for the final system protocol. >> Nalani Belaramani: What you may call the policy space. Okay. >>: I wouldn't ask for the whole possible space. [unintelligible] the space in such a way to say I'm only going to allow two copies of any file, I'm only going to allow blocking in these cases. I'm only going to restrict my topology to fully connected networks. >> Nalani Belaramani: Okay. >>: And could you then prove that your mechanisms are [unintelligible] you could express all possible protocols in that space? >> Nalani Belaramani: You can -- >>: Because you're talking about [inaudible] checking and you're making claims about generality. So if you start going that direction, it seems like it may be possible. >> Nalani Belaramani: I think so, yes. I think you can -- >>: Have you tried to do that? >> Nalani Belaramani: No, we haven't. Yes?
>>: So you researched this huge design space of distributed file systems, consistency and safety policies. After doing this work, did you suddenly realize, look, there's this great part of the giant space that we're missing and now that I have this policy I can implement that? Do you have thoughts on the coverage of the design space by these existing [unintelligible] systems you've implemented? Or do you feel like I'm finished, we're done. >> Nalani Belaramani: No. >>: Now that you've examined this so deeply, what are your thoughts about this space and how well it's covered? And have you started thinking about where you could go -- >> Nalani Belaramani: Where to go next? >>: Yes, what are your future directions? [laughter] >> Nalani Belaramani: That's a perfect lead up to my future directions. >>: He's a plant. >> Nalani Belaramani: You knew what I was going to say. >>: A deeper question. Future research ideas. >> Nalani Belaramani: So I do think that in this whole scenario, I mean, the systems, we're not quite done yet. There is, I think, now a need for adapting. So there are certain points that have certain fixed consistency semantics or a fixed way of propagation. I think the next part is being able to adapt to different workloads or scenarios, like in real time. So for exam -- you know, that part, having adaptive policy, has not been looked at yet. And that is where my future directions come in. I want people to actually be able to access their data from anywhere, you know, on any device, without having the data get lost. It's surprising, like, even now, there are some scenarios which I would love to have, which we don't. For example, if I'm walking down the street and I see this flower, I love it, and I take my phone and I take a picture, and I want that data to be automatically transferred to maybe my laptop at home, uploaded to my Flickr, backed up somewhere so, you know, it doesn't get lost.
And we have bits and pieces of it, but there's no way to do it right now. Or no easy way to do it. Or also, for example, if I'm traveling in a developing country and I'm in a youth hostel and I write a blog and I don't have an Internet connection. And I meet another person in the hostel who is going home, being able to just say all right, can I transfer some of my data to your machine, and when you have Internet access, it will automatically be uploaded to my servers or maybe sent to my home as well, where it will be backed up? You know, having this seamless environment of being able to access your own data. It is a big vision, which I would want to achieve, but maybe I'll try to make smaller steps towards it. And the first thing is having an adaptive policy, whether it can detect -- adapting to network environments, adapting to maybe my mobile energy requirements, if I'm synchronizing with a cloud, adapting to the cost models, and having a simple, seamless interface between all my devices and the cloud. It also brings some networking issues and some security and sharing issues which I'd like to explore. Yep, and that's my talk, and I guess we can continue with the questions and anecdotes. >>: More questions? >>: I have a question. >> Nalani Belaramani: Yes? >>: Did you find that there were any down sides to using a declarative language? >> Nalani Belaramani: It was difficult to learn in the beginning. Being able to think, like you read from right to left, and they were being -- that was difficult, being able to -- it doesn't have -- what I like about imperative is you have method calls so you know this chunk of code is doing this thing. Whereas in declarative, it's all rules, and you don't know -- when you have a bunch of code, when one event is fired, a rule may be fired up there or up there, depending upon how you've written your code. So just iterating through how things actually run was a little difficult, initially.
And making sure, trying to think about which rules should be run atomically, like these rules should be fired at the same time before these rules are -- anything else is done. But that's inherent in a declarative language. But over time, it gets easier, and eventually, once we built two or three systems, it was just much easier to build in declarative than imperative, because we could just copy and paste the rules, and we knew they would work. >>: Following up on that, do you think that the debugging effort per line of code for declarative code is higher, lower or about the same as the debugging effort for a line of imperative code, now that you're familiar with it? >> Nalani Belaramani: Per line of -- per rule does a lot of things. >>: You're saying, look, we've only got a few rules here, this is great. On the other hand, if it takes ten times as much effort to debug each one, it's not as much of a savings as it seems. >> Nalani Belaramani: I think if you take our whole program, like a whole -- instead of comparing per line, I'd compare maybe two things which are doing the same thing. That might be, because it's so much more concise. I think it's the same or maybe perhaps less than imperative, once you're familiar with it. >>: So more specifically, one of the -- with your response about the, you know, the problems with making things run atomically, imperative languages have kind of an easy answer for that. It's just, you just put those function calls into the containing function. And so I'd be curious what advantage you saw from having this layer of indirection by going and modifying data and having the change to the data trigger events. Why was that a kind of a fundamentally helpful primitive here, as opposed to then -- or as opposed to just making [inaudible] within methods? >> Nalani Belaramani: Making -- >>: Like nested function calls. >> Nalani Belaramani: Nested function calls. >>: [inaudible] functions or functions.
>> Nalani Belaramani: So the atomic one was just, as a side note, it was a feature lacking in OverLog which we added in R/OverLog. That's why we changed that. But we also have a Java interface for this, which we could program against, so, you know, you're writing in an imperative language. It was easier to do declarative because we could actually see what's happening in the whole system in a bird's eye view instead of looking through each piece of code. And so when I -- you know, one of our main concerns was, is our system, what we're implementing, is that really close to the original system? And having a whole system, so when I implemented, let's say, TierStore and I went to Mike Dahlin and I told him this is what I think TierStore should look like. And he said wait, why is it this way, why is it that way. It's just easy to review your whole code, your whole system, and make sure it's following what your design is. >>: So it's conciseness? >> Nalani Belaramani: Yes. >>: What is the language feature in an imperative language that gets in the way of having a similar kind of conciseness? >> Nalani Belaramani: It just takes too long. The number of lines of code it takes. Because similar code, if I was to implement this in an imperative language, it might take me a thousand or 2,000 lines. Maybe a thousand lines of code. >>: But wouldn't your [inaudible] rise, though? You have some functions at the top level, it could be pretty simple. Sorry. I have an allergy to declarative languages. I'm wondering -- [laughter] >> Nalani Belaramani: I actually -- >>: If I pose that I have this severe allergy and I wanted to take your principle of policies and mechanisms, could I do that in an imperative language? >> Nalani Belaramani: Um-hmm, you could. >>: Or you could get shots.
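As a rough illustration of the answer above, that the policy and mechanism split could also be written imperatively, here is a minimal Python sketch. It is not the PADS API or its Java interface: `Store`, `is_valid`, `fetch_from`, `no_route`, and `read` are all hypothetical names, and raising `Blocked` stands in for a real system blocking the read until the condition holds.

```python
class Blocked(Exception):
    """Raised when the blocking policy refuses an access (stand-in for waiting)."""


class Store:
    """Mechanism layer (toy): object bodies plus a per-object validity flag."""

    def __init__(self):
        self.bodies = {}
        self.valid = {}

    def apply_update(self, oid, body):
        self.bodies[oid] = body
        self.valid[oid] = True

    def invalidate(self, oid):
        self.valid[oid] = False


def is_valid(store, oid):
    """Blocking policy: a read may proceed only on a valid local copy."""
    return store.valid.get(oid, False)


def fetch_from(source):
    """Routing policy factory: pull the object from `source` if it has it."""
    def route(store, oid):
        if oid in source:
            store.apply_update(oid, source[oid])
    return route


def no_route(store, oid):
    """Routing policy that never fetches anything."""


def read(store, oid, blocking_predicate, route):
    """Let routing try to bring data in, then enforce the blocking policy."""
    if not blocking_predicate(store, oid):
        route(store, oid)
    if not blocking_predicate(store, oid):
        raise Blocked(oid)
    return store.bodies[oid]
```

Swapping `fetch_from(...)` for `no_route` changes whether data arrives, but `read` still never returns an invalid copy, which mirrors the earlier claim in the talk that once the blocking conditions are right, changing the routing rules does not let you access unsafe data.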
[laughter] >>: So with respect to your future work, you kind of envision this world where everything is being shared on everything and -- >> Nalani Belaramani: It's not being shared on everything. >>: Some things are being shared on some things. [laughter] I guess -- >> Nalani Belaramani: That's what's happening now. I mean, as a user, I would want to have control where my data is going. >>: Sure. Suffice to say the degree of share-itude, if I may call it, is higher in this world. >> Nalani Belaramani: Um-hmm. >>: So my question is what is the pricing model for that. In your example, the youth hostel, you would meet at a youth hostel. I'm like, yeah, I'd love to upload the picture of the flower you took on my phone. Am I paying for that? If you sent me an eight meg picture of a flower and I'm being charged by the bit, all of a sudden, maybe this is a bad hostel experience. [laughter] >>: So I'm just wondering, seems like there's a lot of, so on the research level, it's kind of cool to have some things being shared with some things, but how do you -- you know, what's the user's impetus to opt into this potentially very costly sharing arrangement? Have you thought about that at all? >> Nalani Belaramani: So an advantage could be I could give you some money in return or a promise to give you dinner. I haven't really -- the cost model is highly subjective. I mean, in some sense, you don't know. It depends on your network. Depends where you are. It could be that your Internet is free, and even so, sharing doesn't happen automatically. You have to allow it. So I'm hoping that if you know you have to pay for your bits, you would ask me for dinner to help you share your photo. >>: Okay. >> Nalani Belaramani: Yeah. >>: So I really like your talk. You did a great job of selling this approach. I'm saying that partially to apologize for the next question. [laughter] >> Nalani Belaramani: Okay. >>: Could you go back to your future work? This is really an unfair question.
>> Nalani Belaramani: All right. >>: So, you know, Mike Lorenzo started this project called [inaudible] a bunch of years ago, where they were looking at trying to make your, all of your management of your personal devices as data zero overhead. >> Nalani Belaramani: Yeah. >>: And the project really didn't go very far. So I'm wondering, you know, with your vision, what has changed or why will it work this time, or why did that project seem to be too hard, or maybe it wasn't ready or the infrastructure wasn't ready? >> Nalani Belaramani: I think that's what I want to say. We have PADS now and it solves everything. No. The thing is we actually do have an infrastructure now. It separates your implementation from policy, and now all you need to focus on is how to get the policy right. And having to do adaptation at the policy level without having to worry about the mechanism. And so -- >>: So you think that it wasn't possible to implement that vision before, because you'd have to write too much code, because you didn't have PADS? >> Nalani Belaramani: Yes. >>: That's pretty bold. [laughter] >>: That's what we like in our candidate. >>: I didn't say that was a bad thing. >>: All right, thank you. >> Nalani Belaramani: Yes? >>: So what is the main [unintelligible] of looking at those systems? Can I go with a nugget saying we were doing a bad job initially and we're doing a better job now? You look at the [unintelligible] goal of delivering good performance, [unintelligible] system design and letting people use it. What should I take away after -- >> Nalani Belaramani: So the take away from this is everything can be separated into routing and blocking. >>: I agree. >> Nalani Belaramani: That's it. Sorry? >>: What does it buy, apart from [unintelligible] code? Does it improve? Does it let me build new systems? Does it improve the performance of existing systems because now I have more time on my hands to spend?
>> Nalani Belaramani: What you can do, it actually does, I'm hoping with this experience it proves to you that building new systems would be a piece of cake in blocking and routing. >>: Can you say that we were doing a bad job earlier? Are there examples to show that? >> Nalani Belaramani: The hundreds of thousands of lines of code which we spent in our previous years for the 20 systems were a bad idea. Sorry? >>: Do they work? Are they broken, which you found out while -- >> Nalani Belaramani: The systems we built, we actually did testing on them in which we did all the failure models. If it's a client server, we failed -- we got the client -- the server died and the client could recover. So they work. It doesn't not buy you anything. They're fully functioning systems. And so what it buys you, instead of spending a thousand lines of code or years of effort, you do it in a hundred lines of code. >>: So maybe I can ask a question that you're too polite to ask, which is if you build this framework, right, I'm always really scared when I read a paper where authors construct this great model that allows you to do research and then they don't use it to do anything, and then they toss it over the wall and hope that somebody will pick it up and run with it, because no one understands the model better than they do. And if it's really a productivity enhancer, then they could get productivity. And so, you know -- >> Nalani Belaramani: Have you -- >>: Have like the access personal data anywhere paper. [laughter] >> Nalani Belaramani: Yes. >>: Your talk is a slam dunk if you say, and by the way, all of these protocols are benighted and we've created this uber protocol. So I guess the question is, why didn't you take it to that next step and create a better policy since it's so easy to do in your environment? You didn't have time? >>: They did. The three with the asterisks were new policies. >>: Let her answer the question. >>: Sorry.
>>: I mean, were they fundamental improvements? Were they tweaks? I mean, that's -- >> Nalani Belaramani: Well, if they improve your performance in certain scenarios, I mean in scenarios in which you need performance improvement, I would consider them not tweaks but fundamental improvements. They help you go into a different area of the design space. Yeah. >>: Okay. So with your asterisks, how much did you improve those systems? >> Nalani Belaramani: So for example, for Bayou, if you were just interested in 10% of the data that's stored on the server, you only get 10% of -- your network bandwidth is reduced by 90%. In Coda, if your latency to your peer is ten times -- I mean, if connecting to your peer is ten times faster than connecting to the server, you get your data ten times faster. Your read latency's improved. >>: Great. Thank everyone for helping answer the question. >> Nalani Belaramani: Yes, thank you. >>: We should probably wrap at this point. Thank the speaker. >> Nalani Belaramani: Thank you.
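The Bayou figure quoted in the last answer (a device interested in 10% of the data sees its network traffic drop by 90%) is simple arithmetic, sketched below in Python. The function `bytes_sent`, the path-prefix form of the interest set, and the example volume are assumptions made for illustration; they are not taken from Bayou or from the PADS prototype.

```python
def bytes_sent(volume, interest=None):
    """Bytes a server sends when syncing `volume` to a device.

    volume:   dict mapping object path -> size in bytes (hypothetical layout)
    interest: path prefix the device cares about, or None for full replication
    """
    return sum(
        size
        for path, size in volume.items()
        if interest is None or path.startswith(interest)
    )


# Ten equally sized directories; the device only cares about one of them.
volume = {f"/d{i}/file": 100 for i in range(10)}

full = bytes_sent(volume)              # full replication: whole volume
partial = bytes_sent(volume, "/d0/")   # partial replication: one directory
saving_percent = 100 * (full - partial) // full
```

With the data spread evenly, subscribing to one directory out of ten transfers a tenth of the bytes, so `saving_percent` comes out to 90. Real savings would depend on how object sizes are distributed across the interest set.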