>> K. Shriraghav: It's a great pleasure to have Sharad Mehrotra here from UC Irvine. He's here the whole of this week, so if people want to meet him one-on-one, we have time for that. Cloud security is a popular topic, but Sharad is ahead of the curve: he thought about this problem ten years back, he won an award at SIGMOD, and he's here today to tell us more about his recent work on security. >> Sharad Mehrotra: All right. First things first, let me lower this down so I can see. All right. What I thought I'd do is give you a sense of where we are going in terms of the project and the research in this direction. But before I do that, a small aside -- I have over 90 slides or so if I really want to go far, but we'll do a very small subset of them. I don't know if everybody here knows UCI that well or not, but in any case, here's a recent picture of our group, which is the IS group. Dave is intimately familiar with this group in the sense that he visits us every year. He's one of our constant visitors; there should be a badge for him by now, or a room especially for him. Okay. There are three interesting things about this picture. One of them is that it's about five years old. If you come to UCI now, the faculty are the same but the students are pretty much different. There are two other interesting aspects -- see if you can identify them. One person is Photoshopped in. He really wanted to be in the picture but couldn't be there at the time, so he was Photoshopped in. You can tell from the size of the head; if you look carefully, it's not proportionally correct. The other interesting part of the picture is that the guy in whose lab Photoshop was built is also in the picture. Who is he? When Ramesh was at Michigan, Photoshop came out as a product of his lab. So that's three interesting things. All right. These are the faculty we have in the IS group. Mike Carey -- everybody here knows him, more or less.
Ramesh works in multimedia. I work in data quality, privacy, security, and lots of general-purpose database topics. Chen Li has done a lot of work in integration and Web search; he started a company in the middle of all of this and is finally back again. Nalini works in distributed computing and middleware technology. And Dmitri works very closely with me, especially on databases and data quality. That's the group. Before I start on today's talk, I wanted to give you a sense of what we are doing and how this group comes together. This slide took a long time to make -- that's why it shows up -- and there's a lot jumbled around here. But that's the big data stack as we see it at this stage, and most of the work that we do, as different faculty, can be cast essentially into this framework. So, for example, AsterixDB, which many of you have heard of, is Mike's project. He likes to say one size doesn't fit all, and what he's hoping to achieve with this is that at least one size fits most: coming up with the next generation of big data framework, which is AsterixDB. The other project is Sherlock, which I won't talk about today. This is one of my projects where we're trying to look at data quality challenges in the context of big data. We just started to launch this new effort in this direction, and I'm hoping to meet some of the people in his group who might be interested in that topic as well. All right. The next large project, which is a collaboration between Nalini and myself, is on adaptive data acquisition. Traditionally, if you look at standard databases, data acquisition is something we normally do not worry too much about.
We assume the data comes in, and then we work on query processing, optimization, and all that stuff; we do a wonderful job after all the data is in. What this project is trying to ask is: can you make acquisition one of the central concerns inside the database itself, and what would change if you treated data acquisition as a fundamental component of data management? In particular, if your data is spread across diverse sources -- sensors, Web sources, whatever it might be -- and you have bandwidth constraints such that you can only access some parts of the data, how do you go about doing that appropriately? That's the larger picture of what that project is trying to do. I'll talk a lot about Radicle, the cloud security project, which is what I want to talk about today, so I'll skip past it here. And there are a lot of vertical efforts in this data stack picture which are looking specifically at things like IoT -- there's a bunch of us working on designing systems for Internet of Things kinds of applications. There's also a lot of interest, especially in Ramesh's case, in social media: the full pipeline of storage, modeling, and representation, all the way to social media applications. There's a lot of effort in that direction as well. Okay. So that hopefully gives you a sense of the things happening in the group. The project I'm going to talk about today is the Radicle project. The idea here is that we are exploring data processing frameworks meant for the cloud environment that exploit the concept of hybrid cloud, which I'll explain a little more about as well. In particular, they try to exploit partitioned computation as well as encryption technologies to provide you with a secure data processing environment.
The fundamental idea is to play off against each other things such as generality of applications, the risks to confidentiality associated with data processing, and the usability of the system. Hopefully by the time we are done with the talk, we'll see snippets of examples of how these can be traded off in different settings. Okay. Back to the introduction of what the project is about. If you look at cloud computing, the public cloud has more or less emerged as the new home for data. This slide is more about personal data: if I think of ourselves as end users, we use Gmail, Google Docs, calendar applications, whatever it is -- mostly everything is on the cloud now, which is where the data resides. The more interesting slide is the enterprise perspective, because for personal use the cloud obviously makes a lot of sense. The interesting thing to me was this survey by Forrester, done in 2012 -- results from 2012, projected onward from there, and I think it has more or less come to pass. What Forrester did is ask a large number of IT managers and decision makers -- about 2,200 such people in major companies -- about their plans for cloud adoption, looking at infrastructure as a service versus platform as a service versus software as a service. If you look at it, the expected adoption for software as a service is around 60 percent in 2014. Even infrastructure as a service, the least adopted of the three, is as high as around 40 percent or so. So enterprises are genuinely interested in actually using the cloud, which to me was an interesting thing: even for an enterprise it makes a lot of sense.
For well-established enterprises it makes a lot of sense. If you compare U.S. companies to Asia-Pacific, the only difference as far as Forrester was concerned was that Asia-Pacific was about 12 months behind: the projected figures were similar, just on a track around 12 months behind. Actually, I was recently in China, and what I'm told by those guys is that this has changed -- they're no longer behind; they're pretty much at the same pace. Here's another similar slide from another source: the expectation is that very soon, of the money companies invest in their IT infrastructure, a very large portion is going to go into the cloud. Okay. So the question of why public cloud is a valid one, though for this group there's not much reason to emphasize it. But let me go through it for the sake of completeness -- maybe somebody is listening remotely as well. The cloud offers lots of advantages. One of the main ones is the utility model: you pay only for what you use, and at the start of a business you don't have any infrastructure cost at all. That's a big advantage -- the same kind of argument as rent versus buy, or lease versus buy. Leasing is often the easy option; you don't need start-up money. Another major advantage is elasticity: if your demand goes up, you can get more resources -- a potentially limitless set of resources on the other side -- and you can scale down if you so desire. You don't have to manage your own systems at all, which is a very important aspect as well; from an individual person's perspective, it's probably one of the most important advantages of cloud adoption -- I don't have to manage my own resources at all. And then there's cost optimization because of economies of scale.
And hidden inside all these advantages, to some extent, is a subliminal message about what the challenge is. The challenge for cloud adoption is loss of control. Maybe calling it your only worry is an exaggeration, but it is one of the main worries. If you're using the cloud model, your data, your applications, and your computation are now running outside of your control -- basically under the cloud provider's control. There are many factors that lead to this loss of control. The cloud is a shared resource; that's where its advantage comes from. So your applications are now running at the same time as other people's applications, which you might or might not know, and might or might not trust. And even setting that aside, from a confidentiality perspective the cloud environment is more of a target: hackers can go after the cloud the way they'd go after a major bank, because that's where the valuables are. So even though the cloud may have a significant security perimeter around it, the chances of people trying to attack it increase as well. The more dangerous part is insider attacks. There's also the jurisdiction issue: the cloud might sit anywhere, outside the jurisdictional boundaries of where you actually run your business. That's a major issue as well. And finally -- an issue that Snowden made very visible to all of us -- the cloud provider, based on subpoenas, can be pushed or forced into sharing information and data, possibly to the detriment of the data owner. So there's always the danger that the data is not in your control. It's not in your world, not under lock and key; it's with somebody else, and somebody else can do whatever they want with it.
This loss of control has implications for almost every aspect of system design. It affects availability -- there are examples where data stored in the cloud was not available and that caused disruption to businesses. There are examples on the integrity side as well; yesterday we were talking about integrity, and there are examples of loss of integrity leading to problems. The part that we focus on in Radicle is largely security, privacy, and confidentiality, and this is touted as one of the major concerns by most respondents regarding cloud adoption: they're worried about the security and confidentiality of their data. So the question that comes up is: okay, security or confidentiality might be compromised, but whose responsibility is security? Is it the responsibility of the person who owns the data, or the responsibility of the person who is running the cloud, or is it a joint responsibility? To some degree, the answer is visible -- and this is a slide I think I borrowed from one of your presentations. It's written in the policy, in this particular case the advice that AWS gives. The highlighted yellow part clearly states what Amazon is telling you: if you want to do something sensitive with the data, make sure you encrypt it or otherwise figure out security yourself. Cloud providers are not necessarily willing to take on this responsibility of protecting the data, and for good reason, because it's very difficult to protect. Here was a slide which I thought was interesting when I first looked at it. They asked people who are IT administrators and decision makers in companies whether they are aware that security is their own responsibility.
And the answers could be yes or no -- "I don't care" would be a third answer, but I guess they pushed for yes or no. Here's the interesting question: if you haven't seen this result before, how many of the people do you think actually said yes, and how many said no? What would you expect? >>: Most people said yes. >> Sharad Mehrotra: He's absolutely right. I was surprised by this; I would have thought most people would say no to this question. But people have the realization that security is their own responsibility. So they're not counting on the cloud -- they're willing to use it despite the fact that the cloud does not offer security. That's understandable from an individual perspective, but this was from an enterprise perspective: even enterprises are okay with this. So even though there's awareness, today's tools lack the power to enable users to protect their data. There's a need for protection; users realize that; it's a big barrier to adoption of the cloud; and yet there's no technology capable of supporting something like this -- confidentiality for data. And given that this is not the cloud's responsibility, the tools now have to empower the end users to do something like this. So what's the answer? Well, one straightforward answer is encryption: you encrypt the sensitive data before you upload it to the cloud. If you do that, there are at least two models that come to mind right away. The first model is: I encrypt the data, store it in the cloud, and when I need the data, I bring it back to the client, decrypt it appropriately, and do my work -- using the cloud essentially as storage. The other model is: you encrypt the data into the cloud, and whenever you have an application, you try to run it in the cloud itself as well, and then you take the results back and decrypt the results.
Hopefully you're then using the cloud for computational purposes as well. In the first model you're only using the cloud as storage -- limited utility of the cloud -- so we should probably strike it out: I don't just want a secure disk; I want something more than that. So the answer is the second approach, which is what we have all been struggling towards, and at least with this approach you can utilize some of the power of the cloud from a computational perspective. In the last 15 years there's been a significant amount of research on how to do encryption in such a way that you can use it for computation on the cloud side -- in the encrypted domain itself. Before we go into the Radicle solution, which is what we're largely talking about, I'll take a slight detour of five to ten minutes to give you my view of where the work on encrypted computation has been and where it's headed. Okay. You can go to sleep, because some of this is borrowed from your own slides. But I think it's reasonable to set this up before we get to why we're doing what we're doing in the context of Radicle. Okay. To some degree, the first work that started this whole area of computing in the encrypted domain in recent times -- the area has actually been around for a long time, but the work that revived it, with a slightly more modern view of the broad problem -- was the work on searchable encryption by Song and colleagues, which appeared in IEEE S&P 2000. The idea was to be able to store documents on the server side in encrypted form and be able to do keyword searches over these documents. The idea was very powerful. What they did is, for every word in a document, they would generate a random string and hide a trapdoor for that particular word inside the random string.
That would be the representation of the encrypted data: essentially randomized strings with trapdoors hidden inside. When you want to query, you send the trapdoor, and the encryption has the property that you can check whether the random string you've got actually carries the trapdoor you're searching for. If you can do that, then you have an easy way of testing whether or not a word in a document corresponds to the word you're searching for, and you can retrieve documents. That was the first scheme that was built, and it supported keyword searches over encrypted documents. If you think about it, essentially every word has to be checked: if there are N documents and each document contains, let's say, D words, the complexity is N times D trapdoor checks. It's not indexable or efficient. So people in the encryption community went about trying to make it more efficient. The first thing they did was use Bloom filters to get rid of the dependence on document size, making it linear in the number of documents -- but from an indexing perspective that's still not very good. Another set of people -- [indiscernible] and gang -- said maybe we can get help from the client: construct appropriate indexes, in particular inverted lists, and use an oblivious traversal of the lists, so you can actually do better and get sublinear time. But every technique that came out since that starting paper has had strengths and weaknesses associated with it. The other piece of work, which looked at the problem from a SQL perspective, was work I was also a part of, which took this whole concept from keyword retrieval to SQL retrieval. The idea is straightforward. We have relations, and the way we represent a relation is that I identify the fields that are searchable.
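The trapdoor idea just described can be sketched in a few lines. This is a minimal illustration, not the actual Song-Wagner-Perrig construction: the function names are my own, and I use HMAC as a stand-in for their trapdoor primitive. It does show the two properties discussed above -- the server can test a ciphertext cell against a trapdoor without learning the word, and every cell of every document must be tested, giving the N times D cost:

```python
import hmac, hashlib, os

def trapdoor(master_key: bytes, word: str) -> bytes:
    # Per-word search key; this is the "trapdoor" the client hands to the server.
    return hmac.new(master_key, word.encode(), hashlib.sha256).digest()

def encrypt_word(master_key: bytes, word: str) -> tuple[bytes, bytes]:
    # Each word becomes a fresh random string r plus a tag that hides the
    # word's trapdoor inside it. (Real schemes also encrypt the word itself.)
    r = os.urandom(16)
    tag = hmac.new(trapdoor(master_key, word), r, hashlib.sha256).digest()
    return (r, tag)

def matches(td: bytes, cell: tuple[bytes, bytes]) -> bool:
    # Server-side test: does this ciphertext cell carry the given trapdoor?
    r, tag = cell
    return hmac.compare_digest(hmac.new(td, r, hashlib.sha256).digest(), tag)

def search(td: bytes, docs: list[list[tuple[bytes, bytes]]]) -> list[int]:
    # Linear scan: every cell of every document is tested -- O(N * D).
    return [i for i, doc in enumerate(docs) if any(matches(td, c) for c in doc)]
```

A client would encrypt each document word by word with `encrypt_word`, upload the cells, and later issue `search(trapdoor(master, "cloud"), docs)` to get back matching document indices -- without the server ever seeing the word "cloud" in the clear.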
Let's say I want to have queries on age and salary -- those are the searchable fields. For each searchable field I create a cipher index, encrypted appropriately following some encryption technique, and this is stored alongside the actual encrypted data itself. Now when a query comes in, you exploit this cipher index to evaluate as much of the query as you possibly can on the cipher representation. And when you hit the boundary of what the encryption can do, you push the computation back to the client, which does the rest of the query processing on the client side. We did work in that direction. So essentially there are two fundamental ideas here for doing encrypted search. One was to exploit as much cryptography as you possibly can, pushing as much of the work as possible into the encrypted domain. The second was partitioned computation, where you can continue the work on the client side as well. Now, the question is: what can you do in the encrypted domain? Lots of different things. If you have deterministic encryption, for example, you can do point queries quite easily, do joins, and so on and so forth. I'll skip ahead. Each technique we end up discussing will have caveats and weaknesses. For example, with deterministic encryption, if I know the distribution of the data I can pretty much guess the data itself -- it's not fully secure in that sense. Another innovation that came around at the time was OPE, order-preserving encryption. The main concept -- this slide is borrowed from you guys -- is that if in plaintext X is less than Y, then the encryption of X is less than the encryption of Y. You enforce that. How do you do this? There are hundreds of different techniques people have developed for achieving order-preserving encryption.
It enables you to do range searches quite effectively, but the problem is obvious: the first thing the adversary learns is the order between things. In particular, if the adversary knows the domain and its possible values, they pretty much know everything at that stage. My favorite example: suppose I order-preserving-encrypt grades, A, B, C, D, E, F, and that's it, and I know A is more than B, B is more than C, C is more than D. I might as well not encrypt, because the adversary gets everything at that stage: the grades cannot be anything other than A through F, and it gives everything away completely. Now, this is the thing we talked a little bit about yesterday as well: the idea of modular encryption. There are many ways of doing modular encryption. One of the problems with plain OPE is that the highest ciphertext corresponds to the highest value, the second highest to the second highest, and so on -- the starting position of everything is known. It's not just the order; even the starting position is revealed. You can overcome that problem by mapping to a modular domain, using modular arithmetic. One of the ideas was: imagine the original domain is one, two, three, and so forth, and choose any OPE technique to map it to some other ordered domain. Then, in this modular representation, you take, let's say, one, and instead of representing it as OPE of 1, you add an offset -- and this offset is secret. Say the offset is 2. Then one will be represented not as OPE of 1 but as OPE of 3, and so on. When you reach N, you wrap around. Right? The interesting part is that in this scheme, the order is preserved: if you have X and Y, you can test whether X is less than Y or not. But at the same time, the starting position is completely hidden in this parameter J, because unless the adversary knows J, you cannot figure it out. Yes? >>: [indiscernible]. >> Sharad Mehrotra: Say that again?
>>: [indiscernible]. >> Sharad Mehrotra: Because when I do the mapping, if the query is the range 2 to 4, I map 2 to 2 plus 2, which is OP of 4, and the upper end, 4, corresponds to OP of 6, because 4 plus 2 is 6. >>: But the modular [indiscernible] -- >> Sharad Mehrotra: You'll wrap around. Go past the end and you wrap around. >>: [indiscernible]. >> Sharad Mehrotra: So strictly speaking, this is not an order-preserving representation. But there is enough power left even with the wrap-around: a query might start from N minus 2 and go up to 2, for example, and you can handle that. Okay. But again, is this secure? Well, first, the security of this is no better than the security of OPE itself, which has its own set of problems. But what about just the starting position -- is it secure from that perspective? Can the adversary figure out the starting position? The answer is yes and no. From the ciphertext alone it's secure, because the starting position cannot be detected by the adversary from the ciphertext in this technique at all. But on the other hand, if I let the adversary observe queries, then from the query pattern, with a reasonable attack, you can actually figure out what the value of J is going to be. >>: [indiscernible] question. Going back to the grades example: if you grade on a curve and I know you give 15 percent A's -- >> Sharad Mehrotra: Yes -- so forget such statistical attacks for a minute. Let's assume the benign situation where the bins are all equal, no extreme situations. It's a bit more secure than not doing it at all -- let's put it this way. If you forget about statistical attacks of that kind, the modularity prevents the obvious attack on OPE -- highest value here maps to highest value there. That's no longer true, because the highest value depends on where J is. >>: Both are insecure. >> Sharad Mehrotra: Both are insecure, of course, yes.
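The modular-offset scheme just discussed can be sketched concretely. The sketch below is my own toy illustration: the "OPE" here is just a secret, strictly increasing random map (a stand-in for a real order-preserving scheme), and the names are hypothetical. It shows the two points made above -- the secret offset `j` hides the starting position, and a range query whose shifted endpoints wrap past the end of the domain splits into two ciphertext ranges:

```python
import random

DOMAIN = 100  # toy plaintext domain 0..99

def make_ope(seed: int) -> list[int]:
    # Toy stand-in for OPE: a secret, strictly increasing random map
    # from 0..DOMAIN-1 into a larger ciphertext space.
    rng = random.Random(seed)
    return sorted(rng.sample(range(DOMAIN * 100), DOMAIN))

def mope_encrypt(ope: list[int], j: int, x: int) -> int:
    # Modular OPE: shift by the secret offset j, wrap around, then apply OPE.
    return ope[(x + j) % DOMAIN]

def mope_range(ope: list[int], j: int, lo: int, hi: int) -> list[tuple[int, int]]:
    # Translate a plaintext range [lo, hi] into ciphertext range(s).
    # If the shifted range wraps past the end of the domain, it splits in two.
    a, b = (lo + j) % DOMAIN, (hi + j) % DOMAIN
    if a <= b:
        return [(ope[a], ope[b])]
    return [(ope[a], ope[DOMAIN - 1]), (ope[0], ope[b])]
```

With offset 2, the plaintext range 2 to 4 becomes the single ciphertext range from OPE(4) to OPE(6), exactly as in the example above, while a range near the top of the domain comes back as two ciphertext ranges the server must scan.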
>>: Even with order preserving -- if the same plaintext always encrypts to the same ciphertext, why is it any harder than the substitution crypto we all outgrow from age 0 to age 2 as babies? >> Sharad Mehrotra: It is not -- I think the point is well taken: OPE is an insecure technique. Let me offer you the next one, which is something we did; let's see if you buy that one. It's also heuristic, true, but I think it's a bit more secure. The problem is we've not formally proved it; it appears more secure. Let's have a look. So what were we doing? We did not do OPE. What we did in that piece of work was the concept of bucketization, and bucketization very much follows the database principle of histogramming. The idea is that you take a domain -- in this case, the domain is salary -- and break salary into a bunch of buckets. With each of these buckets I associate a deterministic encryption of the bucket ID. So the first bucket is 32 to 50K; I'll have a deterministic encryption of one, or whatever the bucket ID is -- each bucket corresponds to a deterministic encryption of its ID. This is what we store. The advantage of this scheme is the following. Since you've bucketized the domain, you cannot evaluate the query exactly. If I want to answer, say, a range query or a point query on any particular value, I find the corresponding bucket -- since it's deterministically encrypted, I can -- and retrieve the whole bucket. Once you retrieve the whole bucket, it's the client's job to filter the bucket out. Now, you can look at this from the advantage and disadvantage perspective. The positive is that this is very general: you can actually do almost all of SQL, including parts of aggregation, using this simple idea, by storing appropriate counts and so on. Very general.
You can do joins, point queries, range queries, and so on. It is efficient because it's fundamentally indexable -- you pretty much don't have to change the database processing, and you can still use the optimizer. Another advantage is that it offers sliding-scale security: if you want complete security, declare one bucket. That's kind of silly, but it is secure. So there is a sliding scale of security. The negative was the overhead: you have to do post-processing, running part of the query on the private side as well. And depending on how you bucketize, the ciphertext will reveal some information, because there's some value in knowing, hey, these two values are close to each other -- they belong to the same bucket -- and that gives some information away. So there were advantages and disadvantages, but it's tunable. And the key question of security depends on how you form the buckets -- how the buckets are generated. We could do a mathematical analysis of this: the larger the span of a bucket, typically, the more security you have. So for the security metric, a larger span does better -- the larger the bucket, the better -- and for the frequency distribution within a bucket, the more uniform it is, the more information is hidden away; uniformity was an important measure as well. The cost metric was how many false positives you actually generate: given a query, what's the size of the false positive set? And the key point which I thought was interesting here was that this actually provided us with a mechanism for improving security by adding bounded randomization. I'll not talk too much about it, but I'll give you an example. >>: When you're talking about buckets, you said more uniform is better -- is there an issue of the amount of things that are in the bucket, or can you vary bucket size to give you an appearance of uniformity?
>> Sharad Mehrotra: You can vary the bucket size. You want to do that to prevent statistical attacks: in other words, if there's skew in the distribution, the skew can give away the bucket ID. >>: In some sense push it out -- make all the buckets the same size. Then you might hide that. >> Sharad Mehrotra: Yeah, so that's what we do: we make the buckets equal size. The interesting part of this entire scheme to me -- which is still not fully explored; we've not done a good job exploring this appropriately -- is that we could add a certain bounded amount of randomness to the process. There are hundreds of different ways randomness can be added to improve security, I think, but one way which we did try in a paper was the following. If I look at a bucket -- let's say this is bucket one -- I'm going to toss a coin and throw the objects in this bucket either into its own bucket representation or into some randomly identified bucket. This mapping of which buckets the content of a bucket can diffuse into is fixed, and it's a secret. So objects in bucket one may reside in bucket one, or they could actually go to bucket four or bucket three or whatever it is. Essentially you're adding a little bit of randomness to the data itself, diffusing the data into different buckets. So if I finally look at what bucket four looks like, it could have data from anywhere -- not quite anywhere; the answer is secret -- but at least from a large part of the data space. And the same is true for all the different buckets. What this does is increase security for us. Obviously it's not free: now we have to do a broader search. If your query revolves around bucket one, you have to go to all the buckets where the data could reside. Right?
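The bucketization scheme with diffusion can be sketched end to end. This is a toy illustration under my own assumptions: the bucket tag is a keyed hash standing in for deterministic encryption, values are stored in the clear where a real system would encrypt them, the diffusion map sends each bucket to itself plus one secret random alternative, and all names are hypothetical. It shows the flow described above -- the server retrieves every bucket a range could have diffused into, and the client filters out the false positives:

```python
import hashlib, random

BUCKET_WIDTH = 10   # toy domain 0..99, buckets of width 10
NUM_BUCKETS = 10

def bucket_of(value: int) -> int:
    return value // BUCKET_WIDTH

def bucket_tag(key: bytes, b: int) -> str:
    # Deterministic "encryption" of the bucket ID (a keyed hash here).
    return hashlib.sha256(key + bytes([b])).hexdigest()[:8]

def make_diffusion(seed: int) -> dict[int, list[int]]:
    # Secret map: contents of bucket b may be stored under b itself
    # or under one other randomly chosen bucket.
    rng = random.Random(seed)
    return {b: [b, rng.randrange(NUM_BUCKETS)] for b in range(NUM_BUCKETS)}

def store(key: bytes, diffusion: dict, rng: random.Random, value: int):
    # Coin toss: place the value under one of the buckets its home
    # bucket diffuses into. (The value itself would be encrypted in reality.)
    b = rng.choice(diffusion[bucket_of(value)])
    return (bucket_tag(key, b), value)

def range_query(key: bytes, diffusion: dict, lo: int, hi: int) -> set[str]:
    # Server-side retrieval: fetch every bucket the range could have diffused into.
    homes = range(bucket_of(lo), bucket_of(hi) + 1)
    return {bucket_tag(key, b) for h in homes for b in diffusion[h]}

def client_filter(rows, tags, lo, hi):
    # Client-side post-processing: keep retrieved rows, drop false positives.
    return sorted(v for t, v in rows if t in tags and lo <= v <= hi)
```

The more buckets a home bucket is allowed to diffuse into, the larger the retrieved set and the weaker the pruning -- which is exactly the security-versus-false-positive dial the talk describes.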
But on the other hand, it gives me a practical approach to adding more security. In fact, if you completely diffuse each bucket to everywhere, then you're back to square one: there's no prunability left in the buckets at all. But I can control the amount of randomness, and hence the security. This technique of adding randomness in the context of partially secure techniques is an interesting direction which is relatively underexplored, the way I see it. There are not too many examples of this, but it's a wonderful direction to move forward in. All right. Sorry, that took a little more than five minutes, but here is where we are: we wrote a couple of high-level papers identifying the different techniques for searchable encryption and so on. Forget the details; the most interesting aspect is that there is no silver bullet that solves everybody's problems. You can evaluate these different techniques from different perspectives: query generality, the confidentiality you get, whether the technique requires client-side work in addition to the cloud, the efficiency of the technique, and how much it depends on the trustworthiness of the infrastructure. All techniques fall either as points or as sets of points in this particular space. So you have a large number of solutions out there, but none of them is complete; they explore different tradeoffs among generality, security, and efficiency. And this same point is also made in your tutorial -- this slide from your tutorial identifies the same thing. Okay. Now, this has not stopped people from building systems. The problem is so important that even though there's no silver bullet, no exact solution, people are building systems already based on the technology that's out there and available.
There are many examples, including the work going on at Microsoft -- CryptDB and Cipherbase and so on and so forth. And, by the way, besides these I've had a chance to look at the system at SAP and at NIST. The NIST one is an implementation of our work; SAP's is more like an implementation of CryptDB, and Lync is influenced by CryptDB as well. So a lot of work in this area has taken off; people have built systems and explored different options. Okay. The key issue from my perspective is the following. If I look at the modern trend, most of the systems that have been designed essentially offer more security by giving up functionality. So there's a tradeoff between security and functionality; that's where the ballgame has been to a large degree. And there are many challenges that remain even if you look just at encrypted data management -- using encrypted data representations for security. First, obviously, there's no technique that's a silver bullet, no technique with complete security, and that leaves a natural question: if I have two or three different ways to do the same thing, which is better? Which is more secure? And this is not an easy question to answer. You have to go into the depths of somehow modeling the risk. How much information is given away by deterministic encryption versus OPE versus some of the other searchable encryption schemes? How do you measure something like that? That's a key question, and it's open. Again, I mentioned the current framing of functionality versus security. From a system development perspective, I think that's not the interesting question, because functionality is something you cannot compromise on. Either you want it or you don't. If you don't want it, don't buy the system; that's okay. The real question is: given that I want the functionality, what's the tradeoff?
The real tradeoff should be between the efficiency of implementing or realizing a functionality and security. And the third thing is that most current environments are hybrid. We should never ignore the power of private machines -- either secure hardware or the machines you may have on your client side. You don't want to solve the problem as if the data is completely in the cloud, in a completely untrusted environment. There is an availability of trusted infrastructure which you should be able to exploit to your benefit. So the question is: can data and computation be partitioned to exploit whatever secure execution environment you have? >>: The secure environment -- would it be the client side, or the cloud itself? >> Sharad Mehrotra: It could be both. We'll see the differences come up. Okay. All right. So that vision was never fully realized in a proper system. That's one of the problems: we didn't build the system which we should have. Okay. So the Radicle project is all about trying to do the same thing. It is a revisiting of what we did in DAS, essentially, but with an eye on two things. The first is: can we partition to achieve security? The second is: can we at least formalize what the risks are and control them -- bound the amount of risk a particular execution, a particular system, has? It's not meant to replace encryption; it's meant to complement it. If you can make good progress on homomorphic encryption, make it practical, and do some of the specialized encryption stuff, that's great; this is meant to complement it. If you completely solve the homomorphic encryption problem, fine, we'll back off and do nothing -- you've solved it. But I don't think it's solvable in the next few years, 10 years, 50 years to come. So I think we're okay.
>>: Solvable by one -- >> Sharad Mehrotra: Maybe. But [indiscernible]. All right. So in Radicle -- the vision is controlling risks, and controlling them using partitioned computation. We've built many example systems of this kind. One of them is called CloudProtect. The idea there is to build a middleware that sits between your Web browser and your applications; it's meant for providing secure access to applications. So there's Dropbox, there's Gmail, whatever you have on the other side, and there is this browser through which you access that information. It's a proxy-based architecture: it sits between these and the rest of the bad world and selectively encrypts data for you. The encryption plays a balancing game, because when you encrypt the data, some of the functionality you could get in Google, for example, is no longer available to you. Now, if you want that functionality, what do you have to do? You have to get the data back, decrypt it, and implement the functionality yourself. English translation: you can do that, but you have to bring the data back, which interferes with the usability of the system -- the latency of the operation gets long. So the middleware sits quietly on the side, looks at the log of what you're doing, and automatically adjusts what should and should not be encrypted, to strike a good balance between usability and risk of exposure. >>: So what risks? Along other dimensions -- for instance, you could lose your key, in which case you've lost your data. >> Sharad Mehrotra: Sure. In this case the assumption has been that the key is safeguarded on a local machine. The proxy sits somewhere, and it guards your key. You could push the entire proxy to the cloud, but then you never really achieve that protection.
It's very light, so it can run on a mobile machine as well, and it should be able to store all its data in the cloud -- which I think is doable, but we didn't do that. >>: Can you imagine this being, say, part of an ODBC framework, where all database accesses go through the ODBC client side and you do a little encryption there? >>: This is calendar, right? The client has the calendar interface on the database. It doesn't see SQL or anything. >> Sharad Mehrotra: There's no logical -- I agree with that. Our current version's answer would be no. But this is not just calendar: CloudProtect has been used with Google Calendar, Dropbox, with Box, with Picasa, lots of different services. These were built early on; at some stage all of this was working. It's a pretty general architecture, and there's no logical reason why we could not even do data processing and SQL-esque kinds of things with it. >>: On the client, you have the JavaScript, and the proxy needs to implement the same interface as the server. >> Sharad Mehrotra: Yes. This is under the assumption that interaction is through HTTP requests to the proxy. The proxy looks at the HTTP request and modifies it based on the encryption mechanism. The proxy maintains full knowledge of how the data is represented on the client side and on the server side, in some representation. >>: But it needs to understand the semantics of the interface, and that's why the ODBC approach will not work -- because it needs to search, it needs to do some rewriting. >> Sharad Mehrotra: I see your point. In this case it does understand each form which is out there. So it understands the semantics of the form -- what the form contains, for each of the corresponding HTTP requests. It has to understand that. The second system, which I'll also talk about -- we can talk about it separately later -- is Hybridizer. The idea is simple.
I want to run SQL queries on a hybrid infrastructure: part of the machines are here on the private side, part on the cloud side. The goal was, given the workload you've got, figure out how to partition the data and partition the computation. And we made an assumption here: the unit of partitioning is a query itself. A query executes either on the private side or the public side, not both -- that would complicate things. The system tries to figure out the best way to partition the data, and that led to an optimization problem; that's the Hybridizer framework we built. The one I will talk about is SEMROD, which is a secure MapReduce technique. The idea here is -- in a sense all of this was about SQL, and we wanted to go one level lower first, for two reasons. One, we said, okay, I know how to play the game at the SQL level to some degree; let me see if I can build an infrastructure where I don't have to worry about SQL at all. Let me do it at the MapReduce level and figure out how to run MR in a reasonable fashion in the hybrid cloud environment. Then anything higher-level can compile down to MR and it will run. Basically this is the lowest possible level at which you can exploit the secure processing. So I'm going to talk a little bit about SEMROD today. >>: Can I ask about the SQL connection? You could use sensitivity as a way of partitioning the data -- you could have some reasonable partitioning between local and cloud based, perhaps, on sensitivity of the data, and treat the problem as a job of partitioning the query in some way. >> Sharad Mehrotra: So the partitioning is completely based on sensitivity. You know what's sensitive and not sensitive, and you know which query accesses what data; you have the workload available to you. What you're trying to figure out is not just how much sensitive data resides with you on the private side. It's a risk-based model.
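The risk-based query placement just described might be sketched as a greedy heuristic. This is my own simplification, not Hybridizer's actual optimizer; the field names are illustrative:

```python
def partition_workload(queries, risk_budget):
    """Greedy sketch of the Hybridizer idea: each query runs wholly on
    the private or the public side.  Running a query publicly saves cost
    but 'exposes' the sensitive data it touches; total exposure must stay
    within a risk budget."""
    # Benefit of going public, per unit of risk incurred.
    def ratio(q):
        saving = q["private_cost"] - q["public_cost"]
        return saving / max(q["risk"], 1e-9)

    placement = {q["id"]: "private" for q in queries}
    spent = 0.0
    for q in sorted(queries, key=ratio, reverse=True):
        saving = q["private_cost"] - q["public_cost"]
        if saving > 0 and spent + q["risk"] <= risk_budget:
            placement[q["id"]] = "public"
            spent += q["risk"]
    return placement
```

The point of the knob: with `risk_budget = 0`, nothing sensitive ever goes public; raising it lets frequently used sensitive data move to the public side, exactly the tradeoff discussed next.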
If there's a small amount of data that is sensitive but used very often, the system may actually push that sensitive data to the public side, accepting a risk. The system allows you to automatically adjust the amount of risk you are willing to take. Now, it doesn't answer the harder questions, like: if I do OPE or deterministic encryption, how much risk is there? That we don't know. The minute I expose sensitive data to you, I count that as risk. Right now the model of risk is 0 or 1. Whether deterministic encryption is .5 or .6 or .7 -- whether that's even the way to look at it -- is a different issue altogether. >>: You had a notion of partitioning, so can I think of hybrid viewing the partitioning, using risk as a mechanism for the partitioning, as a new way of -- >> Sharad Mehrotra: Yes. In this case the partitioning is at the workload level. You take a whole workload and you first come up with a data partitioning suited to that particular workload. Then, when the actual execution happens and the queries come, you partition them accordingly. Similar -- the only addition being that the risk is factored in as well. So, on to SEMROD, a new piece of work which we completed. Okay. If I give you one quick sense of the hybrid cloud: the private side is secure because it's in your control -- rather, I should say, in your control; it may not necessarily be secure, but it's in your control. The public side, you'll all agree, is efficient, cheap, scalable, elastic -- all the nice properties we get from the public cloud environment. And the hybrid cloud is a seamless integration of both of these: you can run the application on both sides, public and private. And this goal is highlighted in the Beckman report. Maybe it's because we were there and I participated; maybe that's how the line made it in. But anyway, it's now in the Beckman report.
It's an opportunity -- and it is an opportunity -- to achieve secure and efficient computation in a cloud environment. >>: [indiscernible]. >> Sharad Mehrotra: The way they prefer it, I'm not sure in what sense they're -- >>: Funny word. They prefer it. >> Sharad Mehrotra: Right. Prefer it. But I have two things to say about that. >>: I thought maybe you had numbers. >> Sharad Mehrotra: I don't have the numbers, but I have one observation. In the Bay Area, near the major datacenters where this stuff runs, providers are laying out fiber to the extent that a large area has very good access to the public cloud. One of the reasons the companies that do this give is: people will use hybrid clouds, and you'll quickly see that having fast connectivity to the public cloud infrastructure -- while not strictly a requirement; you can do reasonably without it -- will certainly make hybrid clouds feasible in a very big way. I have a feeling that once that happens, people will be using a lot more of this. >>: The hardware may be there. >> Sharad Mehrotra: Yes, I'm not sure how much use of it actually happens. So, if we solve the problem, maybe people will use it, yes. >>: With a hybrid cloud, who ends up responsible when something bad happens? Will there be finger pointing between the private and public people? >> Sharad Mehrotra: Good question. I'm not sure. I think with the hybrid cloud, the responsibility of running it belongs to you. You get virtual machines, you connect those virtual machines in a particular datacenter to your infrastructure, and you're running it yourself as a company. So probably the responsibility will be yours.
Now, if the virtual machines do not give you the SLA they're guaranteed to give, maybe you can go after the public cloud. Okay. So those are the end goals. What we wanted to set out is the following. We wanted security -- and here I'm going to drop risk altogether; it will be zero risk, fully secure. That was our starting point: no leakage about any sensitive information in the public cloud. We wanted to use the public cloud. One easy way of achieving the first goal is to run everything on the private side and you're done; that's not what we want. We want to run on the public cloud itself. We want to limit the burden on the end user: if someone has an MR program, I don't want them to reprogram anything at all. It should run as is, without any changes. We want it to be generic, and MR is generic in the sense that, at least in one version of the world, most things compile down to MR. And, by the way, we're planning to do the same thing with Spark; we're exploring that. So hopefully if you don't use MR Hadoop but use Spark, that's fine too -- in the near future, maybe. Okay. And finally, the main question, which is where the rubber hits the road: you want the cost of security not to be too high; you want to be practical. The question is what that actually means. The first test is: compared to the obvious solution -- run everything on the private side -- this should be significantly faster. Otherwise, no go, completely. And the second test -- I'm not too concerned about it, but it's important to state anyway -- is that it should not be much worse than running without security. Whatever performance I get from an MR job on Hadoop in this hybrid cloud environment without worrying about security, I should not be many factors or orders of magnitude away from that performance. I want to be relatively close to the native Hadoop implementation as well. This is going to be very difficult to meet.
>>: Security -- is it no leakage? >> Sharad Mehrotra: No leakage whatsoever. But I'm not using encryption; we'll just play around with how things are shipped back and forth. You'll see. Call that the efficiency goal. So before we go on -- since we're talking about security, we have to define the attack model. The attack model in this case is very much what we normally use in the cloud setting: honest but curious, a passive attack model, and we assume the adversary does not alter data or results. What can he see? Anything that happens on the public side is visible to the adversary, obviously. Not just that: any interaction between the private and public sides is fully visible to the attacker as well. So the adversary has full knowledge of everything that flows through -- if you're shuffling data, that transmission back and forth is fully visible to him. On the other hand, what you do on the private side, on your own side, is not visible to him. Now, you can attack that point. You can say: depending on timing, maybe he's expecting to see data X transmitted, and whether it took five seconds versus six seconds is visible to the adversary. True. We are not secure against that timing attack so far. Then we have to have a sensitivity model. This is about protecting data that's sensitive -- so what is sensitive data? Sensitive data is data that you do not want leaked. What I'm going to do is put some caveats here -- maybe we'll kill the animation -- and to define what is sensitive, it's probably easier to define what is not sensitive. And you'll quickly see that I've scoped out inference attacks in this kind of environment, which, by the way, is exactly how database systems also work in reality.
So, for example, if you have sensitive data, any data which leads to inference about the sensitive data -- any correlation attack of that kind -- I'm not going to allow. Because if your data can reveal some information about sensitive data, you'd better call that data sensitive as well. So there is a partitioning available to you: this is sensitive and that's not sensitive. In other words, even if the adversary sees all the nonsensitive data, he still can't learn anything about the sensitive data. Privacy works in a different fashion altogether; I'm not concerned with that part. This is pure security of sensitive data. >>: Is this -- >> Sharad Mehrotra: You, as a user, define it. You may specify, for instance, that records where the salary is too high, or where a person has been fired, are sensitive. >>: The challenge is, like the set of movies you've seen [indiscernible] -- you don't even know up front that that's potentially sensitive. >> Sharad Mehrotra: That's what I'm saying: I'm limiting myself to applications more toward the SQL security setting, where you define the sensitivity based on predicates or whatever it is. It's not addressing the inference channel. Because if you go in that direction, there's no end to it; then you're solving a differential privacy problem, which I'm not going into at all. There's a large class of practical cases which fall under this sensitivity model. >>: Sensitivity [indiscernible]. >> Sharad Mehrotra: Not necessarily. It could be file by file. It doesn't really matter. We'll do the example with the MR framework -- what's a record in the MR framework, necessarily? A lot of times there's a record; sometimes there's not. It will not matter. Okay. I'm going to make some further assumptions, and this in some sense limits us further. One assumption I'm going to make: if you compute some function on sensitive data -- if the input is sensitive -- the output is sensitive.
Now, you can argue that this is not correct, and actually it is not: lots of times a function's output is not invertible back to the input. It's a very conservative assumption, but anything less conservative will only help us; it will not harm us. All right. So we're going to do MR with sensitive data. In this case it's record oriented, though it could be column oriented; it doesn't really matter. In this example, let's say the dataset consists of name, disease, treatment dates, and so on. And I, for better or worse, assume that if somebody has cancer, that record is sensitive. Okay. Now imagine running an MR job that does something very simple: it creates, for each person, the list of diseases that person has. So Chris, flu; James, flu; Jane, acne and cancer; and so on. If I look at this, the cancer records -- Jane's and Zach's -- are the sensitive ones; the others are not. The reduce function runs, and the output it generates about Jane and Zach is sensitive, because the input for Jane and Zach is sensitive. >>: Aren't you contradicting yourself? You have a reduce function that has a red [sensitive] input and you show black [nonsensitive] output. >> Sharad Mehrotra: Think of the reduce function as working key by key. Working on this key -- this key is sensitive, that key is not sensitive. Okay. So, first, it's a full MR system. Users specify, using predicates and so on, what files or records are sensitive and what are not. The first question is how data should be distributed in the hybrid cloud. From the HDFS perspective it's straightforward: you look at the data, and there's a master who decides the placement of data.
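The conservative rule above -- any sensitive input record for a key makes that key's entire output sensitive -- can be sketched for the reduce-by-key example. The record schema here is illustrative, not from the actual system:

```python
from collections import defaultdict

def reduce_with_taint(records):
    """Conservative taint rule from the talk: if any input record for a
    key is sensitive, every output for that key is sensitive.  Records
    are (key, value, sensitive) triples."""
    groups = defaultdict(list)
    tainted = defaultdict(bool)
    for key, value, sensitive in records:
        groups[key].append(value)
        tainted[key] |= sensitive
    # Reduce: collect the list of diseases per person, tagging outputs.
    return {key: (values, tainted[key]) for key, values in groups.items()}
```

Note the deliberate over-approximation: Jane's acne value is nonsensitive on its own, but because one of Jane's records is sensitive, everything emitted for the key Jane is treated as sensitive.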
Sensitive data goes on the private side, and nonsensitive data gets shipped off to the public side; that's not a big deal to do. You cannot expose any sensitive data anyway. All right. Now here's the question. Let's say we start running: we've got sensitive data on the private side, nonsensitive on the public side, and we start this MR job. Clearly the mapper that runs over sensitive data is taking sensitive input, so it has to run on the private side; it cannot run on the public side. Anyway, we would not run it on the public side, because you want the mapper to run close to the data. So that's okay. Now let's go down to the reducers. In this case there are two partitions and so two reducers. This reducer, which is dealing with the cancer records -- sensitive records -- where can it run? Can it run on the public side? Of course not: it has sensitive input, so it has to run on the private side. What about the other one? It runs on the top two records, with no records from the sensitive side -- it's running on only nonsensitive data. Correct. Can that one run on the public side? >>: I would say no: the pattern of which reducer runs where leaks information. >> Sharad Mehrotra: Absolutely. In fact, it cannot, because there's a key inference attack possibility. Specifically, if you run this reducer on the public side, the adversary would know, if nothing else, that James does not have cancer -- because if James had cancer, there's no way this reducer would be allowed to run on the public side. So the adversary learns that James does not have cancer. And since James's and Matt's records get sent here, the probability that one of the remaining people has cancer increases; if there's background knowledge that X people in the data have cancer, that probability increases. So it has to run on the public side.
Sorry, on the private side. So effectively what we've said so far is: only the first map operation, on the nonsensitive data, can run on the public side, and everything else has to run on the private side. >>: It occurs to me -- how is even that secure? You're making some assumption about the adversary's knowledge. For example, if the adversary knew there was a record about Zach, and it is not visible on the public side when you are mapping, then he's able to infer something: he knows there's a record on Zach and it did not appear in the mapping on the public side. >> Sharad Mehrotra: Remember the assumption -- I'll cut you short and answer right away. There is a partition into sensitive and nonsensitive, and knowledge of what is nonsensitive does not give away anything about the sensitive information at all. So even if he got all the nonsensitive records, he would never be able to infer anything about Zach. The knowledge that Zach is in the data is itself sensitive information. >>: Is that an assumption about the adversary's knowledge, or is that -- >> Sharad Mehrotra: You can treat it either way, but to me it's the definition of what is sensitive and what's not. I'm not allowing inference attacks, because then I'm screwed; there's nothing we can do at that stage. We'd have to go into differential privacy and so on, which complicates matters completely. If you look at mandatory access control work in databases -- in fact, most of the work in data security -- this is the model of security they have: here are records that are sensitive, here are records that are not. You can expose nonsensitive data anytime you want; sensitive data should not be exposed. And if there's an inference channel between the two things?
All bets are off: you made a mistake, you should not have marked it nonsensitive to begin with. >>: Which is fine. But the second part, where you reason that the reducer has to run on the private side, seems to rely on inference anyway. >> Sharad Mehrotra: That's a different kind of inference. The problem is the predicate I have associated with the sensitivity. The fact that James's record is not here -- that James's record gets reduced here -- gives me the additional information that there's no other James record out there. >>: Could you reduce partially -- one reduction on the public side and then a second reduction on the private side? >> Sharad Mehrotra: There's only -- >>: Two reducers. >> Sharad Mehrotra: You'll see in a minute how we treat the problem. All I'm saying is what happens if I blatantly run MR. >>: Run MR -- >> Sharad Mehrotra: That's all I'm saying. We'll see how we can do better; we'll fix it in a minute. But the point stands: in this current architecture, the only thing I can run on the public side is that first map. That's it. Okay. And in fact there's a paper on this. There's another group working in parallel with us, and they have a paper called Sedic [phonetic], a CCS paper, which does exactly that: run the first map on the public side, and reducing and everything else happens on the private side. And it makes sense: if a job is very map-heavy and a lot of the data is not sensitive, this does make sense, completely. The thing is, we wanted something different, because if you look at database workloads [indiscernible], a lot of them are reduce-heavy. You're not getting any benefit of the public machines for most of the work we end up doing. We wanted to do it a little bit differently. What did we do? We did some key analysis.
So effectively, when you do the mapping over the sensitive data, I'm going to figure out which keys are getting dirty -- which keys are sensitive. In this case, Zach and Jane are the sensitive keys. >>: Put in the standard database query optimization sense: isn't it true that all you need to do is consider plans where data flows only from public to private? >> Sharad Mehrotra: In the previous plan, data also flowed only from public to private -- well, yes, in fact no data was sent from the private to the public side. So it's still a problem. >>: An example of a secure plan. >> Sharad Mehrotra: Not secure -- it's only secure if the reduce is on the private side; it's not secure otherwise. But your intuition is almost there, just not fully. Hold on for a minute and it will be clear, hopefully. Here's the thing: I'll keep track of which keys are sensitive -- these two, in this case. Now it's time for reducing. What I want to emulate is the following. To cut a long story short, I want to make sure that the behavior and information exchange visible to the public side is completely independent of what is sensitive and what's not sensitive. From the observational perspective, the execution should be identical to the execution as if there were no sensitivity whatsoever. If that's the case, it is observationally equivalent and hence secure. >>: If I know that Jane and Zach exist, and I look at the public cloud, don't I -- >> Sharad Mehrotra: That's the same question; that's been scoped out by the assumption we've made. >>: [indiscernible]. >> Sharad Mehrotra: Yeah, that's the case. >>: If you are worried about attacks like that, you mark all the records as sensitive.
>>: Then you can't do anything in the cloud. >>: It's not an easy problem. >>: That's abstracted away. >> Sharad Mehrotra: That's abstracted away. >>: That problem is abstracted away in the definition of sensitive. >> Sharad Mehrotra: Yes. All right. So how will I achieve that? Think again of reducer two, the one Jane's and Matt's records are being shuffled to. I'm going to replicate reducer two's action on both sides -- on both the private and the public side. In particular, the public side will always shuffle its map output both to the public side and to the private side. So reducer two on the public side gets Jane's and Matt's records and generates output: Matt's records generate flu/cold, from whatever records came in, and Jane's produce acne here. And -- let me see if I have the animation -- the same records come here, to the private side, as well. So the private reducer gets Jane's and Matt's records, plus the records from the private side. This private reducer has access to the list of which keys are sensitive. So when it gets Matt's records, it sees that Matt's key is not sensitive, and it drops them completely -- throws them out, since the public side handles those. When it gets Jane's records, Jane's key is sensitive, so it keeps them and combines them with the private records. So most of the reduce work -- if most data is not sensitive -- is actually being done by reducer two on the public side; only the portion for which the key is sensitive is replicated and done by reducer two right here on the private side. >>: You have to ship all the data. That's the whole cost. >> Sharad Mehrotra: Yes -- hold that thought for five minutes more; we'll get there. In fact, that is the fundamental question.
You're absolutely right, but I'll show you that the cost is better than you'd think. So this is what I do. Now, once I have done this, the public reducer will have produced an incorrect answer, which is Jane/acne, and there's the correspondingly right one here on the private side. It's not rocket science to figure out which of these two is clean and which is dirty. So there's a final filtering step which brings these records together, throws the wrong acne record out, and keeps the other one. The logic is a bit more complex than that, but it's doable. Okay. That's fine for one MR job. Now, what happens if there are multiple MR jobs -- a sequence of MR jobs? That gets a bit more complex. The reason is that in the most naive implementation, what would I do? I have the right answer here on the private side. Forget the cost Don was mentioning; let's not worry about that for a second. I would ship the corresponding corrected work from here to the public side. But going back to the intuition: if I ship anything from private to public, I'm screwed -- it will always give away information. If you look at the filtered records, all of that output is sensitive in our model. We cannot ship things back. Which tells me that if I have a multi-stage MR job, the only thing I can do is continue processing on the public side on these wrong records. But if I'm processing wrong records, I somehow have to build in logic so that, further up in the second MR job, the knowledge of what is tainted and what has gone wrong is available to me. And there are lots of design questions about how to build that logic in. The one we ended up choosing, after a few attempts, was very simple. The private reducer, when it looks at the data coming from the public side, not only does the right thing, which it has to do anyway; it also replicates the wrong work done by the previous public-side reducer. It will generate Jane/acne as well. Okay.
So for Jane, this record is the right one, this is the wrong one, and this is also a wrong one. This wrong one will be used to cancel that wrong one. Now this input goes through the next layer of MapReduce jobs. Okay. So I don't know how I'm doing on time; I might be over already. >>: The other -- >> Sharad Mehrotra: All right. So believe me, with the appropriate processing and marking of the keys, you are maintaining this information about what is dirty and what's not dirty. You basically continue the processing on the public side as if everything is okay, and you maintain enough state that when the records are merged -- which is done at the very end -- you will have thrown away the wrong records and kept the right ones. The details: if I skip them I can still answer Don's question, and I will answer his question. So I'll skip this technically more detailed slide on how to maintain that information. Before we answer that question, the first question is security. From the perspective of security, this execution is observationally identical to one in which there was no sensitive data at all. So the adversary doesn't learn anything at all. You can set up the game with standard proofs, and we were able to prove it. That's not a big deal. Okay. If I go back to the design goals: security -- yes, check, because we proved it. Public-side usage -- yes, because our maps and reduces all run on the public side as well; we're able to fully use the public side. Limited burden on the end user: most of the logic we're talking about is implementable within the MR framework itself; the user doesn't do anything at all. The only thing needed is a mechanism for marking what's sensitive and what's not, which you need for a system of this kind anyway. So it can be done without any burden on the user. And it's generic to the MR framework.
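The cancellation trick at the final merge can be sketched as follows; this is an editorial illustration with made-up record values, not the system's code:

```python
def final_merge(public_out, private_clean, private_dirty):
    """Drop every public record that the private side flagged as wrong.

    The dirty replicas are identical to what the public side (wrongly)
    produced for sensitive keys, so they identify exactly which public
    records to cancel; the clean records replace them.
    """
    cancelled = set(private_dirty)
    return [r for r in public_out if r not in cancelled] + private_clean

public_out    = [("Matt", "flu/cold"), ("Jane", "acne")]  # Jane is wrong here
private_clean = [("Jane", "acne+HIV")]                    # the correct record
private_dirty = [("Jane", "acne")]                        # replicated wrong record
print(final_merge(public_out, private_clean, private_dirty))
# → [('Matt', 'flu/cold'), ('Jane', 'acne+HIV')]
```

In a multi-stage pipeline, both the clean and dirty streams flow through the later jobs, and this merge only happens once, at the very end.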
The key question is: is it good? That's the fundamental question -- you're shuffling data from the public to the private machines; is the overhead too high to be practical? Again, there are two tests for us, and one is: compared to what? Compared to all-private, or compared to native MR execution? Let's look at where the overheads are. First, there's the overhead of computing the sensitive key sets. It turns out that's not very high for the number of jobs. We were initially thinking of representing these key sets as Bloom filters and other efficient set structures. It's cool, but it doesn't matter: across all implementations and all the tests, that overhead is never too high. And if it did turn out to be high, we have techniques galore for making the set-membership checking efficient. So that's not a really big overhead. The extra incorrect processing and filtering we have to do -- yes, that's overhead. But again, we are designing this for the setting in which most data is not sensitive and only the sensitive part has to be pruned out. So if that percentage is small, this is not too much overhead either. This one is the killer: you're overshuffling. Now, to be fair, if you think of overshuffling, what am I comparing against? Other solutions like CEDEK [phonetic] have the same shuffle from public to private, and compared to CEDEK we're better. But compared to nonsecure Hadoop, we're shuffling more, and we'll be shuffling over a wide area network, assuming the public and private sides are connected by a slow network. So that's significant overhead. Obviously, if the network is as fast as a LAN, we'll be okay; we'll start competing with Hadoop at that stage. Okay. So let's do a quick analysis. I'm not going to go into a detailed analysis, but more than anything I want to point out the parameters one has to consider. The analysis is fairly easy because, given an MR job, you can figure out what the additional overhead is.
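On the Bloom filter aside: a sensitive-key set can be represented compactly so that membership checks stay cheap. A toy sketch, editorially added; the sizing and hashing scheme here are illustrative choices, not the paper's:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for a sensitive-key set (illustrative only)."""

    def __init__(self, m_bits=1024, k_hashes=3):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits)

    def _positions(self, key):
        # Derive k bit positions by salting a cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # False positives are possible; false negatives are not.
        return all(self.bits[p] for p in self._positions(key))

sk = BloomFilter()
sk.add("Jane")
print(sk.might_contain("Jane"))  # → True
```

For this use, false positives are safe in the security direction: a non-sensitive key mistaken for sensitive just gets the (redundant) private-side treatment.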
Compare, say, all-private versus semi, which is what our system is. The assumptions are: this is the initial data size, this is the intermediate data size, this is the map speed per byte of initial data, this is the reduce speed, and so on. So the time a system takes for the reduce of the job is roughly the data size times the per-byte reduce cost, divided by the number of machines, for example. So you can figure it out, and you can figure out the shuffling cost as well. These are the costs, and I will skip past them -- believe that the math is hopefully okay -- and go to the comparison. The comparison is what? This is the expected cost of doing it all-private; this is the expected cost of doing it in SEMROD. If this is more than that, I'm doing better. So I can pose that as an inequality right here and analyze its parameters. The first observation that falls out: if beta and beta star are small -- that means the machines are slow, you've got slow machines on your public and private sides -- good for you; SEMROD is going to work well, because you're basically increasing the left-hand side. If you've got powerful private machines, don't worry about the cloud anyway; this isn't meant for that case. Next, the closer the WAN speed is to LAN speed, the better for you. And the smaller theta is, the smaller the right-hand side, and the better for you as well. Okay. Oh, one more thing: the smaller the number of private machines in this equation, the better for you again. The main question is lambda, the ratio of public machines to private machines: what happens with that ratio? Okay.
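The flavor of that back-of-envelope comparison can be captured in a toy model; the parameter names and the single-term cost formulas below are an editorial simplification, not the paper's exact equations:

```python
def all_private_time(data, proc_cost, n_priv):
    # All map+reduce work done on the private machines alone.
    return data * proc_cost / n_priv

def semrod_time(data, alpha, proc_cost, n_priv, n_pub, wan_cost):
    # alpha: fraction of data that is sensitive (stays private).
    # Public and private sides work in parallel; the non-sensitive
    # data is additionally shuffled over the WAN to the private side.
    public  = (1 - alpha) * data * proc_cost / n_pub
    private = alpha * data * proc_cost / n_priv
    shuffle = (1 - alpha) * data * wan_cost
    return max(public, private) + shuffle

# Slow machines, mostly non-sensitive data, tolerable WAN: the hybrid wins.
t_priv = all_private_time(100, 2.0, n_priv=2)                      # 100.0
t_semi = semrod_time(100, 0.1, 2.0, n_priv=2, n_pub=10, wan_cost=0.2)
print(t_semi < t_priv)  # → True
```

The qualitative behavior matches the talk: raising `proc_cost` (slow machines) or lowering `wan_cost` (LAN-like network) widens the gap in the hybrid's favor, while a fast WAN-dominated shuffle erases it.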
And it turns out that's a little more tricky; I'm going to cut through the math and just tell you what it is. If you think about it, you have a certain amount of work, and the expectation is that the sensitive data is handled on the private side and the nonsensitive data on the public side. So if alpha is the sensitivity fraction, then for every private machine you should have basically (1 - alpha)/alpha public machines; then it's completely balanced -- a load-balanced situation. And it turns out from the equation that that is actually the optimal number of public machines to get. If you have fewer public machines than that, then increasing lambda, the ratio of public to private machines, up to that point gives you a better chance of satisfying the inequality. So it's good to add machines up to that point. Beyond that point, the equation shows it's independent: the public-to-private ratio no longer matters. Which is not unexpected, because beyond that you've got extra public machines, but the sensitive data will never go to the public side anyway, so they sit idle. So effectively, what this analysis does is identify the important parameters of the equation: the public/private ratio lambda, the percentage of sensitive data, and the speed of LAN versus WAN. Let me show you some results. The key question is: given these, does SEMROD do better under realistic assumptions for this parameter, that one, and that one? So we did experiments to figure out how things work across that space. The first experiment is done on a small cluster at UCI itself, with some nodes designated private and some public.
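The balance point he describes is just a load-balancing identity; a one-liner, editorially added (here `alpha` is assumed to be the sensitive fraction of the work):

```python
def balanced_lambda(alpha):
    # With an alpha fraction of the work pinned to the private side and the
    # remaining (1 - alpha) eligible for the public side, the public/private
    # machine ratio that balances load is (1 - alpha) / alpha.
    return (1 - alpha) / alpha

print(balanced_lambda(0.25))  # → 3.0: three public machines per private one
print(balanced_lambda(0.5))   # → 1.0: equal split when half the data is sensitive
```

Adding public machines beyond this ratio buys nothing, matching the flat region of the curve he mentions.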
Private and public in the same cluster, divided into two parts. For the inter-network -- the cross-cluster performance -- we added delays to mimic a wide area network, and we varied the ratio of LAN to WAN speed all the way from one, LAN and WAN the same, to the LAN being 100 times faster than the WAN, just to get a sense of how things go. We also experimented in a realistic setting: we got machines at UCSD as the public machines, ours as the private machines, and ran across the real wide area as well. And we ran this over a variety of benchmarks, from TPC-H to HiBench and so on -- PageRank algorithms, gaming algorithms, sorting, TeraSort, all that stuff. Let me show you a couple of the results. This is across sensitivity ratios, and the Y-axis here is speedup with respect to all-private. The red is CEDEK with the map optimization; the other is SEMROD. When the sensitivity is very high -- 50 percent of the data is sensitive -- the two are similar; otherwise we have a significant advantage over CEDEK. Okay. And this is for the multi-level jobs. CEDEK doesn't benefit much on those jobs, because for it only the maps can be done on the public side; since our reduces run publicly too, we have an advantage compared to all-private. And this is amortized and averaged over all the different jobs we ran. Let me show you one more result. This is for the inter-network part -- this is the part, Don, you were asking about. What I'm varying here is the ratio, from LAN equals WAN speed all the way to the LAN being 100 times faster. And the machine ratio is 1 to 17 -- one private and 17 public -- or 1 to 5, one private and five public machines. And if you look at the performance, this is speedup with respect to all-private.
So even when the WAN is pretty slow -- 100 times slower is very, very slow -- we still get some performance improvement over running it all-private. If the WAN speeds are better, we obviously get much better performance. >>: There's a performance aspect, but there's also a cost, a monetary aspect. In all public clouds, LAN communication is free, and WAN communication is really expensive. >> Sharad Mehrotra: Point well taken. >>: The ratio is infinite in the current cost models. >> Sharad Mehrotra: So this has not considered monetary cost, which is an important parameter. Absolutely right. >>: Not going to one, going to zero. You're going to zero, right? >> Sharad Mehrotra: Sure. This does not consider the cost model. Absolutely; there should be a cost aspect to it. Okay. And these are the results over -- actually, the next one is slightly better. This is a more detailed result over different jobs, showing the relative speedup with respect to all-private. Sorry, I should have gone to the previous slide. There's the other question of how this compares to native Hadoop. You can be pretty bad at times. This is an example: this is CEDEK, up here is SEMROD, and if you ran this on Hadoop on the mixed cluster you'd get a speedup of up to six times; we're getting only up to two. So sometimes, depending on the job, native Hadoop does much better compared to secure Hadoop. >>: [indiscernible] The analytical model doesn't capture this. You have to start some jobs locally, transferring large datasets. So that will -- >> Sharad Mehrotra: It will affect it, yes. At the end of the day, I'm hoping that the solution resides with something that you guys are doing. If there were a secure component of the basic cluster in the cloud itself, that would be so much more ideal, because then you'd never have to worry about what he's saying. I'd always have zero -- no more cost to pay for that extra security. So, yes.
>>: So in this graph, should I read the Hadoop line as being the all-public line? >> Sharad Mehrotra: No, the Hadoop line is not all-public. The Hadoop line is private plus public, but with no security: forget about security, just run as if the machines are all yours. >>: Then one of the points you started with was that it can't be too bad compared to any one of those cases. Where is the all-public line? >> Sharad Mehrotra: We compared to the all-private line, not the all-public line. I see, I see. Okay, I see what your point is, but I think what will happen is -- for all-public, if I've got, say, five private machines and 15 public machines, would you want to compare against 20 machines or against 15 public machines? >>: Either way. >> Sharad Mehrotra: We haven't done that experiment. >>: I see your point. Why would the all-public be better than the all-private? Because -- >> Sharad Mehrotra: Many more machines, faster machines. Right now our assumption has been, in this setup -- in the real test -- that the machines on the public side were much faster: they were the UCSD machines, and ours is a small little cluster, if you can even call it a cluster, so it's much slower. So that point is also valid in our experience. But in the simulated experiments they were machines of the same power, essentially. So we could have done the experiment; we didn't. We compared to all-private and not to all-public. But presumably all-public will be around Hadoop itself; it will not be too much more. Or maybe it will, because it depends on the LAN speeds and so on; it could even be better than that. It's possible. Okay. So where are we going with this? Well, the key observation: what happens during the execution is that initially only some of the data is sensitive, and as you compute, more becomes sensitive. The sensitivity spreads.
There is always a point -- and this is a query optimization problem -- where shipping the work back to the private side is better. How to do that we didn't figure out; we should. Basically, we should do something smarter there, and there are lots of small things one can do. The most interesting to me is that this entire model is very suitable, I would think, not so much for MapReduce but for Spark or AsterixDB and so on. The reason is simple: in a general workflow system you can maintain state and reuse it. In Hadoop it's a silly problem: you have to store everything, you basically lose state, and you can't reuse partial computation. There's a lot more to do in that kind of setting. The other thing: I started by saying I'd give a talk on risk, and I didn't talk about risk at all -- this approach is zero risk. So there's a natural extension of this work toward risk management and so on. We did initial experiments. It's not easy; basically, the results showed it's not an easy problem, and a lot more work needs to be done in that area. >>: Formulating -- okay. The performance. >> Sharad Mehrotra: True. Okay. So, my last slide -- sorry for taking such a long time. Basically, if you go towards the cloud you have loss of control; we all kind of agree on that. It leads to privacy and security concerns, and we have focused on security via encryption, which is great, wonderful. But I think there's a lot of power in not treating encryption as the only approach to secure processing. The power of using trusted hardware available to you, whether it be a client machine or the server side itself, is a very useful direction. And that was kind of the point: we didn't do trusted hardware, we did a trusted client -- there are differences, but it's along the same direction. I think that's the right approach, at least worthy of exploration as well.
We have done a lot of work on this modeling and so on, but it's not really mature; our risk model is very simple. It's basically binary: zero risk, or, if you expose something, full risk. A lot more work is needed on the risk modeling. >>: When you say part of this could use secure hardware to improve it, that's not completely obvious, because I think one key assumption is that your data needs to be stored on the public side, not the private -- even the initial data of who has what disease is hidden. That can be -- but in any architecture where the metadata itself resides in the cloud, then again other things can go wrong. >> Sharad Mehrotra: All I'm saying is, I think mapping this onto secure hardware appropriately is not a straightforward thing. It's not trivial. It's not trivial at all. >>: Even data storage is a problem. >> Sharad Mehrotra: I agree with you completely. How you use secure hardware is going to be interesting in reality; it's not a done deal. But I think it's an interesting direction. Whether it will overcome some of the problems Don is defining -- that's the main question -- it's an approach one should explore. As for what we have done: I talked today about SEMROD. If I get a chance, let me talk to some of you about the work we've done on the Hypervisor side -- I would love to get your feedback -- and on CloudProtect. I know Don has done something similar; one of the things he did when he was at ETH is very related, because I read that paper of yours. So we've done work in that direction. I'll stop at this point. Sorry it took a little longer; I'm not going to go into other things right now. >>: [indiscernible]. [laughter] [applause]