>> Matthai Philipose: It’s my pleasure to introduce Seungyeop Han, who’s a PhD candidate right now at the University of Washington. He’s a grad student working with [indiscernible] and Tom Anderson over there. Before that he was an undergraduate at KAIST. Actually, in between he did three years at [indiscernible], one of the big internet companies in Korea, as a software developer. He can speak code. He’s interested broadly in mobile, networking, and distributed systems, and he’s built several such systems during his thesis. If you read his references, one of the things that becomes apparent is that he’s done three theses’ worth of work, and built three corresponding systems, during his stay. One of them is AppFence and the related mobile privacy and security work with [indiscernible] and David Wetherall who… >> Seungyeop Han: And Stuart. >> Matthai Philipose: And Stuart, actually, yes, sorry. [laughter] I’d written it down here, but, before David ran off to Google, Wetherall that is. Then he worked with many of us here at Microsoft Research in the Mobile Networking Group on the [indiscernible] and MCDNN speech and mobile recognition systems, the distributed runtimes for some of these things. But the [indiscernible] thesis turns out to be about a set of wide-area distributed services that he’s going to talk about today, MetaSync and the pseudonym work, that kind of work. Without further ado I’ll let him talk about that. That’s what he’s focusing on. >> Seungyeop Han: Okay, thank you Matthai. Hello everyone. Can you hear me well in the back? As introduced, I’m Seungyeop from the University of Washington, over the lake. Not that far away. I’m from the Networking Lab, but as mentioned I’ve been interested in many different areas of computer science. I’ve done distributed systems work, security and privacy work, and even some computer vision and machine learning related work.
But today I will talk mostly about systems research for untrusted or unreliable environments. In recent years, how people use computing devices has rapidly evolved. We have diverse devices, from traditional computers like laptops and tablets, to smartphones, watches, and glasses. Now the Internet of Things suggests that all the different devices around you will have some sort of computing power and will be connected to the internet. Along with new applications, the amount and the types of personal information have been increasing as well. It includes users’ location or contact information, health information, or something like your search queries, etcetera. It has become very easy for users to access their information through these applications and devices. Unfortunately, not only the users but other parties can also easily access your personal information; when we are using the internet there are so many entities that are not trusted or not reliable. These are raising security and privacy risks. At the same time, users may not understand what’s happening and do not have much control over it. We can’t actually change the first two things. Ultimately we want to change the last point: we want to give users control over how their information is exposed and used by remote services. Let’s look at a simple diagram showing the entities involved in network communication. You may run some application on a smartphone or computer. It’s connected to the internet. It’s talking to a remote service. Any of the entities in this diagram can be untrusted or unreliable, except for yourself. Throughout my PhD study I have studied various issues around each of these. Let’s talk first about the application. As Matthai briefly mentioned, I worked quite a bit on smartphone security and privacy. When I was starting my PhD in 2010, Android phones were starting to become quite popular around the world. It’s also open source.
One of the great things with smartphones, not only Android but also iPhones and Windows Phones, is that you may download different applications from the application market. The good thing is we can enjoy the functionality from third-party applications. But at the time, and probably even now, we don’t understand much about what’s going on when we are using an application. What that means is the application can access many different types of personal information, and we don’t understand whether they are sending it to other places or just using it there. I participated in the TaintDroid project, which is an information-flow tracking system. It reveals which information is going where. Your location going to Bing Maps may be fine, but it may also be sent to an advertising server. Based on that project, with my colleagues I proposed a protection mechanism called AppFence. We used two different mechanisms: one is to selectively give false information to the application, and the other is to block such information on the device from being sent to remote services. I also worked on studying third-party tracking, like advertising and analytics, and also on testing security vulnerabilities in smartphones by combining static and dynamic analyses. Moving on to the next part, the network can be untrusted or unreliable. Fundamentally, the current internet was not designed with security or privacy considerations. We were looking into how we could redesign the whole internet with the principle that all the network elements use only the minimal information needed to allow internet communication. One of the key ideas, I mean key design questions, was to use Tor onion routing as an addressing scheme: instead of an IP address, a host uses a Tor address as the network protocol, through an application proxy.
We could demonstrate that with the proper design we can communicate without many of those kinds of privacy and security issues. Finally, there could be more obvious adversaries at the remote services, which is what I will talk about in the [indiscernible] today. By the way, if you have any questions during the talk, just feel free to ask. In this talk I will discuss how we should design systems when the remote services are not reliable or not trusted, with two different example systems. First, I will talk about MetaSync, which is a file synchronization service across multiple untrusted storage providers, including Dropbox, OneDrive, and Google Drive. Then I will talk about the pseudonym abstraction, where each application or user can use many pseudonymous identities to control information exposure. Then I will briefly introduce some other work I did and future directions. Let’s first look into MetaSync. There are many different cloud services. Among them I was looking into file sync services. One of the most popular uses of these services is backing up your folders into the cloud; also, many different types of applications rely on those services. Services like Dropbox or OneDrive make it easy for users to access their files from many different devices, and also prevent them from losing their file contents. We also use those services for sharing files with other users, like your friends or collaborators. With this convenience they are getting very popular. Last year Dropbox announced that the number of their users reached four hundred million. Many similar services are provided by various companies, including Microsoft’s OneDrive, Google Drive, and Box. Recently, Chinese companies like Baidu and [indiscernible] started to provide users two terabytes of space for free. This sounds great; we can use those services.
But the question is: can we rely on a single service for keeping all of our files or for sharing files with others? We expect them to work well, but fundamentally there is no reason for us to trust those service providers. Some services may come from small companies and may become unavailable from time to time. Some others may be provided from other countries, so you may not trust them. We have seen many incidents where data stored in those sync services was at risk, even with relatively trusted companies like Apple or Dropbox. How can we protect our files? To protect files, you may think of encrypting your files before storing them in the cloud. There are actually services doing that, but it does not provide any better availability. We could also build a whole new system from scratch that uses those cloud services with minimal trust; we’ve seen several such systems, especially in the distributed systems community. But here I want to tackle the problem from a little bit different direction. We are building a file sync service by exploiting existing services. Each service has its own unique features to differentiate it from the others, but the core functionality is to allow users to put their files into a cloud service and access them from different devices. We are building a sync service over the APIs provided by those services. By combining them we could provide higher availability, greater capacity, and higher performance, with stronger confidentiality and integrity. Yeah? >>: How is the availability today? >> Seungyeop Han: Sorry? >>: What is the availability today? What is the percentage availability for any of these services? >> Seungyeop Han: We don’t have a number; I mean, I don’t have a number off the top of my head. They are mostly available, they are pretty much [indiscernible]. But if you look into it, there are some availability-checking services.
It’s not rare to see that sometimes one of those services is not available. We’ve seen, for example, some services which permanently shut down after all. >>: Can you show us some numbers comparing what is higher in your results? >> Seungyeop Han: Right, so we are not showing it as a number, but by design we have better availability, meaning even if some of the services are failing, the users can still access their files. >>: Okay, but I mean that’s a very good question, right. Suppose the availability was finite, let’s pretend, okay. Let’s say you added one extra nine on top of it, is it worth it? >> Seungyeop Han: That’s a good point. >>: Why do you care? So unless you have the current numbers, you’re asking why build MetaSync, the [indiscernible] system that we will see. >> Seungyeop Han: Right, so… >>: If it’s just availability, unless you show numbers it’s not clear why you need MetaSync for availability. >> Seungyeop Han: That’s, I think, fair. >>: You can’t show numbers for rare catastrophic events, right? >>: But even, no… >>: I mean I’m sure MetaSync has reduced the potential number for that, that’s fine, right? >> Seungyeop Han: Yeah, I think that’s fair. In terms of reliability, I don’t have a number, unfortunately. But then I… >>: That’s fine but… >> Seungyeop Han: Then I can show that there have been incidents where the services were not available, and this kind of design could help in those cases. >>: Another quick comment I want to make is about the greater capacity: every single service at [indiscernible] has infinite capacity right now if you just want to pay. >> Seungyeop Han: Right, it just started from the very naïve idea of whether we can use more free space. >>: [indiscernible] availability… >> Seungyeop Han: It’s weird claiming greater capacity when someone who pays gets more space… >>: [indiscernible].
You can use everybody’s free space… [laughter] >> Seungyeop Han: Okay, those are good questions. In addition, we have two more goals in MetaSync. First, we don’t expect communication between the service providers, because they are run by different companies, and we don’t want to rely on communication among the clients. Lastly, we don’t want to introduce additional servers for using the system; we make everything run in the clients. >>: Seungyeop, can you tell us a little bit more about the justification for these two goals? Like, you don’t want to have some service that you’re running in the cloud that mediates between the different services, and you don’t want me to have multiple clients on my computer with some client in there that mediates between them? >> Seungyeop Han: Right, so the thing is, for the communication between clients: for a file synchronization service you could access those services from your computer or your smartphone, and especially if you are sharing a folder with colleagues, several different devices or different users can access those folders at the same time, right. That’s the assumption we have. We are connecting to the multiple service providers and communicating through them. For the justification, I think the first one, between service and service, is obvious. For communication between the clients, we might need some mediation server, or something like a DHT or whatever, to figure out which clients are online, etcetera, which incurs some system complexity as well. That’s one of the reasons we didn’t take the peer-to-peer route for [indiscernible]. The reason for no additional server is partly a trust issue. We are claiming that, okay, there’s little fundamental reason for you to trust Microsoft.
It doesn’t make much sense that you should instead trust us just because we are running a better service. Also, the clients are open source, so anyone can see the code and audit what’s happening in the clients. For the server side, another reason is that we can’t promise we will run a service forever. That’s the justification for why we are taking these design considerations. >>: That also removes [indiscernible] services as a potential failure, correct, since he is not running any service, right? >> Seungyeop Han: Yeah. >>: That’s good. >> Seungyeop Han: From the goals there are three key challenges for the system. We need to maintain a globally consistent view of the files across multiple clients and over multiple services. Furthermore, we want to build this system using only the service providers’ unmodified APIs; we can’t force them to create new APIs for us. Finally, this should work even if some service is failing. Let’s see how we designed this. This is the overview of the, yeah? >>: Just for my understanding, the APIs provided by different services, do they have any inherent differences between them? >> Seungyeop Han: Right, some of the functionalities vary a little across the services. I will explain a little more about how we exploit different services with different APIs in a bit. But the core functionality is putting files and getting files. We are using the service providers more like blob storage rather than using their synchronization features. >>: I’m just curious about the security more than anything. Do they provide any APIs to increase the security? >> Seungyeop Han: What do you mean by APIs?
>>: I don’t know, encryption, if you could [indiscernible] encrypt API for [indiscernible]… >> Seungyeop Han: No, there isn’t. They definitely [indiscernible] encrypt files in their cloud with some keys they maintain, but we don’t know of any exposed APIs to control that. >>: How do you support sharing between two different users? One particular example is anonymous sharing, right. A convenient one: on OneDrive I create a link, I send the link to anyone; they can use that link to access pictures I want to share instead of dealing with their OneDrive account, where I have to manage the permissions, right. >> Seungyeop Han: Yeah. >>: I think for any encrypted file system like this, it’s easy to deal with a single user. The challenge is multiple users. >> Seungyeop Han: Right, that’s a good question. For the encryption we provide a fairly minimal encryption layer which encrypts files based on a password. Also, it’s not your whole space that is shared; we use something similar to version control systems’ repository model. You can designate a folder to be synchronized across multiple providers. You can share it; you share the password with the users for accessing it together. Again, there are several… >>: [indiscernible] password to your own system? >>: Yeah… >> Seungyeop Han: The password for encrypt… >>: The passwords are embedded in this MetaSync layer, right? >> Seungyeop Han: Encrypting… >>: How does this MetaSync layer share the… >> Seungyeop Han: Sharing is, we are using each client… >>: Yeah. >> Seungyeop Han: Each service’s sharing features, I mean sharing APIs, to make sure each of the users has access to those repositories. The second question is, if it’s encrypted with some key, how can we manage sharing of the key? >>: Yes… >> Seungyeop Han: We use password-based encryption, which is a pretty simple mechanism.
Because there could be several different orthogonal [indiscernible] and contributions, for encryption we are taking the minimalistic, I mean the simplest, approach. For different users to share a folder, the sharing is done by, again, API calls. Then the sharing of the key is basically telling the password, I mean the encryption key, to the other person. >>: There are certain unique features that some of these storage providers give you if they understand… >> Seungyeop Han: Right. >>: …the content that’s in the file. For example, for certain open formats Google will give you some really rich semantics around version control. For images, OneDrive will look for particularly problematic images and alert the authorities if that’s the case. >> Seungyeop Han: Right. >>: Do you lose those semantics? >> Seungyeop Han: Yes, we lose those semantics, because, as I mentioned, we identified the core functionality as putting files into the cloud and being able to access them. We are not supporting those other functionalities for now. Some of them may or may not be implementable on top of this. >>: Okay, and then does your system also support web clients? >> Seungyeop Han: Currently not, but it could. >>: It could be done? >> Seungyeop Han: Yeah. >>: I just want to understand the sharing model a little bit better. Let’s say I store a picture that is maybe… >>: [indiscernible]. >>: Yeah, no go ahead, go ahead, we’ll speak later. >>: Just to clarify, the stuff above this line, is that on the client side or is that in the cloud? >> Seungyeop Han: This is the client side… >>: Client side, okay. >> Seungyeop Han: This is the… >>: That wasn’t a joke, that wasn’t a joke question? >>: No, no, it just wasn’t clear to me that that was the client side. >> Seungyeop Han: Yeah. >>: They were following up on the line [indiscernible]. >>: Oh.
>> Seungyeop Han: Okay, so… >>: [indiscernible]… [laughter] >>: [indiscernible] questions. >> Seungyeop Han: Where was I? Okay, so there are three subcomponents: one is the object store managing the files, one for synchronization, and one for replication. As mentioned, there’s a common abstraction for each of the services, and in between we do encryption and integrity checking. Let’s look into the first part, the object store. The object store holds copies of the files. Those copies will later be [indiscernible] and synchronized to the backend services. It has a data structure pretty similar to any other version control system, like git: it uses content-based addressing and a hash tree to store the objects. Because we are using content-based addressing, the names of the files are determined as the hash of the content. The integrity check is simple, because you can just check whether the name matches the hash of the content. It also automatically de-duplicates, because if the contents are the same they will be mapped to the same object. Finally, because each object name is unique, each client can independently modify and upload or download the file objects. Looking inside, it would look like this. A directory maintains pointers to the files. Files are chunked if the file size is too big, or the files in a directory can be merged into one single object if the file sizes are too small. Basically, it’s an optimization to make the sizes of the objects relatively uniform. From this hash tree, the hash of the root directory of a repository uniquely identifies the current snapshot. When some of the files are modified, for example here the large [indiscernible] is being modified, the object store will create a new blob for it and update its parent pointers recursively. Basically, each blob is considered an immutable object.
This is the logical view of the object store. We have files in the local file system. We synchronize those files into the backend. After those changes we have a new hash value for the new snapshot. Yeah? >>: I have a question regarding freshness guarantees. If we’ve shared a tree and we open the same file, I presume there’s some metadata that tells me that the file has been updated? >> Seungyeop Han: It’s a kind of a… >>: By the way, if you’re getting to it in the latter part of the talk… >> Seungyeop Han: Yeah, I will get to that, but a little bit about it before actually getting there. Again, this model follows version control systems like git. It allows the other clients, or sharing users, to access or modify the files, but any conflicts may later need to be resolved manually. Simply put, we replicate objects redundantly across R storage providers, where R is some configuration number. For example, here with replication factor two, each blob will be replicated over some set of two services, like this. There are several requirements for this replication. First, we need to minimize the shared information among the services and clients, because there could be many different objects, and it doesn’t make sense to store all the different mappings about whether this object goes to, say, Google Drive and Dropbox, and that object to some other set of services. We also need to support variation in storage sizes; some services provide two gigabytes, others fifteen gigabytes or two terabytes. Finally, we wanted to minimize realignment upon configuration changes: if you are adding a new service, removing a service, or changing the space allocation, we wanted to minimize the realignment. I will not go into the whole detail.
But basically we use a deterministic mapping function that can be computed from a small amount of shared information. Each client can independently calculate the result of the function to say, okay, for this object hash, we should put this object on OneDrive and Google Drive. >>: What if one of the services goes bankrupt or becomes unavailable? Your deterministic mapping cannot handle that. >> Seungyeop Han: The function can be changed. Again, it would take too much time to explain exactly how it’s built, but that’s correct: you can remove one service, and then it remaps the objects to the remaining services. One of our requirements was to minimize that cost even when removing or adding a service, so it doesn’t cost as much as other mapping algorithms. I can talk to you a little more after. >>: Okay. >> Seungyeop Han: That was replication. Now I will talk a little bit more about the synchronization steps. There are two different things we need to share or agree on between the clients. One, as mentioned, is where the files are stored. The other is, if each client is modifying files or folders, how can we know how to apply the changes to each other? How can we know what’s the most recent version of the folder? As mentioned, each object can be independently updated and independently uploaded or downloaded. But a problem can happen if multiple clients are modifying the folder and each is insisting that its changes should be applied before the others’. Let’s see how it works: imagine there are two clients, and at the beginning they are synchronized to the same point. If client one updates some files and it’s the only client updating, then it can easily update the whole global view, and the other client can catch up.
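One simple way to realize a deterministic mapping with those properties, small shared state, capacity weighting, and minimal remapping on configuration changes, is weighted rendezvous (highest-random-weight) hashing. To be clear, this is a stand-in sketch, not necessarily MetaSync's exact mapping function; the function name and capacity numbers are invented for illustration.

```python
import hashlib
import math

def replica_services(obj_hash: str, services: dict, r: int = 2) -> list:
    """Pick the r services that should hold an object.

    Weighted rendezvous hashing: every client computes the same answer
    from just the shared service list and capacities, and adding or
    removing a service only remaps the minimal fraction of objects.
    """
    scores = {}
    for name, capacity in services.items():
        digest = hashlib.sha1((obj_hash + name).encode()).digest()
        # Map the digest to a uniform number in (0, 1).
        h = (int.from_bytes(digest[:8], "big") + 1) / float(2**64 + 1)
        # Weighting by capacity lets a 2 TB service receive
        # proportionally more objects than a 2 GB one.
        scores[name] = -capacity / math.log(h)
    return sorted(services, key=lambda s: scores[s], reverse=True)[:r]

# Capacities in GB (illustrative numbers only).
services = {"GoogleDrive": 15, "Dropbox": 2, "OneDrive": 15, "Baidu": 2048}
picked = replica_services("0beec7b5ea3f0fdb", services, r=2)
assert len(set(picked)) == 2
```

The key property for the question above: if a service that was not chosen for an object disappears, the scores of the remaining services are unchanged, so that object's placement is unchanged; only objects actually hosted on the failed service get remapped.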
But a problem happens if both clients modify files and each is trying to claim that the next version should be its own. In this case we need some mechanism to order the changes, so that the other client can merge onto them. I’ll explain this synchronization mechanism now. You might realize that this is a traditional distributed systems problem: determining order in the system. We might use Paxos or two-phase commit algorithms; they reach consensus between the clients on how to apply the changes and in which order. Paxos is a multi-round, non-blocking consensus algorithm and is safe regardless of failures; it makes progress as long as a majority of the acceptors is alive. However, we don’t have Paxos or two-phase commit APIs from the service providers, and as mentioned in the goals slide, we don’t have communication channels between servers or between clients. The challenge here is that we need to handle concurrent updates and potentially unavailable services relying only on the existing APIs. What we do is simulate Paxos: there are no Paxos APIs, but we could devise a way to simulate it with the given APIs. In particular, we found that for each service we could use their APIs to build an append-only log abstraction. With an append-only log, clients send the normal Paxos messages to the backend services, and when the messages arrive the service just appends each message to a list. Clients can later fetch the list of messages to figure out which proposal was accepted. This abstraction can be built in various ways over the services: we built the append-only list abstraction with the comments on a file for Google Drive, OneDrive, and Box. For Dropbox, overwriting a file creates a new revision in the revision list, so we could use the revision list as an append-only list.
If those are not available, we could use sequence numbers in file names in a directory to create a list of files that has an order. >>: What part of the documentation for Google Drive and OneDrive makes you think that the comments on a file provide strong consistency globally? >> Seungyeop Han: Right, that’s a good question. We looked into that, and we don’t have a real guarantee that they have strong consistency. We checked empirically to see whether it works as we expected, but it’s fair to say we don’t have an actual guarantee of strong consistency. We model them as linearizable and build on top of that assumption. Now the backend services work as passive acceptors: they log each message from clients. We call this passive because the acceptors just store the messages, and accept decisions are made in the clients as if they had been made in the acceptors. In this diagram, each client may propose a new root: client two says the new root is one, and client one says something different. Then, after reading the log, if there is a majority, a client can learn that the new root should be the one from client two. This is a simplified diagram; the actual algorithm needs two rounds, like prepare/promise and propose/accept in the real Paxos algorithm. But let me know if you have other questions related to this. After devising this, we realized that Lamport had proposed something similar, called Disk Paxos. After aligning the two, we could see our passive Paxos algorithm as a form of optimization of the Disk Paxos algorithm. In particular, in Disk Paxos a proposer needs to access a per-client block, so the number of messages is on the order of the number of clients times the number of acceptors.
But because of the append-only list, we could reduce that to the order of the number of acceptors. >>: Question? >> Seungyeop Han: Yep. >>: Is this different from the Google system called Spanner? They use Paxos and support some various [indiscernible] that provides consistency… >> Seungyeop Han: Right, so overall it’s not different from any Paxos-based system or Paxos algorithm. But the thing is, we are building a new way to run Paxos; you could say it’s a new client-based Paxos algorithm. That’s the difference from any other Paxos algorithm. >>: You have your [indiscernible] mapping, which assumes the services are always there; otherwise the [indiscernible] mapping is troublesome. Then you could even have a different [indiscernible] master and slave, right, for anything that’s replicated across services… >> Seungyeop Han: Right, so that… >>: That makes this thing simpler? >> Seungyeop Han: That could be simpler, but it cannot make progress when the master is failing. >>: But even masters have failed. If anything fails, the [indiscernible] mapping has to dynamically handle it, then it’s not a [indiscernible]? >> Seungyeop Han: No, no, no, we are replicating over some set, right. There are two different things again: one is replication, and one is synchronization, to determine the most recent version. For replication, if there is some failing service, you may be able to put the object on fewer services at the moment, but it is still accessible because you have a copy on one of those R servers. Even if a service is failing, say Google is failing, you can still get that object from, for example, Dropbox or OneDrive. That’s one thing. The other is synchronization, where we simulate Paxos; Paxos can keep working even if some services are failing.
>>: But if you want to separate these two things, then let’s assume there’s no replication. If there’s no replication, the [indiscernible] clients are [indiscernible] to a single service. Then that kind of problem can be solved with some other function instead of Paxos, right? It’s only because you are trying to get multiple things, and you’re trying to not just use your replication as a backup; you’re also accessing the replicas for load-balancing purposes. Otherwise you don’t need Paxos, right? There’s something that’s not clear to me. >> Seungyeop Han: We can talk a little more later, but again, we’re solving two different problems here. Even for the synchronization problem we need Paxos, because two-phase commit or a [indiscernible] approach cannot guarantee progress if there is a single failing service. That’s why people use Paxos in ZooKeeper or Spanner; that’s the bottom line. We implemented the MetaSync system prototype in Python, because many service providers have APIs in Python. It currently supports five different backend services: Baidu, Box.net, Dropbox, Google Drive, and OneDrive. We have two different types of clients: one is a command-line client, and the other is something similar to the other native clients, with a dedicated folder synchronized to the cloud periodically. As one of the evaluations, we checked the end-to-end performance. For this, we synchronized a folder between two computers, using the sync services’ own clients and also MetaSync. I present the results for two different workloads: one is the Linux kernel source code, which has many small files and directories; the other was fifty photos. You can see that we are outperforming them, but it’s maybe not a really fair comparison, because we certainly get some performance gain by design, with parallel upload and download across multiple providers.
Also we are combining small files into a blob. But it's unclear whether they really wanted to optimize their synchronization performance in the native clients as well. I wanted to show that it's a working prototype, so users can have another option for… >>: I'm confused by that statement. In the first row you're uploading fifteen thousand files. >> Seungyeop Han: Yep. >>: You're saying that MetaSync is faster because each of those fifteen thousand files is broken up into smaller chunks? >>: Other way. >> Seungyeop Han: No, no, no, the other way. >>: Combined at the… >> Seungyeop Han: They are combined… >>: I see. >> Seungyeop Han: It's not combining everything into one larger blob. But we have a policy: if a directory contains files smaller than some threshold, we merge them into a single blob. >>: But if you upload, for example to Dropbox and Google, if you upload a directory, presumably they can do the same thing that you're doing… >> Seungyeop Han: They could, but they are not doing it. >>: Not only that… >> Seungyeop Han: That's what I mentioned… >>: That I know is not true, at least for OneDrive. In OneDrive uploading a directory and uploading a file are two different options. OneDrive does do something interesting if you're uploading multiple files from the same directory. >> Seungyeop Han: We… >>: I could be wrong but… >> Seungyeop Han: Yeah, I have… >>: But there's no… >>: [indiscernible] >> Seungyeop Han: I have a number for OneDrive but it's not here. What was similar was Dropbox. I mean, it's faster than the other services, but… >>: But what does this… >>: [indiscernible]… >>: Upload, download doing? >>: They're separate. >>: Oh. >>: There's the file blobbing, which you're saying any of these guys could do in principle. >> Seungyeop Han: We are also uploading multiple files concurrently, when there are many different files, right.
We're uploading them concurrently, and also uploading and downloading concurrently to the multiple backends. >>: [indiscernible] >> Seungyeop Han: One thing is each service has some [indiscernible] like per-user bandwidth limits. We can somewhat overcome that, though it may not be the biggest factor. But it's one of the factors behind parallel download. >>: Have you compared this with a limited approach where you just zip the files at one client and unzip at another? How does that compare? >> Seungyeop Han: Uploading through Dropbox or OneDrive and downloading? We haven't compared that. Again, on performance, we wanted to show that this is potentially one of the better options. We're not claiming that they are doing something very wrong. More that we wanted to show that the design is working and… >>: What does the [indiscernible] experiment that we can do? Just zip it and… >> Seungyeop Han: Yeah, I think that's an interesting thing to do. >>: A clarification on the Dropbox and Google columns. Is that the performance of the Dropbox client and the Google client, or the performance of your Python code talking only to Dropbox? >> Seungyeop Han: Their client. >>: Their native client or their REST client? >> Seungyeop Han: Their native client. We ran their clients on the computer and put the folder into the dedicated… >>: The message I'm getting here is that there are some optimizations they should be doing. >>: This is not… >>: They're not doing. >>: Yeah… >>: But the overall high-level bit is that the whole synchronization protocol is not taking a huge amount of extra time. >> Seungyeop Han: Right, yeah. >>: There's no huge overhead from that API… >> Seungyeop Han: Sure. >>: We need that to do [indiscernible]… >> Seungyeop Han: Yeah, we can talk a little more after the talk. But let me move forward.
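The small-file merging policy discussed above might look roughly like the following sketch. The threshold value and the (name, offset, length) index format are assumptions for illustration, not MetaSync's actual blob format.

```python
def pack_directory(files: dict, threshold: int = 4096):
    """Split a directory's files into one packed blob plus standalone uploads.

    Files smaller than `threshold` bytes are concatenated into a single
    blob, with an index of (name, offset, length) tuples so each file can
    be recovered; larger files are uploaded as-is.
    """
    small = {n: d for n, d in files.items() if len(d) < threshold}
    large = {n: d for n, d in files.items() if len(d) >= threshold}
    index, payload, off = [], bytearray(), 0
    for name in sorted(small):           # deterministic order
        data = small[name]
        index.append((name, off, len(data)))
        payload.extend(data)
        off += len(data)
    return (index, bytes(payload)), large

# Example: two tiny files are merged; the big one stays separate.
files = {"a.txt": b"hi", "b.txt": b"hello", "big.bin": b"x" * 10000}
(index, payload), large = pack_directory(files)
```

Uploading one blob instead of thousands of tiny files avoids per-request API overhead, which is where most of the win on the Linux-kernel workload comes from.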
In summary, I presented MetaSync, which combines multiple existing file synchronization services to build a more reliable file synchronization service. We achieve consistent updates with a new client-based Paxos algorithm, and I presented how we could minimize redistribution through a stable deterministic mapping. If you're interested, the source code is available; please visit the website. >>: If the client is disconnected are they allowed to edit objects? >> Seungyeop Han: They are allowed to edit the objects, but the objects will be synchronized later. >>: Does synchronized later mean rejecting changes completely or merging them? >> Seungyeop Han: No, they need to be merged. [indiscernible]? >>: On the previous slide how many Paxos operations are run? Is it one per file or one for the whole update? >> Seungyeop Han: It's actually one for the whole update. This isn't really an evaluation of the Paxos algorithm itself. I have a graph if you're interested. >>: Is it done at the end, after all the data is up there? >> Seungyeop Han: Yeah. >>: If there were a conflict eighteen minutes and fifty-four seconds in, would it then have to start entirely from scratch? >> Seungyeop Han: No. Well, the merging itself can be considered a somewhat separate problem. For example, when merging a conflict in git there is some way to resolve parts of a file automatically, but some files need to be handled by user intervention, marking okay, this part is merged by the user. We're not really handling much; we just mark some of the conflicting files as conflicted. But in general you don't need to start from scratch; there's a way to figure out which files are conflicted. Because we also have hashes, it's easy to check whether files are the same or not. >>: What is the state of the system?
Let's say there were two uploads that conflicted and this is the second one, and the conflict happened eighteen minutes and fifty-four seconds in. What state are we left in after both of those attempts have happened? >> Seungyeop Han: Could you clarify the question? Like, there are two empty repositories and both of them are uploading some different set of files, or? The eighteen or nineteen minutes is basically the time to synchronize the whole Linux kernel from one computer to the other. That's not really creating a conflict. A conflict usually means modifying some of the files over there, and that could involve some number of files, but not taking as long as nineteen or twenty minutes, right. Not sure I answered that. >>: We'll check it out for you. >> Seungyeop Han: Okay. >>: I noticed that Amazon's consumer file service is not on your list. >> Seungyeop Han: Right. >>: Qualitatively, you know, how did it differ? Did you purposely not consider them? >> Seungyeop Han: Right, we are missing it because we focused more on end-user file services like Dropbox or OneDrive. It's fair to, I mean, it's something we could include in the system. >>: When you were doing this did you observe any odd failures? The common failure mode I would imagine would be services up or services down. But did you see failure modes where the blob of bytes you got back from the storage provider was not the blob of bytes that you wrote? >> Seungyeop Han: [indiscernible], so. >>: Okay. >> Matthai Philipose: Can we listen to the rest of the talk, guys? >> Seungyeop Han: Sorry? >>: [indiscernible] contained. >> Matthai Philipose: This is half the talk. [laughter] >> Seungyeop Han: Yeah, how many minutes do I have? >> Matthai Philipose: [indiscernible] >> Seungyeop Han: Three more, sure.
>>: He was the last question… >> Matthai Philipose: I know, but there has to be a limit, right. >>: Thirty minutes. >>: Yeah, keep going. >> Seungyeop Han: Yeah, okay. With MetaSync I've talked about a file sync service with unreliable or untrusted backends. It also can prevent the remote service from leaking or corrupting files. Now I will change gears a little into how we can prevent a service from linking user activities. You might have had the experience that after visiting Zappos to look at some shoes, ads for shoes follow you to CNN, Fox, or whatever news sites you go to. There are many different types of personal information, but for some of them the privacy risk is not really obvious. One example is the tracking behavior I just mentioned, tracking user behavior. When we use today's internet we should [indiscernible] that most of our activities are somehow tracked on the websites we visit. That's done through tracking, as in the example I just showed, right. According to previous work, among the top five hundred internet websites, ninety-one percent, it's from two thousand twelve so it may be more now, but at the time ninety-one percent embedded at least one tracker, and eighty-eight percent embedded third-party trackers like advertising or analytics. Web services are very well incentivized to build large user profiles, because their revenue model is strongly tied to how well they understand users' interests. Information collected by those remote services could range from your demographic information or location to more sensitive things like political opinion, medical information, or sexual orientation. Let's look into how tracking is done a little more deeply, at a lower level. In this example Alice is sending a set of queries, like a Microsoft route to her home address. Bob is sending another set of queries.
From the tracker's view, the first set of queries is coming from one user and the second set from someone else. They are linking those queries or activities on the remote side. What does that mean for us? The information collected by trackers creates a very detailed picture of you. Because of this we usually think, okay, tracking is bad, tracking is harmful to us. But on the other hand we also get some benefit from being tracked. Tracking is a tool for creating relationships between users and services. It enables personalization like recommendation services. It may sometimes be used for better user security, as in banking systems. Also, in some sense we can say that we are paying for the service by being tracked. In terms of tracking, the threat model here is not just being tracked, but that they are tracking you through information in packets to correlate even unwanted traffic together. So what could be a better scenario? I want tracking to look like this diagram. Let's keep [indiscernible] her address from being correlated to the other queries. Similarly, Bob may want to separate activities related to his [indiscernible] depression from his other activities. Even though those two sets of queries are coming from one [indiscernible], trackers cannot know whether they are coming from one host or two different end hosts. Again, I'm not arguing that we need to get rid of the whole tracking capability. Instead we want to provide users with more control over what can be tracked, or what can be linked together. By giving users control over what can be tracked we could reduce or even remove the privacy risk. At the same time we believe we can still maintain the positive side of tracking. Before jumping into our approach I want to explain how tracking works a little more. It's pretty simple again.
Services track users by linking their requests. At a very low level, multiple requests from a single host or single application can be linked on the remote side because they share some identifiers. The identifier could be something like a cookie, which is the most common one, but IP addresses also give quite a bit of information to the remote side, especially when combined with fingerprinting information like fonts from the operating system, etcetera. Therefore users should have controls with an abstraction covering all the different identifying features, not just cookies or IP addresses. We call this a pseudonym. We want each host to manage a large number of unlinkable pseudonyms. Users or applications can choose which ones are used for which operation, so that the remote service has limited ability to correlate those operations or activities. For example, there could be a [indiscernible] pseudonym for one-time use per trace. From the previous example, Alice may use one pseudonym for medical information and another for her home address to separate them. Now, this is an overview. Let's see how we want to use pseudonyms. When Alice is querying medical information she uses one pseudonym. Then when she needs another one, the application finds which pseudonym to use through the policy engine. It communicates with the operating system to allocate more IP addresses. In turn, the operating system needs to talk to the network to figure out, okay, I need more IP addresses from DHCP, and how can packets from the remote service be routed to my computer. Finally, the application can now use another pseudonym for the location-related query. From this picture I will first explain the application-layer design, then describe our network-layer design to support the pseudonym abstraction. >>: I have a classification question.
Are you talking about first-party tracking or are you talking about the eighty-eight percent of third-party tracking? >> Seungyeop Han: It covers both of them. That's more related to how the policy [inaudible] designs… >>: But not really, right, because unless you also manipulate how the browser handles cookies… >> Seungyeop Han: We… >>: If I'm Google and you log into me, I put a cookie, and I also embed my thing into all the websites you go to, and you give me your cookie, then I know which websites you go to, right. >> Seungyeop Han: Right… >>: It doesn't matter which IP you come from. >> Seungyeop Han: This is a cross-layer design. We also modify the browser to manage how cookies are handled. >>: Oh, are you going to talk about it? >> Seungyeop Han: A little bit, yes. >>: Okay, alright. >> Seungyeop Han: It's not that much about that, but I will talk a little bit about it, sure. >>: So that I understand this a little bit. Today if you start a browser in [indiscernible] I guess the IP address won't change. >> Seungyeop Han: Right. >>: But you can log in as a separate user there and do searches there. >> Seungyeop Han: Right. >>: Cookies would change. >> Seungyeop Han: Right. >>: Is IP the unique bit here, that you're supporting different IP addresses? >> Seungyeop Han: There are two different things. One is, although we are mostly talking about IP and cookies here, the identifying abstraction should consider many different types of potential identifiers, not only in web browsing but let's say in the web browsing context: IP address, cookie, something related to JavaScript, or some of the web-browser-related information. Those things need to be considered. That's one argument here.
The other is, yes, the IP address is the somewhat new bit, a little different from prior work. >>: I… >>: [indiscernible] system maybe you get, I mean if it's a [indiscernible] system the IP addresses are going to change, right. What do you do about that? >> Seungyeop Han: Sorry? >>: If it's a [indiscernible] system, right, [indiscernible] of a meaning. >> Seungyeop Han: That's a good question. For example, we could think about using NAT or a proxy for hiding in between, so that many different hosts can share an IP address. On the one hand, we want to give control; we are not trying to block the tracking ability entirely. But is your question about whether we can change the IP address even under NAT? >>: We'll take it offline. >> Seungyeop Han: Yeah, sure. As mentioned, I will talk about the application-layer design first. Let's assume that we have pseudonyms available. Then the application needs a way to determine how to use them. But that depends on the user and the application. Sometimes people might want every packet to be different. Or they might change the pseudonym for each different account, or on tab changes, or per domain name, etcetera. As system designers we don't actually know which pseudonym should be used. Instead we are trying to build a flexible way for each application to define its own policy. For example, in web browsing a policy can be defined as a function of the request information or the state of the browser. It may include something like a unique ID for each window or tab. >>: Is the mapping between activities and pseudonyms one to one? >> Seungyeop Han: It doesn't need to be. Activity to pseudonym is n-to-one, and if activities are mapped to the same pseudonym they can be correlated on the remote side.
It depends on at which level you want to allow them to see your activities as correlated. >>: It's many to one. Can I have two pseudonyms for an activity? >> Seungyeop Han: No, not exactly. I don't say it's impossible; it may be possible in a little more [indiscernible]. But an activity here is more like per request, we can say. >>: [indiscernible] >>: [indiscernible] quality design, like I use multiple identities in the Chrome browser and sometimes even I don't know which identity I should be using. >> Seungyeop Han: That's fair. >>: Why doing a certain task? >> Seungyeop Han: Yeah. >>: I don't know how the computer will decide that. >> Seungyeop Han: That's fair. That's a separate question to answer, actually. >>: Then that's it, that's the question I had in mind… >> Seungyeop Han: Yeah. >>: If you cannot do it, then some people probably cannot do it as well. >>: What you're saying, for example, is if you have a medical application you might use a distinct pseudonym for that, and if you have some media-related stuff you might use another, so they make very coarse policies that allow you to do that. Is that the level at which… >> Seungyeop Han: Yeah. People have looked into, for example, differentiating cookies so that when you are accessing facebook.com you have one pseudonym, and when you are accessing Bing.com you have another pseudonym. There are many different types of policies in the literature. We are trying to make it possible to implement those policies, but we haven't had a chance to look in detail into what kind of policy would be most effective. That's an example of what kind of policies could be possible. By default, every request uses the same pseudonym. In this example facebook.com can know that the user is reading some specific article on some news site, because it can correlate the cookie from its previous login to Facebook.
It's coming from the like button there back to facebook.com. On the other hand, we can think about the more extreme case: every request has a different pseudonym. But there are many different pseudonym policies in the middle. For example, we can change pseudonyms according to the domain name of the page the user connected to, which means that when a user visits news.com all the images and scripts use the same pseudonym, but a different pseudonym is used when the user visits facebook.com. In this case, as in the previous extreme case, Facebook cannot correlate the user's visit to Facebook with what articles they are reading on some other website. >>: This is a tradeoff between privacy and convenience, right. Because our [indiscernible] P two different from P three. That means I won't be able to click like… >> Seungyeop Han: No, the thing is, well, it's from one of the previous works, not by me, but the problem with the like button is that even if you don't click like, it's giving information to Facebook. It's possible that when you do click like, it connects to the social [indiscernible] site at that time. Again, sure, there will be some tradeoffs. But we are not selling this specific policy as effective or efficient. The claim is that there are multiple policies we can build. Again, I briefly explained how policies can limit servers' tracking ability. Let's move on to how we get support from the network layer, especially how we can assign many addresses to a single host. To support the pseudonym abstraction we need to consider several things. First, if we assign an IP address per pseudonym, a host needs to have many IP addresses. Then those many addresses should be properly mixed: if they are clustered together, trackers can easily figure out that those clusters are coming from a single host.
On the other hand, if we just randomly assign addresses within the network, so that each host has some random address from the network, then the problem is the routing table could just [indiscernible]. We need a design that resolves these issues. Let's first talk about how we can provide many IP addresses. It's pretty simple: we are moving toward IPv6, since we have essentially run out of IPv4 addresses. In IPv6 even a small network will get a slash sixty-four IP block, which is larger, even much larger, than the whole IPv4 address space. With IPv6 we'll have an environment where a host can get many IP addresses rather than just one. If we look at the IP address in a packet, the first part is used to route the packet to the network, and the second part is used to route the packet within the network. We realized that as long as the network can deliver the packet to the end host, the address can encode lots of different information. Also, with the long, one-hundred-twenty-eight-bit IP address, we have much more flexibility to encode information into an address. What we did is devise a very simple technique to assign seemingly random addresses to one host while still routing packets efficiently. We divided the second part of the address into three sub-parts. The first is a subnet ID, the second is a host ID, and the third is a pseudonym ID, which is pretty much randomly assigned. The first two are similar to what we currently have on the internet; we do longest-prefix matching to route the packet. Then we encrypt these three sub-parts together into an encrypted ID using symmetric encryption. End hosts know only encrypted IP addresses, so whenever a host needs a new IP address it sends a request to the network. The network knows the subnet and host ID, assigns a new pseudonym ID, and after encryption gives the address to the end host.
Routers use the base addresses to forward packets. They can decrypt those address parts and see the original subnet ID and host ID to route the packet. As mentioned, these two parts and the routing mechanism are unchanged from the current internet. We can still have the same size of routing table and the same efficiency of the routing protocol. Let me show it with an example. When the destination server sends a packet, the address has a prefix and an encrypted ID. Because of the prefix, the network can use something like BGP to figure out the packet should go to that network. After it arrives at the network, the router can decrypt the encrypted ID part and see the next [indiscernible], in the same way as we do currently. It repeats the operation until the packet arrives at the end host. By this, again, we keep an efficient routing algorithm while many, many IP addresses can be assigned to a single host. As a proof of concept we implemented a prototype which approximates our system design. Because we didn't have an IPv6 [indiscernible] to control, sorry, let me step back. We didn't have an IPv6 [indiscernible], but there was an IPv6 tunnel broker we could use for building this. Let me explain how we approximated the design. For the policy engine we built a browser extension; at the time we were using Chrome. The browser extension can have policy functions in JavaScript that say which pseudonym should be used for which activity. As mentioned, we used an IPv6 tunnel broker connected to our gateway server. The gateway server maintains IP addresses from that network and works as a web proxy. When our extension sends a request it tags the request with a pseudonym ID. The gateway assigns the IP address matching that pseudonym ID to the outgoing socket. Any questions? No.
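The encrypted interface-ID scheme can be sketched as follows. The talk only says "symmetric encryption", so this sketch uses a small hash-based Feistel permutation over the 64-bit interface ID as a stand-in cipher, and the field widths (16-bit subnet ID, 24-bit host ID, 24-bit pseudonym ID) are assumptions for illustration, not the system's actual layout.

```python
import hashlib

MASK32 = 0xFFFFFFFF

def _round(half: int, key: bytes, i: int) -> int:
    """Round function: a keyed hash of one 32-bit half, truncated to 32 bits."""
    h = hashlib.sha256(key + bytes([i]) + half.to_bytes(4, "big")).digest()
    return int.from_bytes(h[:4], "big")

def encrypt_iid(subnet: int, host: int, pseudo: int, key: bytes) -> int:
    """Pack (subnet, host, pseudonym) into 64 bits, then apply a 4-round
    Feistel permutation so the interface ID looks random on the wire."""
    block = (subnet << 48) | (host << 24) | pseudo   # 16 + 24 + 24 bits
    l, r = block >> 32, block & MASK32
    for i in range(4):
        l, r = r, l ^ _round(r, key, i)
    return (l << 32) | r

def decrypt_iid(enc: int, key: bytes) -> tuple:
    """Routers holding the key invert the permutation to recover the
    subnet and host IDs needed for longest-prefix forwarding."""
    l, r = enc >> 32, enc & MASK32
    for i in reversed(range(4)):
        l, r = r ^ _round(l, key, i), l
    block = (l << 32) | r
    return block >> 48, (block >> 24) & 0xFFFFFF, block & 0xFFFFFF
```

Because encryption is a permutation of the 64-bit space, two pseudonym IDs for the same host yield unrelated-looking addresses, while the network can always recover the routing fields.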
Based on the prototype, let's look at part of the evaluation. First we wanted to examine whether we can build various policies in our design. To that end we looked into the protection mechanisms out there; as mentioned, there are several papers on how to protect cookies, or on third-party tracker blocking, etcetera. We could implement various protection mechanisms from the related work in a cross-layer manner, meaning we could cover IP and cookies at the same time in the protection mechanism. Most of them are pretty straightforward, and they include the very simple trivial or extreme cases mentioned in the previous slides, or ones depending on a little more information, like the per-first-party-domain policy I showed previously. Then we looked at the tradeoffs: how many activities are exposed to third parties. Here we are looking at third-party blocking along with the number of pseudonyms, because if you change the pseudonym on every request it limits tracking quite a bit, meaning the third party cannot track any of the activity, but it needs quite a large number of pseudonyms. We collected traces for three days with end users, a pretty small trace. The red line shows the average number of activities observed by a third party. The middle policies could effectively reduce the number of exposed activities while not requiring too many pseudonyms. In summary, I introduced a new abstraction called pseudonym, which allows flexible user control over unlinkable identities. To enable that we provide a new network addressing and routing mechanism which exploits the [indiscernible] IPv6 address space. Our system enables various policies with an expressive policy framework. Before finishing up I will briefly introduce some of our other work and future directions.
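As a concrete example of the policy framework just summarized, the per-first-party-domain policy from the earlier slides might be expressed as a small policy function like this sketch. The class and method names are hypothetical, not the actual extension's API, which was written in JavaScript.

```python
from urllib.parse import urlparse
import itertools

class PerFirstPartyPolicy:
    """One pseudonym per first-party domain: all sub-resources loaded while
    visiting news.com share news.com's pseudonym, but a visit to
    facebook.com gets a fresh one, so an embedded like button cannot link
    activity across the two sites."""

    def __init__(self):
        self._ids = {}                     # domain -> pseudonym ID
        self._counter = itertools.count()  # source of fresh pseudonym IDs

    def pseudonym_for(self, first_party_url: str) -> int:
        domain = urlparse(first_party_url).hostname
        if domain not in self._ids:
            self._ids[domain] = next(self._counter)
        return self._ids[domain]

# Usage: every request is tagged with the pseudonym of the page it came from.
policy = PerFirstPartyPolicy()
tag = policy.pseudonym_for("https://news.com/article")
```

Swapping in a different `pseudonym_for` gives the other policies from the talk, for example returning a fresh ID on every call for the extreme per-request case.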
In addition to the security-related work I just presented, I have explored many different research directions, [indiscernible] systems but with connections to other fields like [indiscernible] computing or machine learning. Actually I had lots of collaboration with MS [indiscernible] folks. For example, I have worked on voice and vision interactions with mobile devices: how we could enable natural language interfaces in smartphone applications, or continuous mobile perception and a DNN execution engine for mobile devices. I also did some HCI and machine learning and a little more security-related work as well. Looking forward, I plan to continue research in systems and security or privacy. One example I'm looking into is a new scalable micropayment system. We know that Bitcoin or blockchain has some potential to be practically used, but it has several limitations, including its scalability. I'm looking into how we can build a scalable blockchain mechanism, especially for cases like incentivizing Tor users to pay the [indiscernible]; that's micropayment between peer-to-peer systems. The second one is a computer-vision-related direction, but more from the systems side. Many researchers have looked into how we can speed up training, because it takes so much time. But in the future, if many applications are using DNNs for their services, apparently we'll have lots of requests coming to the cloud, and the cloud service needs to serve those requests more efficiently. It would be a very interesting question, from the systems point of view, how we can design the scheduling or resource allocation for handling those large numbers of DNN requests. Finally, the last one is related to both DNNs and privacy, as we are going to see more wearable devices like Google Glass.
Or, I want to say, AR devices like HoloLens, etcetera. In that kind of scenario we can expect continuous vision to pose a lot of privacy challenges as well, because there will be lots of input from the devices, and the question is how applications can handle that input without violating privacy too much. Also, as shown, I have very broad interests across many different areas in computer science. I hope to collaborate not only in systems but in other fields as well. This is the final slide. Today I presented how we can give users more control in untrusted or unreliable environments, with two different systems as examples. Thank you very much, and I will be happy to take any remaining questions. [applause] >>: What does [indiscernible] when you look at the future work that you're proposing. >> Seungyeop Han: Yeah. >>: It seems like a [indiscernible] broad [indiscernible] of things. For example there's some privacy and wearable stuff. What's the connection? How do we make sense of your interests across all of them, what's the connection between these things basically? >> Seungyeop Han: Well, sure, the connections are somewhat between pairs of items, something like that. I can't say there's an overall single theme I want to work on. But I want to try to figure out what kinds of interesting problems are coming up given new emerging systems, like micropayments might be, or computer vision through wearables. There are multiple different kinds of problems there. >>: [indiscernible] see is a connection between micropayments and the MetaSync work and your [indiscernible] network and your [indiscernible] scalable work. But for privacy in wearables, what's the angle, or what is the connection there? >> Seungyeop Han: It's hard to say that it's connected to those two.
It's connected to some of my other privacy-related work. It's also somewhat natural to be interested in this problem because I've been working with wearable devices, I mean mobile devices, on how we can process computer-vision-related tasks. >>: Actually, so there's some mobile, okay. >>: Pseudonym… >>: But he's done some work on mobile… >>: I had a question. >>: [indiscernible]. >>: This is maybe more of a nitty-gritty question rather than the vision question that [indiscernible] was alluding to. But in MetaSync, [indiscernible] asked this question about trying to use a storage service that provides a block storage interface, as opposed to, you know, Google Drive, which provides [indiscernible] stored files, right. You went with the latter, which is more user facing. >> Seungyeop Han: Right. >>: But one of the applications that gets used pretty aggressively in that setting is shared document editing, right? Did you consider that as a workload? Because if you're editing the same file, I wasn't able to get an idea of what entry goes into the log. You showed us a Paxos log, a log of operations. >> Seungyeop Han: Right. >>: But it's very possible that multiple users are editing the same file. That would result in a lot of round trips just to reach agreement on what edit should go in next. >> Seungyeop Han: Right, that's a good question. Again, that's somewhat similar to writing a paper together with colleagues through git. I'm not sure how many conflicts you've had; I had quite a few. >>: I think that is a different setting, where you have multiple files and can divvy up your files ahead of time. But just consider Google Docs, for example.
>> Seungyeop Han: Right, right. It's not targeting Google Docs-like collaborative editing; that's kind of a special application in this setting. It would potentially be possible to build such an application on top of these APIs. But currently what we have is a folder, and the Paxos log determines what the next version of the current folder is. >>: Not at the file level but at the folder level, okay. >> Matthai Philipose: Thank you. >> Seungyeop Han: Thanks. >>: [indiscernible] >> Seungyeop Han: Sure. >>: Who's after me?