>> Vivek Narasayya: Okay, so welcome everybody. I'm Vivek Narasayya and today it is my pleasure to introduce two speakers from MIT. They are both Ph.D. students working with Samuel Madden. We are going to have two independent topics and two different projects that they are working on. We will talk about that, but before they get started we also have a surprise speaker. Her name is Elizabeth Bruce. She is the executive director of the Big Data Initiative at MIT. She just wants to give you a very quick overview of what the Big Data Initiative is and then we'll have the two technical talks. So with that here's Elizabeth. >> Elizabeth Bruce: All right, thanks Vivek. So I'm going to be super fast. I just thought I would give some context, because Microsoft is one of the founding sponsors and members of our Big Data Initiative at MIT. So I'll just tell you like in two seconds what we do with the initiative, what our focus is and then of course turn it over to the students to give the talks. So the big data research initiative was founded in 2012. Basically we are based at CSAIL, which is the Computer Science and AI Lab at MIT, which is the biggest interdepartmental lab at MIT with over 100 PIs, about 1,000 members and about 500 or so graduate students, PhD, masters and some undergraduates. So the Big Data Initiative is really a cross section of about 35 faculty that kind of cross everything from data ingest, data storage, through predictive analytics and machine learning, through visualization and different application areas. So we try to really pull together the research themes around the computation, the systems and the tools that are needed to manage big data with the predictive analytics and kind of algorithms that need to be developed for big data, as well as big data and privacy, which we feel is –. As we started running workshops on the challenges of big data, privacy is something that comes up almost every time you have a conversation. It's complex, because it's not just a technology solution like encrypt everything, but it has to do with kind of policy, technology and applications. So we've actually run a number of workshops for emerging technologies for managing privacy. Then the last of course is the applications or kind of these domain areas where we are applying and looking at the new emerging ways to use big data, whether it's in predicting crime, predicting heart attacks, predicting learning and so on. There are a lot of applications and researchers at MIT that are working in these areas. So very quickly that's the research theme. The initiative itself really tries to bridge the gap between research and industry. So it's real industry collaboration. Microsoft as well as 12 other companies partner with this initiative. We seed fund new projects in big data. We also try to run workshops. We have meetings, student poster sessions, and all this great stuff. We ran a big data challenge last year where we released a year's worth of taxi data from the city of Austin and combined it with Twitter data, other public transportation data and basically had prediction and visualization challenges. So we try to do fun stuff like that and get big data sets available for students to use. We also this past year ran a workshop with the White House on big data privacy. So we try to look at how we can have impact and reach out to the community beyond just MIT. So that's it in a nutshell.
As we go forward one of the things that I would like to focus on and have been focusing on of course is working with industry partners. So, all those things around big data and data science and how to increase innovation and get things out of the research lab into industry in practice, how can we accelerate that? How can we help foster that? And having meetings, getting people together, giving talks, all of those things are great, but we're also looking at ways to do better prototypes, demonstrations, use cases, modeling and working with our partners. So anybody who has input on that I would love to talk to you. So I look forward to visiting Microsoft over the next couple of days and I'm going to turn it over to Rebecca who will give the first talk. All right, thank you. >> Rebecca Yale Taft: All right. Thanks Elizabeth and thanks Vivek for having us. Thank you all for coming. So as Elizabeth mentioned I'm Rebecca Taft. I am a student in the database group at MIT. So today I'm going to tell you about E-Store, which is an elastic database system that we developed over the last year. This is a collaboration mostly between MIT and the Qatar Computing Research Institute, QCRI, but you can see we are also representing Northwestern, U Chicago and CMU, so covering all our bases. So the problem we are trying to solve is that many OLTP workloads suffer from extreme skew. Just to give you a few examples: 40-60 percent of the New York Stock Exchange trading volume is just on 40 specific stocks. Also, many applications have variation throughout the day. So as you would expect most applications have a heavier load during the day when people are awake, but then you can also think of some applications. One of my favorites in Cambridge is Insomnia Cookies, which delivers cookies to your door any time of day or night. You can order them online. So I would imagine in their case the peak load is somewhere between midnight and 3am. So some of the examples are like that. Other applications have seasonal variations. So ski resorts are obviously going to have heavier load in the winter months. Then you can think of outdoor water parks, which are likely to have heavier load in the summer. We can also have load spikes. So it's well documented that the first and last 10 minutes of the trading day have 10x the average volume. Then lastly one thing we like to talk about is the hockey stick effect. So if a new application goes viral or a new product is released from an existing company that gets very popular very quickly, the IT staff are often struggling to provide enough IT resources in order to meet that demand. Okay, so skew clearly exists, but is it actually a problem? So in order to answer that question we looked at 3 different workloads. The first one was uniform data access. The second one was sort of a gradual low skew where two-thirds of the queries access the top one-third of data items. Then we also looked at a very high skew case where there were just a very few, very hot items that are getting most of the accesses from the transactions. So we ran a YCSB workload with these three different skews and what we found was that skew actually matters a lot. As you can see from these pictures, compared with no skew, high skew actually increases [indiscernible] by up to 10x and can decrease the throughput by 4x. And this is especially a problem in shared nothing systems, because resources that are idle can't easily help out resources that are overburdened. So how do we solve this problem?
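As a toy illustration of why this matters (my own sketch, not the actual YCSB driver or any E-Store code), you can generate accesses with and without a hot set of keys and see how unevenly they land on range-partitioned nodes; the parameters below simply mirror the low-skew setting described above.

```python
import random
from collections import Counter

def generate_accesses(num_keys=30_000, num_accesses=100_000,
                      hot_fraction=0.0, hot_share=0.0):
    """Toy access generator: a hot_fraction of the keys receives hot_share of all accesses."""
    num_hot = int(num_keys * hot_fraction)
    for _ in range(num_accesses):
        if num_hot and random.random() < hot_share:
            yield random.randrange(num_hot)                       # hit one of the hot keys
        else:
            yield num_hot + random.randrange(num_keys - num_hot)  # hit one of the cold keys

def per_partition_load(accesses, num_keys=30_000, num_partitions=3):
    """Count accesses per partition under simple range partitioning."""
    keys_per_partition = num_keys // num_partitions
    counts = Counter(min(key // keys_per_partition, num_partitions - 1) for key in accesses)
    return [counts[p] for p in range(num_partitions)]

# Uniform access vs. "low skew" (two-thirds of the accesses go to the top third of the keys).
print(per_partition_load(generate_accesses()))
print(per_partition_load(generate_accesses(hot_fraction=1/3, hot_share=2/3)))
```

Under range partitioning the hot keys all live on one partition, so that partition absorbs most of the load while the others sit mostly idle, which is exactly the imbalance the talk is describing.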
One solution that many companies use is to just provision resources for peak load. So what that means is that 90 percent of the time you have a whole bunch of idle resources sitting around doing nothing and they are ready for that 10 percent of the time where you have peak load, but of course that's very expensive because you are not using these resources most of the time. Another option is just to limit the load on the system. So the system will perform fine most of the time, but then at peak load things could crash and it can just perform terribly, which is not really acceptable for applications that require high availability. Obviously a desirable solution would be to enable the system to elastically scale in or out to dynamically adapt to changes in load, which is what this work is about, although obviously it takes a little bit more thought than the first two options. So what would this elastic scaling look like? So suppose we have a horizontally partitioned database like you see here. It's a very simple schema, just a key value store effectively. We have three machines: the first one has keys 1-4, the second one keys 5-8 and the third has keys 9-12. Now suppose keys 9-12 are super popular, so this machine is under heavy load, on fire, so how are we going to handle this? A simple option is just to add a fourth machine and offload some of the data items from the third machine to the fourth. You could also have another scenario. Here we have the same schema, a key value store, same partitioning, but in this case we just have one key that's very hot, key number 9. Suppose the rest of the cluster is handling the load just fine, it's not overloaded. In this case what we would really like to do is give key 9 a machine all to itself and offload keys 10-12 to the other machines. So this kind of brings us to the question: what do we do if only a few specific tuples are very hot? So as I alluded to in the previous slide you probably want to actually deal with them separately, which is why we propose this two-tiered partitioning scheme. So in the top tier we are dealing with individual hot tuples and mapping them explicitly to partitions. Then at the bottom tier we are just dealing with larger blocks of cold tuples, which are either hash or range partitioned at coarse granularity. So there are a few possible implementations for this. The first one is we can do fine-grained range partitioning. You could also think of doing consistent hashing with virtual nodes, or using a lookup table for the top tier, and then the bottom tier could just be any standard partitioning scheme, either hash or range partitioning. In this work we are using fine-grained range partitioning, just because it was the simplest to implement, but any of these three options would work fine. It's important to note though –. Oh, yeah sure a question? >>: [inaudible]. >> Rebecca Yale Taft: Yeah, that's a great question. So the question was: are we assuming that these hot tuples are always hot or are they changing over time? So the system is supposed to be adaptive and I will go into that in more detail, but basically we are not assuming that the hot tuples are always the same. They might change over time, in which case we need some sort of threshold. If they are only hot for a second we don't want to reconfigure the system to deal with that case, but if we see that now Justin Bieber's tweets are super popular then we want to handle those. But, if a new celebrity comes along in the future then we will deal with that separately. >> [inaudible].
>> Rebecca Yale Taft: Exactly, I will go into that a little bit later. So most existing systems are one-tiered as opposed to this two-tiered approach. So they only partition data at a very coarse granularity and therefore can't handle the case that I just talked about with a single hot key or a couple of hot partitioning keys, which is why we present E-Store, which is the end-to-end system that we've built which extends the H-Store database system, which I will talk about in a second. It uses automatic, adaptive, two-tiered elastic partitioning. So I'll break that down a little bit in the next few slides. But here is the obligatory architecture diagram. So you can see that this was built on top of a database. In this case H-Store, but it could be any shared nothing main memory database. Then E-Store itself consists of two main components: E-Planner and E-Monitor. So as the name implies E-Monitor is responsible for monitoring the system and detecting when there is a load imbalance. So it uses some coarse-grained system-level statistics and also some fine-grained tuple-level monitoring, which I will talk about in more detail, but then it passes that data to E-Planner, which actually figures out the best way to re-partition the system so as to balance the workload. So E-Planner is not actually moving the data, it's just figuring out the best new partitioning scheme. So then E-Planner passes this new plan to the Squall system, which is actually a separate project outside of E-Store, but we did a lot of work on Squall for the E-Store project and tuned it specifically for this system. So I'll talk a little bit about that, but we just got it accepted to SIGMOD. So if you're interested in learning more about Squall you should check out SIGMOD in Australia next year. All right. So for those of you not familiar with H-Store I'll just give a quick overview. So as I mentioned it's a main memory, shared nothing database. Data is horizontally partitioned across the cluster in the same way as in that toy example I gave previously, where you have some number of keys on one partition and the other keys on the other partitions. So in this case each node is responsible for some number of partitions, which is probably equal to the number of cores on that machine, because as you can see here each partition is served by one single-threaded execution engine, which is going to actually be pinned to a specific core on that machine. So another feature of H-Store is that it is optimized for stored procedures, which are basically parameterized SQL code intermixed with Java control code. So in order to run a transaction a client simply has to pass the name of the stored procedure along with the input parameters and that's very efficient. It can handle general SQL queries as well, but those are much less efficient. All right. So I will talk a little bit about the E-Store life cycle now. So under normal operation H-Store is running normally and then we have in the background constant high-level monitoring. We are just basically checking the CPU usage every minute or so. I will talk about that in more detail, but as soon as we detect an imbalance more detailed monitoring is triggered. So in this case we are actually counting the number of times each tuple in the database is accessed. And this is not free, there is some overhead, but we are only turning this on for a short period of time. So we think it's a reasonable tradeoff.
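Going back to the two-tiered scheme at the heart of E-Store, here is a minimal sketch of the kind of lookup structure it implies: an explicit map for individual hot tuples consulted before coarse range blocks for everything else. This is my own simplified illustration with made-up keys and partition numbers, not E-Store's actual code.

```python
import bisect

class TwoTieredPlan:
    """Toy two-tiered partition plan: explicit hot-tuple map layered over coarse cold ranges."""

    def __init__(self, hot_map, range_bounds, range_partitions):
        # Tier 1: individual hot keys mapped explicitly to partitions, e.g. {9: 3}.
        self.hot_map = hot_map
        # Tier 2: sorted lower bounds of the cold blocks and the partition owning each block,
        # e.g. bounds [1, 5, 9] with partitions [0, 1, 2] means [1,5)->0, [5,9)->1, [9,...)->2.
        self.range_bounds = range_bounds
        self.range_partitions = range_partitions

    def lookup(self, key):
        # Hot tuples are checked first and override the coarse ranges.
        if key in self.hot_map:
            return self.hot_map[key]
        idx = max(bisect.bisect_right(self.range_bounds, key) - 1, 0)
        return self.range_partitions[idx]

plan = TwoTieredPlan(hot_map={9: 3}, range_bounds=[1, 5, 9], range_partitions=[0, 1, 2])
print(plan.lookup(9))   # 3: the hot key gets the partition it was explicitly assigned
print(plan.lookup(10))  # 2: cold keys fall back to coarse range partitioning
```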
Now once we've collected this data we pass it to E-Planner, which actually figures out the best way to repartition the database in order to balance the workload, and it also figures out whether we need to either add or remove machines if the entire system is overloaded or under loaded. Then it passes this new partition plan to Squall, which does online reconfiguration. So it's important to note that during this entire life cycle the system never goes offline. It's always executing transactions and trying to have a minimal impact on performance. Then once reconfiguration is done we go back to normal operation and this is supposed to be kind of a continual life cycle. So as soon as we detect a new imbalance we might do a new reconfiguration. We can constantly adapt to changes in the workload. All right. So now I'll talk about each of these in more detail. So for the high-level monitoring, as I mentioned, we're collecting system statistics every minute or so and we use the CPU specifically to determine whether to add or remove nodes or whether we just need to reshuffle the data. This works especially well for H-Store since, as I mentioned, individual data partitions are pinned to specific cores on the machine. So if we see that one particular core has close to 100 percent CPU usage we know that particular data partition is probably responsible for the overload. And CPU usage is cheap to collect so it's a good high-level monitoring tool. Oh yeah, question? >>: So one minute sounds like fairly fine granularity, but in fact you are executing zillions of things per second and one minute can be an eternity. So you could have dramatic changes in load that appear at a much finer granularity than one minute. >> Rebecca Yale Taft: Yeah, that's definitely a good point and our system is not really designed to handle those sorts of very, very fast shifts. We are sort of handling the case of more gradual shifts over the course of a few minutes or hours. >>: Do you have any usage data of how a load might shift and why? >> Rebecca Yale Taft: Uh, like a real example? >>: Yeah. >> Rebecca Yale Taft: We don't have any real data that we've worked with, but just some of the examples that I gave in the beginning like the New York Stock Exchange and the variation throughout the day. >>: It's sort of a [indiscernible] question to ask. >> Rebecca Yale Taft: But, yeah, I will show some results with artificial benchmarks that we have used. We don't have any real data unfortunately. >>: Is the sampling rate kind of tied to how fast you're willing to reconfigure? In the sense that like H-Store presumably would be able to collect like once every 10 [indiscernible]. >> Rebecca Yale Taft: Right, exactly, yeah. We are doing the low level monitoring –. >>: [inaudible]. >> Rebecca Yale Taft: Right, exactly. So as soon as we detect a load imbalance, either the system is overloaded or under loaded, then we trigger more detailed monitoring. So that's the next phase of the life cycle and this is when we actually collect tuple-level statistics. So this low-level E-Monitor finds the top 1 percent of tuples that are accessed per partition, where an access means the tuple is either read or written during a 10 second window. Then it also finds some aggregate statistics on the total access count per block of cold tuples and for each entire partition. So we use these access counts as a proxy to determine the system load on each partition.
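A rough sketch of what that 10 second window of detailed monitoring boils down to (my own simplification of E-Monitor, with assumed data shapes, not the real implementation): count accesses per tuple, then report the top 1 percent per partition along with aggregate counts per cold block and per partition.

```python
from collections import Counter

def summarize_window(accesses, block_size=1_000, top_fraction=0.01):
    """accesses: iterable of (partition_id, key) pairs observed during the ~10 second window."""
    per_tuple = Counter(accesses)                             # (partition, key) -> access count
    per_partition = Counter()
    per_block = Counter()
    for (partition, key), count in per_tuple.items():
        per_partition[partition] += count                     # total load per partition
        per_block[(partition, key // block_size)] += count    # aggregate load per cold block

    hot_tuples = {}
    for partition in per_partition:
        tuples = sorted(((key, count) for (p, key), count in per_tuple.items() if p == partition),
                        key=lambda kc: kc[1], reverse=True)
        top_k = max(1, int(len(tuples) * top_fraction))       # top 1% of accessed tuples
        hot_tuples[partition] = tuples[:top_k]

    return hot_tuples, dict(per_block), dict(per_partition)
```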
And using access counts as a proxy actually works pretty well in the case of H-Store because we are dealing with a main memory system so we don't have to worry about disk IO or anything like that. We are also dealing with an OLTP workload so we don't have to think about complex analytics or anything. So actually tuple access count is a pretty good proxy for system load. And as I mentioned there is some minor performance degradation while we are collecting these statistics, because during each transaction we have to increment the access count for every tuple that is accessed. But, since it's only a 10 second window we figure it's a reasonable tradeoff. As you can see here, this is an example of the data that we collect during this 10 second window. In this case we have a cluster with 30 partitions, so 5 machines, each with 6 partitions, and you can see we have the access count for each partition. This is, you'll notice, a log scale. So this is a high skew case. Partition 0 has many, many more accesses than the remaining 29 partitions. >>: Why not track CPU, like CPU cycles [inaudible]? >> Rebecca Yale Taft: Well we do that at the coarse-grained level, but it's not really accurate enough to actually figure out exactly which data is causing –. >>: Are the accesses really uniform in cost? [inaudible]. >> Rebecca Yale Taft: Yeah, no that's a good point, but the problem with CPU is that it's not actually specific to individual data items. It's just at the level of the partition. So with the tuple access counts we are saying, "Okay, this individual tuple is getting accessed a lot. So this guy is responsible for a large percentage of the workload." >> Sure, sure. >> Rebecca Yale Taft: So it's a relative scale. We're not necessarily saying that if this tuple was accessed 100,000 times that means it's a hot tuple. It's the top 1 percent of all the tuples that are accessed. Does that make sense? >>: Yeah, it does. >> Rebecca Yale Taft: Okay, so once we have all these statistics we pass them to E-Planner, which uses the current partitioning of the data as well as the system statistics and the individual tuple counts from E-Monitor, and it figures out whether to add or remove nodes. So if the overall average CPU across the cluster is very high it decides to add nodes, and likewise if the overall CPU usage is very low it will take some nodes away. Otherwise it might say, "Okay, the average CPU is fine, but these particular partitions are overloaded so we need to just reshuffle the data." And it also figures out exactly how to balance the load. So this is basically an optimization problem. We want to minimize the data movement, since actually physically moving the data is not free, while also balancing the system load or doing our best to come close to balanced. So in order to solve this optimization problem we tested five different data placement algorithms and I will go into each of these in a little more detail, but the first two are basically Integer Programming, very computationally intensive algorithms. And the reason we implemented these two even though they are not really practical for real-time systems was just to show the one-tiered and two-tiered approaches in their best light, since theoretically given the right constraints these algorithms will produce an optimal packing of tuples. Then we also looked at three different heuristic algorithms: First Fit, which I will go into a little bit later; Greedy, which is an attempt to balance the workload by just moving the hot tuples; and then Greedy Extended, which is like Greedy.
It moves the hot tuples, but if the system is still not balanced it will also consider moving blocks of colder tuples. What we found is that Greedy Extended performs the best out of all of these algorithms and it's also very fast. So I will go through –. Yeah? >> So two questions: one is data movement is expensive. Do you actually have to move the data or do you just reassign it to a different partition? I mean isn't it all in main memory? >> Rebecca Yale Taft: It is, but remember that we are dealing with different physical machines. So we still have to actually move it to a different machine. >>: [inaudible]? >>: So maybe I misunderstood, but it looked to me like you had one thread per core and aren't you going to be moving between cores on the same machine? Isn't that cheaper? >> Rebecca Yale Taft: That's true, it is cheaper to move on the same machine versus moving to a different machine, which we are not actually modeling in E-Store, but that would be a great extension to this. >>: Okay and so when you partition the data is it done by ranges or is it some sort of hash partition? >> Rebecca Yale Taft: We are using ranges. So fine-grained ranges: a single tuple is represented as a range, like 1 to 2 represents tuple 1, because we have an inclusive lower bound and an exclusive upper bound for each range. >>: So if I have some [inaudible] in some range that could be more or less convenient to move that, right? I mean you have to sort of split the range in some way don't you? >> Rebecca Yale Taft: Yeah, but that only affects the partition lookup. It's not actually changing the data layout other than moving that single tuple. It's just when you are executing transactions and trying to figure out which partition you need to access for each transaction, it makes the hashing ever so slightly slower because you now have two ranges you have to check instead of just one. >>: So you can assign multiple ranges to a single core? Is that what you are saying? >> Rebecca Yale Taft: Yeah, yeah. >>: Oh, I see, all right. >>: Are the ranges that you keep track of themselves adaptive? Which is to say if you see that the number of accesses for a particular range is high do you gradually divide that into smaller ranges, kind of like adaptive histograms? Do you do that kind of thing? >> Rebecca Yale Taft: No, that would be a great idea. We are only dealing with just two tiers. Either individual hot tuples or blocks of some uniform size. So in this example I think we are dealing with blocks of size 1,000 tuples. But yeah, that would be a great extension. Okay. All right. So I will go through in detail how the Greedy Extended algorithm works. Suppose we are starting out with this partition plan. This is the YCSB workload, which just has one user table. And here we have our database partitioned across three partitions: 0, 1 and 2. You can see 0 has keys 0 to 100,000, 1 has 100,000 to 200,000 and 2 has 200,000 to 300,000. So from E-Monitor we've been given that there are three hot tuples, keys 0, 1 and 2, and you can see the access count for each of these. Then for the colder blocks we also have the access count from E-Monitor. So I will tell you that this plan on the right actually perfectly balances the workload. But, how do we get there from this plan? So that's where the Greedy Extended algorithm comes in. All right. So this is what we have from E-Monitor. We have the original partition plan here and then you can see that this is very unbalanced right now.
Partition 0 is handling 77,000 tuple accesses over the 10 second window, whereas partition 2 only has 5,000. So if we perfectly balance the system the target cost per partition is 35,000 tuple accesses. So that's what we want to get to. Here again is the hot tuple list as well as the cost per cold tuple block. Yeah? >>: Is this, I think, for the entire cluster? >> Rebecca Yale Taft: Yes, this is the entire cluster. >>: So I am curious, isn't the problem simpler essentially [inaudible], because the moment you have hot nodes, so let's stop these scale up [inaudible], the moment you have hot nodes you can essentially treat each hot node individually. Essentially say that now your ILP formulation [inaudible]. Isn't divide and conquer a reasonable approach to this instead of trying to reoptimize the whole system? >> Rebecca Yale Taft: Yeah, that sounds exactly like what I'm about to show you. So I might misunderstand you, but take a look at this algorithm and see if this is what you had in mind. Yeah? >>: Well what I'm asking you is do you do it at a single node level? I mean is your query or data placement working on what each hot node essentially you can do [indiscernible]? >> Rebecca Yale Taft: Yeah, so that's what the Greedy Extended algorithm does. It makes very locally optimal choices. We also implemented Integer Programming algorithms, which I will go into in a few slides. So that is looking at the global scenario, but this is just a locally optimal movement between hot and cold partitions. But, I will walk you through it. So stop me afterwards if this is not what you had in mind. Did you have a question? >>: Do you take into account say correlation of how transactions access tuples? If a transaction consistently accesses 2 hot tuples in different ranges and you move one range to another machine you just made that transaction distributed [indiscernible]. >> Rebecca Yale Taft: That's a great question, yeah. So I will talk about that a little bit at the end, but this version of E-Store does not consider distributed transactions. That's our next work. Yeah? >>: I'll wait. >> Rebecca Yale Taft: Okay, all right. So the way this algorithm works is we take the hottest tuple from the hottest partition, in this case key 0 from partition 0, and move it to the coldest partition, which in this case is partition 2. So here we update the ranges and we update the projected cost and then we iterate. So now again we move the hottest tuple from the hottest partition to the coldest partition, which now in this case is partition 1. So we update the ranges, update the cost and iterate again. Now we are moving key 2 from partition 0 to partition 2. And now we've gone through all of our hot tuples, but you can see the load still isn't quite balanced. 0 now has 40,000 accesses and 2 has 30,000. So now we start working through the blocks of cold ranges. So we take the hottest range from the hottest partition and move it to the coldest partition and now you can see we are perfectly balanced. So this is kind of a contrived example obviously. It's not likely that we would get perfect balance in the real world, but we come within some threshold. >>: What's the goal of like perfectly balancing things? Like to even out latency or something? >> Rebecca Yale Taft: Yeah, to even out latency and also increase throughput. >>: I mean isn't there just a certain number of like operations per second that a particular node can do and if you're under that threshold who cares, other than potential latency?
If you push your node over the threshold of what it can efficiently handle –. >>: Yeah but, [indiscernible]. >> Rebecca Yale Taft: I mean I guess it depends on your workload, but if you're trying to handle as many transactions as you possibly can then a perfect balance is going to be better. >>: Does perfectly balanced always optimize for that? >>: That's what I kind of wondered. >> Rebecca Yale Taft: Yeah, it seems to be from our results, which I will show you in a few slides. Okay, so that's the Greedy Extended algorithm and I will just talk briefly about the other heuristic algorithms. So Greedy is just like Greedy Extended, but it stops after we've gone through all the hot tuples. So in this case we would end up with partition 0 at 40,000, partition 1 at 35,000 and partition 2 at 30,000. So it's pretty close to balanced, but not perfectly balanced. So this approach will work well if you have a few really hot tuples in the case of high skew, but if you're dealing with a lower skew situation it might not actually be able to sufficiently balance the workload. The other heuristic algorithm is First Fit, which first packs hot tuples onto partitions. So this is a global re-partitioning of the data, not taking into account where the data is currently residing. So it first packs hot tuples onto partitions, filling one at a time, and then once it's done with the hot tuples it packs the colder blocks of tuples, again filling the partitions one at a time. This results in a really balanced workload, so probably close to optimal, but the problem is since it's not taking into account where the current data resides the actual migration could be very expensive. Then as I mentioned we also implemented two Integer Linear Programming algorithms, which are intended to find the optimal packing given certain constraints. So for the Two-Tiered Bin Packer we try to optimally pack hot tuples and cold blocks. And the constraints are, as you would expect, that each hot tuple and cold block must be assigned to exactly one partition, and then also each partition at the end must have a total load that's less than the average plus 5 percent. So 5 percent is this little buffer zone. We could increase or decrease that, but 5 percent is what we used for the paper. The goal is to minimize the amount of data moved in order to satisfy the constraints. The One-Tiered Bin Packer is just like the Two-Tiered Bin Packer, but we are only working with blocks of cold tuples. We don't have access to the individual hot tuples. Both of these are very computationally intensive, but the idea is that they both show each of these approaches in their best light. So this will allow us to compare the Two-Tiered and One-Tiered approaches and see if Two-Tiered really does have an improvement over One-Tiered. Okay, so we use one of those 5 algorithms to create a new partitioning plan and now we pass that to Squall, which physically moves the data while the system is live. So this doesn't sound that hard, right, you just have to delete the data from one partition and insert it into another partition. But, what makes this tricky is that Squall conforms to the H-Store single-threaded model. So while data is moving no transactions can be executed. So if you're moving a large amount of data you're effectively taking the system down for some period of time while you move the data and then bring it back up. This is sort of like a stop and copy model if you're familiar with different kinds of migration schemes.
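Before moving on to the migration itself, here is a condensed sketch of the Greedy Extended heuristic walked through above. It is a simplification of the idea, not E-Store's actual implementation: repeatedly shed the hottest item from the currently hottest partition onto the coldest one, first for hot tuples and then, if the load is still outside the tolerance, for cold blocks.

```python
def greedy_extended(partition_load, hot_tuples, cold_blocks, tolerance=0.05):
    """
    partition_load: dict of partition -> projected access count over the window.
    hot_tuples:     list of (key, access_count, current_partition), hottest first.
    cold_blocks:    list of (block_id, access_count, current_partition), hottest first.
    Returns a list of (item, source_partition, destination_partition) moves.
    """
    target = sum(partition_load.values()) / len(partition_load)
    moves = []

    def balanced():
        return max(partition_load.values()) <= target * (1 + tolerance)

    def shed(items):
        for item, cost, src in items:
            if balanced():
                break
            hottest = max(partition_load, key=partition_load.get)
            coldest = min(partition_load, key=partition_load.get)
            if src != hottest:
                continue                      # only shed load from the currently hottest partition
            partition_load[hottest] -= cost   # update the projected costs...
            partition_load[coldest] += cost
            moves.append((item, hottest, coldest))   # ...and record the planned move

    shed(hot_tuples)            # first pass: individual hot tuples
    if not balanced():
        shed(cold_blocks)       # second pass: coarse blocks of cold tuples
    return moves
```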
But that stop-and-copy behavior is not really acceptable if your application requires high availability. So to avoid that, Squall moves small chunks of data at a time and interleaves those with regular transaction execution. So the size of the chunks and the amount of time between each chunk can be tuned in order to kind of walk this tradeoff between the length of reconfiguration and the performance impact during reconfiguration. So as I mentioned we did a little bit of tuning of Squall specifically for E-Store and one of the optimizations we added was that we immediately move data from the hottest partition to the coldest partition in order to get the benefit of the migration most quickly. Yeah? >>: [inaudible]. >> Rebecca Yale Taft: In this case it's just a single copy. Either the data is fully replicated, so we have a couple of tables that are read only or read mostly and those are fully replicated. Otherwise we just have a single replica. >>: So I'm still a little bit confused about the model. So if you are all on a single machine then moving data between one core and another simply means atomically changing how it's accessed as opposed to actual data movement, isn't that right? >>: [inaudible]. >> Rebecca Yale Taft: Yeah, it's not only a single machine, but you are right, it is cheaper to move between cores on the same machine. >>: So you're talking about actually separate CPUs with data movement between one main memory and another? >> Rebecca Yale Taft: Yeah. >>: Okay, great. >>: How do you ensure that you are not reading duplicate data? [indiscernible] >> Rebecca Yale Taft: Yeah, no that's a good question and that's a lot of what Squall does. It is making sure that the data movement is transactionally consistent and you are keeping track of the data and making sure that you are not doing duplicate reads. So again, this is a different project so I'm not going to go into too much detail on how that works, but we can talk afterwards if you're interested. Okay, so let's look at the results. So these are two time series charts of the transaction throughput of the system over a 10 minute window. Here we're looking at the two optimal Bin Packers: the One-Tiered and the Two-Tiered Bin Packer. The One-Tiered is this light dotted red line and the Two-Tiered is this kind of gold dashed line. >>: So what's the workload here? >> Rebecca Yale Taft: Sorry, this is a YCSB workload. The top one is with the high skew workload, you know, very few hot tuples, and the bottom one is the low skew. >>: Are the transactions accessing multiple data elements? >> Rebecca Yale Taft: No, so for this YCSB workload I think we had something like 25 percent updates of individual keys, or rather 15 percent updates and 85 percent reads of individual keys. >>: And each transaction accesses one data element? >> Rebecca Yale Taft: Yeah, exactly. YCSB does have a scan operation, which again, whoever asked the question left, but –. >>: It was Justin. >> Rebecca Yale Taft: Yeah, that would get into the possibility of creating distributed transactions when you move data. So we weren't concerned with that for this version of E-Store, so we just dealt with the individual-key transactions. >>: So these are single record transactions. >> Rebecca Yale Taft: Exactly, yeah. >>: How many machines and how much tier [indiscernible]? >> Rebecca Yale Taft: So this is 5 machines and 6 partitions per machine, so a total of 30 partitions. I think we had 24 gigs of RAM per machine. I mean all the data fit in memory. There was no anti-caching or anything like that.
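To make the chunked migration described a few moments ago concrete, here is a rough illustration (my own sketch with hypothetical engine hooks, not Squall's code) of interleaving small data moves with regular transaction execution on a single-threaded engine.

```python
import time

def migrate_interleaved(keys_to_move, move_chunk, run_pending_transactions,
                        chunk_size=100, pause_between_chunks=0.01):
    """
    keys_to_move:             keys that have to migrate to their new partition.
    move_chunk /
    run_pending_transactions: callbacks a real engine would supply; they are hypothetical
                              hooks here, used only to show the interleaving.
    chunk_size, pause:        the two knobs trading total reconfiguration time against
                              the performance impact felt during reconfiguration.
    """
    for start in range(0, len(keys_to_move), chunk_size):
        chunk = keys_to_move[start:start + chunk_size]
        move_chunk(chunk)               # the engine is blocked only for this small chunk
        run_pending_transactions()      # then regular transactions execute again
        time.sleep(pause_between_chunks)
```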
Okay, so one thing to note: this dashed grey line at the top is the throughput of the system with no skew. So this is kind of the theoretical best you could possibly do. We start the E-Store life cycle around 30 seconds in. So you can see for the low skew case we are starting at around 75,000 transactions per second and for the high skew case it's around 30,000 transactions per second. So then once we trigger E-Store we do the low-level monitoring, E-Planner figures out the best repartitioning of the data and then Squall moves the data. You can see that for the low skew case both the One-Tiered and the Two-Tiered Bin Packer perform similarly. They both increase the throughput to over 100,000 transactions per second. In the high skew case however, the One-Tiered Bin Packer barely has any effect at all whereas the Two-Tiered Bin Packer dramatically improves the throughput. So you can see that depending on the workload it makes a huge difference whether you're dealing with One-Tiered or Two-Tiered, and we are showing here that the Two-Tiered approach is a little bit more flexible and can deal with different kinds of skew. All right. So then we also compared the three different heuristic algorithms and you can see here that for the high skew case all 3 performed reasonably well, but for the low skew case the Greedy algorithm really couldn't improve the throughput at all, if anything it made it worse. Greedy Extended performs very well in both cases. First Fit also performs well, but you'll notice that because it's moving so much data it doesn't even finish the reconfiguration over the 10 minute window. So that's why you can see the throughput going up and down. So that's why Greedy Extended seems to be the best all-around choice. This is just another view of the same data including the latency. The main takeaway here is that the Greedy Extended algorithm has low latency and high throughput for both the high skew and low skew workloads. The other algorithms are all somewhere in the middle. Yeah? >>: [indiscernible]. >> Rebecca Yale Taft: We don't have that data unfortunately. >>: [indiscernible]. >> Rebecca Yale Taft: Yeah, no, this is after reconfiguration is complete. You can see in the previous slide reconfiguration is pretty fast in this case. >>: [indiscernible]. >> Rebecca Yale Taft: Yeah, exactly, in these experiments we are just mitigating that one skew, but in theory the skew could change over time and then we would continually reconfigure the system in order to accommodate that. >>: Yeah, it would be useful to get your hands on some real data that has that type of skew, because kind of the thesis of this whole thing is that things are constantly changing so that as a result doing these dynamic reconfigurations will significantly improve the throughput of the system. So for instance it would be nice to know, if you're using a 1 minute threshold, how long do the skew conditions that you are adjusting for typically last? >> Rebecca Yale Taft: Yeah, I think it depends a lot on the application. So it would be nice to look at real data, but in the absence of it we created our own. So anyway, I'm about done, but –. >> Vivek Narasayya: So can we wrap up in like 2 to 3 minutes? >> Rebecca Yale Taft: Yeah, I'm almost done. Okay, so as was mentioned earlier we are not really considering distributed transactions, which is okay most of the time for a lot of workloads where you are mostly dealing with transactions that only access a single partitioning key.
So in these 3 workloads, YCSB, which I just talked about for a while, Voter, which is a workload that sort of models American Idol, and TPC-C, which I'm sure many of you are familiar with, everything in TPC-C is partitioned on warehouse ID. So you can think of the warehouse as the root of the tree and then everything else kind of dangles off of it. So each warehouse has 10 districts, each district has 3,000 customers, and when you reconfigure the system you are going to move that entire block of data to make sure that you're not increasing the number of distributed transactions. >>: [indiscernible]. >> Rebecca Yale Taft: Yeah, we are using the foreign key relationships to track it. So we are working on E-Store++ right now, which will handle the general case. This will handle cases with many-to-many relationships: graph schemas, social networks, that sort of thing. So stay tuned. All right. So I will take some questions now. Yeah? >>: [indiscernible]. >> Rebecca Yale Taft: No, that's a great point and it's something we considered doing. I mean we didn't end up having enough time cycles to do it in this project, but that's definitely something worth doing, kind of predicting when the spikes are going to happen or when there's going to be an increase in load and maybe add resources a few minutes before that's going to happen and then take them away a few minutes after it's done. That would definitely be a good future area of study, but we haven't done that yet. All right, thank you guys. [Applause] >> Anant Bhardwaj: Okay, it's nice to be at MSR. Thank you for coming. I think this was supposed to be the first slide. This logo was created by Sam and it looks pretty nice. So I'm going to be talking about DataHub, which is a hosted platform for organizing, managing, sharing and doing a bunch of stuff with data. Basically I will go into the history briefly and kind of motivate the problem of why we need a new data management system, what the problems are that exist, the recent trends of changes in data management systems, what we should be doing and what we should not be doing. This is a collaborative project with people from a bunch of different universities. You can see their names, and my name is Anant and I work with Sam. So once upon a time this is what applications used to look like. You had a database server, a bunch of applications, a DBA would give you a database instance and your application would read and write and that's what it would look like. You would also have a data warehouse where you would move a lot of data from your online transaction processing database systems. The reason why you would move it is because those systems are designed for high throughput transactions, and for an analytics workload you want a different architecture like [indiscernible], or you want bulk read and bulk write. So this was the architecture that existed maybe like 20 years ago. Things changed and people started using a bunch of other things like XML and JSON, and the reason why they started using them is not that they didn't know that these systems are not as robust as an RDBMS, or that these systems have to give up ACID and a bunch of other properties, but because they're just super easy to use. Like if you are using JavaScript everything is in JSON and it's awesome that you can write the entire JSON to some document store and can query it. So basically end users, or the ease [indiscernible] with which they could use the data management system, kind of drove the usage of these kinds of data stores.
And things changed, because now that you have XML and JSON you have to do ETL to move the data into your data warehouse, and then there are a bunch of ETL tools to transform and do other sorts of stuff before you can do analytics or business intelligence. Things changed more over the last decade or so with the MapReduce paper by Jeff Dean. So HDFS, or the [indiscernible] file system, came in as the warehouse, where you would just dump your stuff in HDFS, and instead of your ETL you have MapReduce. A lot of analytics and business intelligence would be done by writing this MapReduce code. Again, it's super slow and things didn't work, but it's a good abstraction. For some problems you don't care about speed, just like the examples they give in the MapReduce paper [indiscernible], or the word count, or creating an inverted index. So these are things you can do easily with HDFS. So why do you need a DBMS? You can just move the stuff there. Now people thought we could do more. So they added a bunch of abstractions on HDFS. You have query languages, you have more query languages, you have Spark, which is an in-memory abstraction for doing machine learning and other things. So you can dump all the stuff that you compute from OLTP applications that are directly dumping the data to HDFS. Then you have got this mixed OLTP architecture where people are writing Spark, or MapReduce, or SQL or different things. So this is where we have reached. You must have seen the cloud Data Lake thing. People basically went a step further and they said, "Like why do we need HDFS only for analytics and machine learning? We can just move it all the way down to the real-time processing." And they built systems like Hive, and I don't know if you know about HBase and Cassandra. HBase is built on Google's Bigtable. Cassandra is based on Amazon's Dynamo, where applications can directly write the data. Consistency is compromised a little. So Dynamo is more about eventual consistency, and in HBase you can have only single-record locking. So there are a bunch of [indiscernible], but the point is that applications can directly write to the distributed file system, and now you have got your data management layer having a bunch of these documents in a big lake and you want to kind of make sense of it. So did it solve all our problems? Did it give us a better way to manage all our data? And what are the lessons that we have learned through this whole evolution of data management systems over the last 20 years or so, or 40 years or so? I don't know if you guys went to CIDR a few days ago. There is a great paper I highly recommend reading, which is by the Thomson Reuters guys, and they say, "The value of data is directly proportional to the degree to which it is accessible. As this data grows in size, in type, in dimensionality and in complexity, accessibility becomes of paramount importance." It means you need a very low barrier to entry, but the problem is that you've got all this data and it's very difficult to discover. It's very difficult to link. It's very difficult to make sense of. It's very difficult to do anything for people who are really interested in data, especially end users like social scientists, analysts or other groups of people that are part of your organization. The lesson, I think, if you read this paper, if I have to summarize, is that end users don't care about data models. They don't care about architecture. What they care about is the tasks.
What they care about is whether they can do the stuff that they want to get done. The data model is important, but not initially for end users, because they think about the goal that they have to achieve, the analysis that they have to make, and they will use whatever tool best fits their task. A lot of people, like you guys, have Microsoft Excel and I think there is a huge class of users who love to use Microsoft Excel because it gives a very low barrier to entry for them to do the analysis that they want to do. So can we design data management systems keeping end users in mind rather than thinking too much about performance, or thinking too much about data models, consistency and all the stuff that we care too much about in databases? And who are the end users? Are the end users programmers who write code? I think your data management systems are mostly used by programmers. They write code which writes to and reads from your data management system. Data scientists care about extracting information from the data. Business owners care about the numbers so that they can make decisions based on what charts and graphs you present to them. Data administrators want to make sure that data is accessed only by those people who are authorized to use it, that it's not visible to those who should not be seeing it, and all those sorts of policies and other privacy things. I will take one sample class of user who I have interacted with over the last few months and that class of user is: journalists. So they have these typical problems: they want to write articles analyzing data that's available on data.gov. So the first problem is how to first ingest that data into Excel or anything, but Excel is the only thing that they know. So they care about ingesting that data into Excel. Excel has a good ingest, especially if the data is structured. So sometimes it works, and sometimes it doesn't work when your columns are not delimited properly; you have got some missing delimited values or improperly terminated delimiters. Once they have the data in, how do they query it? You have like 20 different files and how do you kind of link 20 different Excel files? I will tell you one story. So there is a guy called Adam [indiscernible] and the way he makes sense of his 20 files is he will take those files and put them into 20 Excel sheets. If he has to join across those 20 files he will run individual filters on the keys which he wants to join on in those 20 different sheets. Then he will print those 20 sheets with the filtered results and then he will highlight with a color, like if he has to join by key K1 he will basically highlight all those 20 sheets with some color which basically represents K1. Then he will basically trace back by the color and see the results he needs to properly understand. Yeah? >>: [inaudible]? >> Anant Bhardwaj: I do not know his version, but I think it is 2011 or some recent version. >>: [inaudible]. >> Anant Bhardwaj: So I heard there are tools that you get with a SQL license, but I think the kind of Excel that runs on their machines is pretty basic, the bare Excel that you have. They are not even aware of other tools or plugins that they can install. Also, they work through e-mail, like what kind of e-mail to send, and of course how to quickly make sense of it. How do you get a quick summary of all the data?
So I don't know if you can see the e-mail, but it's like, "Hey, I uploaded several CSV files and now I have a big text file, which we cannot ingest properly." So we had a tool for ingesting and he basically tried to copy 400,000 rows into a text box on a web page. It's expected that it will crash, right? A browser is not supposed to handle such a large amount of data. He will say, "Hey, how can I get this ingested, because I want to use some cleaning algorithm or cleaning tool?" This is the first problem, ingestion. Whenever you have any file that spans 20MB or something it becomes hard for them. This is another one: this is the kind of query that they ask, "Hey, I put all your files in DataHub. Can you give me answers to these questions?" They don't know how to write SQL so their questions look like this: which individuals gave this much money to these people, and can you tell me the answers? So these are the real queries that come from journalists and social scientists that use these tools. So we did a small survey of how people use data management systems and these are the kinds of tasks that they have to deal with. We provide query languages like SQL, but unfortunately that's accessible only to a small fraction of the user class. And they get super excited that you can kind of join different data sets. So in DataHub, since it is based on databases, this guy is like, "Hey the commissioner is coming tomorrow and I wanted to show the potential of DataHub. So can you show me [indiscernible] or whatever, some different tables?" Then he says, "It's not shocking that these results look the way they do, but the idea that you can run a simple cross query to generate this is super interesting to the commissioners." So keeping these end user goals and these end user tasks in mind we built DataHub, which is a unified data management and collaboration platform for making data processing easy for small data, or big data or anything. Essentially it has two components: one is a flexible data store where you can put in anything like files, which you can convert into relational databases or other back ends that DataHub has, with seamless collaboration capabilities, because often people want to collaborate on their data sets. They want to collaborate [indiscernible], analyze and do things. Then the App Ecosystem is basically designed for end users so that they can use those apps for doing the tasks that they care about. This is what it looks like. You have a core, which has some back end. It could be an RDBMS, it could be a document store, extendable to other back ends, and then you have this whole management layer, which is collaboration, roles, users, groups, the language support, and depending on the back end it will have some query language. Then on top of it you have this App Ecosystem, which the end users would use for doing their tasks. So if somebody wants to do sentiment analysis they would just basically point it to their data set and say, "Do sentiment analysis on my data." So I will do a quick demo because that is a much easier way to demonstrate what we are trying to build before getting into the technical nitty gritty. Okay, so this is what –. Oh, is there a way I can connect to the internet? [Demo] So, most of the ideas have been taken from GitHub. Many of you might have used GitHub, which is a collaboration platform for code. We have a concept of repositories and inside your repositories anything can live. It can be files, it can be tables or it can be other stuff.
Currently we support only relational database back ends, but you can have any back end like a document store or whatever you want. I don't know if this is visible and I cannot zoom in. So as a new user you basically can create a repository, let's say Microsoft. So you have a Microsoft repository, but no collaborators. If you want somebody else to collaborate, like if I want Sam to be able to do all the stuff that I'm doing, I just add him and he would basically get the same view and can collaborate, add stuff, query and do other things. So once you have a repository you can either run SQL, but usually what people like to do is just get their files in. So you can take some file, so this is real data. Let's take some payroll data. This has a few hundred thousand rows, but hopefully it will load. Once you have the file in you should be able to import it into the back end that you have. It automatically does some basic stuff that you have in Excel. It detects headers and delimiters and other things. You say, "Import," and it appears in your tables. You can have more tables, but many times your files are not going to be as simple as CSVs. It's going to be a complex, messy text file and you should be able to get that stuff in. So once you have files in, we talked about the app center ecosystem, where you would be able to launch any app and would be able to use it for your tasks. Currently we haven't built a great UI for the app center. So we have to go directly to the URL. So, one of the apps that we have built is for ingestion. So for instance you point to a file, like this is a [indiscernible] file, and if you want to ingest this file you give a couple of examples. Whenever you see a line like this I want to extract date, time, from IP address and to IP address, and some of these lines might not have the IP addresses so just extract empty. As soon as you define this you click on refine and it will basically extract everything in a table format and you can put in a table name and save it to DataHub. You can have multiline files, something like this one. This is pretty common in Excel or reports that the government generates. You will have some kind of text and then some data, then some data, then some text and then some data. So suppose you have this kind of file and you want to extract from it. So you basically say that you are going to find some reported crime in Alabama followed by year, crime rate. So whenever you see this sequence I want to extract the name of the state and the crime rate for these 5 years. So you can basically encode multiline examples in your example too, and you just say refine and now it will extract all of those. So this is our ingest app. Once you have your tables in you might want to get a quick understanding of what your table looks like. So we just added this table 2006. Whenever you open this you can just launch DB pipes, which is another app that kind of quickly analyzes entire data sets and gives you a distribution of all the different fields. So if you wonder why the [indiscernible] in the state has these figures, you just select it and you can kind of interactively go deeper and visualize things. The idea here is how you can build an App Ecosystem which end users can use for doing the tasks that they want to do. >>: How do you compare this work with what Trifacta is doing? >> Anant Bhardwaj: Trifacta is a good data cleaning system, but their goal is to clean data. So they are not trying to build a platform –.
So basically I'm showing these apps as an example, but the emphasis is on the whole ecosystem and the platform that allows new people to write apps. This means that you can write apps which would be available to everyone else. So building this app store of data apps is the idea, so that end users can select the apps which fit the task that they want to get done and use them. So Trifacta could be one of the apps in that ecosystem. Yeah? >>: I was just wondering if you took a look at Google's OpenRefine? >> Anant Bhardwaj: Yeah, Google's OpenRefine could also be one of the apps, basically for refining. >>: [inaudible]. >> Anant Bhardwaj: Yeah, so I will show you more apps. So currently I'm just talking about the whole workflow, but end users basically have to do a bunch of those tasks and I'll get into more stuff later. But, yeah, some of these individual point solutions do exist, like where you can put the data in and it does stuff for you. But, we are trying to basically build a platform where you can connect these point solutions and build your own data processing pipeline, so that anyone can get the data analysis done without needing to write complex SQL code or complex scripts. Today you basically launch OpenRefine, which then writes to this database, and then you launch this machine learning app and the machine learning app writes the output to some other database. This is usually how you do stuff today, right? How you can get this done without doing all that is the whole purpose. Okay, so another app is like [indiscernible]. So this is another app called [indiscernible]. You can just run a query and select different visualizations that you might need, then it will [indiscernible] and you can write it back. But, like how do you connect these apps? It's not just like using these apps; sometimes many things are much easier to do by writing some simple code. Currently if you have to do some machine learning or some analysis in R you have to kind of move a lot of data from here to there. And we have built this whole interactive analysis environment within DataHub itself. So I think some of you might have used IPython Notebook and we have our own version of it, built by extending IPython Notebook, so that you can do a lot of cool stuff within DataHub. So this is DataHub Notebook and this should be pretty cool. So you start a new notebook called Microsoft and suppose you want to make some query. So we haven't actually pushed this to the actual DataHub servers so I will use my localhost DataHub version, which is this. So let's look at, say, this table. So instead, maybe test. Okay, so suppose somebody wants to do some cool stuff by doing some query on this table and we will do that. So we have this, you can simply do a [indiscernible] in SQL and it will give you the result, but now you can assign this result to some Python variable called RES. You can now use that, but if you want to draw a pie chart you can simply say [demo]. If you want to do like [indiscernible], but suppose you are not satisfied with the tools that we provide. You are an expert in MATLAB, right. What if you want to use the MATLAB [indiscernible]? This is just some sample code because I don't remember the MATLAB syntax, which is I think this, right; this is how you write MATLAB. So you basically say, "Import this MATLAB [indiscernible]," and before that I have to put some variable.
Then you can say, "Some value is equal to R of 1, and similarly [indiscernible] equal to." So this requires two arrays, and that's it. So basically we took the results from DataHub, assigned them to two Python arrays and did a MATLAB [indiscernible]. But many people are proficient in R. They say, why would I use MATLAB, because I want to use R. So how can you just pass these results directly into R, or Ruby, or Java or Scala? >>: Can you use all the libraries also? >> Anant Bhardwaj: Yeah, you can use all the libraries too. So let's do a percent R. So somebody wants to do something like percent R and say mean of [indiscernible]. It will not work because the value is not defined there, because this is an R context. So you can do this, percent R [indiscernible]. So this means that the variable that I defined in one programming language gets pushed into another programming language, and oops, I made a mistake. And now you've got the answer from R. The way we managed to get this done is that we have defined our own common, language-independent format for result sets, which is also how programmers use it. So we have some programmers who have built real applications using our SDK, and I will show you that in detail, but basically the goal is that now this also lives in a repository, so you can share it with other programmers who can collaborate on it. Then you can drag these pie charts or something into a dashboard and send it to people who make decisions. This is all a collaborative platform. So people who you have shared with will have access to some of this stuff. >>: So what kind of data types do you have that sort of understand [inaudible]? I assume that not all of these environments have compatible [inaudible]. >> Anant Bhardwaj: Yeah, yeah, yeah, so I don't know how many of you have used Thrift or Protocol Buffers. So, Protocol Buffers is a language used at Google for cross-language –. Like, they build their services in any of the programming languages and the common language is Protocol Buffers. So when Facebook basically started their company they thought, why not build something from scratch? So they hired some of the Googlers and they built Thrift, and this is what it looks like. So all the DataHub APIs and data types are defined in Thrift. So basically you can define all the languages that you want to support, like iOS, C++, Java, and then you define all your –. So it supports all the basic data types, which is integers, [indiscernible], arrays, maps and [indiscernible], which basically all the languages provide an implementation of. If you want to generate your binding for a particular language you simply say, "Generate the binding of this interface in my language," for instance if somebody wants to write their whole app in R or Python. So that person will just do –. So which language do you want? Suppose you want Java: you do thrift --gen java on the Thrift file and now you will see a gen-java folder that's created. Suppose you want C++: you generate the C++ binding and now you will see gen-cpp. So basically this allows you to use any library in any language. So people can write their apps in a language of their choice. You just drag and drop your data to the apps that are hosted on DataHub in different languages, and they all use this DataHub connection interface that we have defined. Without any data movement, apps live in DataHub.
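The percent-R demo above is essentially pushing a Python-side result into an R interpreter and evaluating R code there. IPython's standard R magics are built on the rpy2 bridge, so a rough sketch of that underlying mechanism looks like the following; this is not DataHub's own code, and the variable names are illustrative.

    import rpy2.robjects as robjects

    # A result set fetched on the Python side (e.g. from a DataHub query).
    res = [60000.0, 55000.0, 65000.0, 40000.0]

    # "Push" the Python list into R's global environment, then evaluate R code
    # against it -- conceptually what the %R push in the demo is doing.
    robjects.globalenv["value"] = robjects.FloatVector(res)
    mean_in_r = robjects.r("mean(value)")[0]
    print(mean_in_r)  # 55000.0

DataHub generalizes this idea by routing everything through one language-independent result-set format (defined in Thrift), so the same kind of handoff works for Java, Scala, Ruby and so on, not just R.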
They have all this privacy and access control, and you can use the best apps in the world, in any language, for your data without needing to worry about how to deal with it. Currently one of the major problems people face is that their entire app is in Python, which is just awesome, but the machine learning library is in Java; how do you use it, right? But with DataHub we remove that entire problem, because you can use the best [indiscernible] of any language that you feel would work best for your tasks. Yeah? >>: So Microsoft has a system called [indiscernible], which has some of these aspects, like you can have a repository of things and can easily import those things into Excel, but what I think you have done is really the apps and the different languages you can write them in. Then you can share some reports and things like that, but the ability to write apps on this data seems to be the new part. But there is a concept that you can have a repository or catalog of data, you can ingest it into that catalog and you can easily access it from Excel. That system is [inaudible]. >> Anant Bhardwaj: Yeah, I mean this seems to be a problem faced by every organization. So definitely people have built solutions that work within their organization, and our goal has been to provide something that everybody else outside can use. So say some journalist wants to do some machine learning. The problem today is that if a journalist or social scientist wants to do machine learning, he has to hire a few computer scientists to be able to write the machine learning code. Even if you have the open source library available, it requires you to spend a lot of time understanding the APIs. And not just the APIs, but configuring, installing and connecting those APIs through the various iterations and processes, because those APIs require data to be in some form, a data frame or something. Instead, you go to the app center and say, "Hey, I want to use this machine learning app," and the app center would say, "This machine learning app requires –." So for instance if you want to do sentiment analysis it would require a column in your table which has the text field, which is your text, and it will write the sentiment to some other column that you provide. So you just provide two columns, just drag and drop, and say, "Do sentiment analysis," and your sentiment analysis is done. That basically improves productivity for many, many classes of users. And even for programmers, because programmers also spend I think 60-70 percent of their time doing configuration and building this whole data processing pipeline, rather than writing actual analysis code or [indiscernible]. Yeah? >>: How do you decide what to persist, any temporary results? >> Anant Bhardwaj: You can also persist. So basically whatever you've found –. So here you can say, "Persist RES," and that will write it to a table. >>: Okay, got ya. >> Anant Bhardwaj: Yeah, you can do that. So any result that you have found you can also persist. >>: [inaudible]. >> Anant Bhardwaj: Yeah. >>: [inaudible]. >> Anant Bhardwaj: So basically the same way you search for apps on your phone, like, what is the best app that tracks your running? You basically go to the reviews and then try to use it for a day. If you don't like it, delete it and use something else. Similarly you are going to search for machine learning apps and then try one on your data.
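To give a feel for what the drag-and-drop sentiment-analysis app described above boils down to: the user supplies an input text column and an output column, and the app fills the output column with a sentiment label. In the minimal sketch below, TextBlob stands in for whatever classifier the real app would use, the "table" is just a list of dictionaries, and the column names are hypothetical.

    from textblob import TextBlob  # stand-in sentiment classifier

    def add_sentiment(rows, text_column, sentiment_column):
        """Read the text column of each row and write a sentiment label
        into the output column -- the two columns the user 'drags in'."""
        for row in rows:
            polarity = TextBlob(row[text_column]).sentiment.polarity
            row[sentiment_column] = ("positive" if polarity > 0
                                     else "negative" if polarity < 0
                                     else "neutral")
        return rows

    table = [
        {"review": "This product is great, I love it."},
        {"review": "Terrible support, very disappointed."},
    ]
    print(add_sentiment(table, text_column="review",
                        sentiment_column="sentiment"))

The app-center contract being described is essentially this function signature: the app declares which columns it needs and which it produces, and the user only has to map their table's columns onto those slots.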
So for instance suppose somebody wrote a query builder to write queries, where you don't need to write SQL but it will generate a query based on some UI actions that you perform. So I will show you an example; I think that would be a better way. So this is running on an experimental version. We haven't posted it to the actual –. Okay, so suppose you have a table and there is this data in it. Somebody wrote an app that says, if you want to do a query, I give you this interface: you add all the tables that you want. Currently I have only 1 table, these are the columns that I want, and then you basically say, "Add more tables, and add filters or something." If you like it you use it. If you don't like it you might look for some other app that might do the same thing for you. Then whatever result comes out, you persist it and then use a second app which basically takes this as an input and performs the next step. So that's how I think end users would use these apps. They will basically connect these apps to build a data processing pipeline. Does that answer your question? >>: [inaudible]. >> Anant Bhardwaj: Oh, so your app could also be –. So for instance one app could be –. So this would be a great example: we have written some sample code to do visualization. So instead of doing visualization through the UI you basically convert this into an app, and the way you convert this into an app is this: so this is some visualization code that –. I just copy this HTML and JavaScript code and I go to my –. Again, this supports multiple languages, so currently it just prints HTML, right? So what I do is I copy this entire page, any example code that I saw, and say run, which shows the same output that you saw on the web page. Now you can add an input box here which basically takes a table name and does something. It produces some result, which you can persist back. Then pass it back to R, which can do something else. So you can take example code that people have written outside, and connect it through this notebook. You can connect it and then push that to some other language, and that language [indiscernible] can take the output generated from this app. Yeah? >>: I have one question: considering that you're targeting someone who uses 20 sheets in Excel, what are your thoughts on, let's say, the trade-off between flexibility and learning curve? This is great, you are actually providing so much flexibility in expanding the capabilities of setting up a data processing pipeline, but are you shifting the problem from, let's say, the guy who's now an expert with Excel and he now has to sort of start learning how this ecosystem works? What apps would he start using in order to get this process done? >> Anant Bhardwaj: Great question. So basically, does an end user, like a social scientist, need to know this IPython Notebook and all of that? >>: Well I'm not even talking about a social scientist. Just your general instincts on it. >> Anant Bhardwaj: Okay, yeah, so I don't have the Excel app here, but we've also built an Excel app that basically gives you [indiscernible] an Excel view where you can just do copy paste and do all the Excel filtering, and that converts it into underlying DataHub operations. So you basically see the Excel view and do all the stuff in Excel, but under the hood it's doing all the relational queries. So a student, I think in Sam's database project, built an Excel interface for doing all this stuff.
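A query-builder app of the kind described above ultimately just assembles a SQL string from the user's selections (a table, the chosen columns, optional filters). A minimal sketch of that translation; the table, column and filter values are made-up examples, not the actual app's code.

    def build_query(table, columns, filters=None):
        """Turn UI selections (a table, chosen columns, optional filters)
        into the SQL string an end user never has to write."""
        sql = "SELECT {} FROM {}".format(", ".join(columns), table)
        if filters:
            sql += " WHERE " + " AND ".join(
                "{} {} {!r}".format(col, op, val) for col, op, val in filters)
        return sql

    print(build_query("crime_2006", ["state", "crime_rate"],
                      filters=[("state", "=", "Alabama")]))
    # SELECT state, crime_rate FROM crime_2006 WHERE state = 'Alabama'

In the pipeline picture being described, the generated query's result would then be persisted back to the repository, where the next app (a visualization, a clustering app and so on) picks it up as its input.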
So they still see the Excel interface. So basically the app ecosystem becomes rich, where you have apps with different flexibility and different interfaces, because one type of interface doesn't work for everyone. Different classes of users have different expertise and they like different things; you would not want to use Excel for doing these things. You would just use either the Notebook or some query, right, because for you it's much, much easier. So we just wanted to make sure that the app ecosystem is flexible enough that it caters to every class of user. >>: I guess I just wanted to gauge your thoughts on future [indiscernible]. So instead of targeting these different types of people, maybe are we better off targeting one set of people, like for instance just social scientists, rather than intersecting the entire thing with programmers as well, who want much more fine-grained [inaudible]? >> Anant Bhardwaj: So one of the problems is that often these classes of users collaborate. So for instance, as a programmer you might compute all the Bing statistics, maybe by different states: this many clicks happened for this class of users. So you would prefer to write some Python script or something that writes to some table, but somebody else who wants to analyze it would prefer the Excel view, where they can see the data in Excel and quickly draw charts or something. And to enable collaboration between these classes of users we have to support these different facets. >>: So is it fair to say, if I am understanding the value proposition that you're putting forward, basically what you're saying is that the underlying place where you sort of store and manage the data has –. You can argue that HDFS sort of does this too, except the problem is that HDFS doesn't really understand what the data is. >> Anant Bhardwaj: Yeah. >>: And the value that you have here is that you have a framework for basically managing the data and sort of the life cycle of the data, and by picking a data model based on this [indiscernible] that you've done, like saying, "Well, I'm going to find a way to have the system understand the data stored in that data model," and be able to push it to all of these different tools that some people might use, you are creating a platform by which anybody can look at the data that they want and they don't have to start by ingesting text or something like that. It's understood, because it's stored in this environment, what the data is. Is that fair? >> Anant Bhardwaj: Yeah, kind of. So basically we wanted to create it for every end user so that it fits the tasks that they have. Different classes of users have different goals. They have different roles –. >>: Right, but you can also argue that you can just store it in a file system. >> Anant Bhardwaj: Well the problem is –. >>: I think the difference is that by basically saying, "We're just going to insist that all data be understood as part of this serialized format," you're making it so that people don't have to start with an unspecified [inaudible] when they want to use a tool that makes use of this [inaudible]. >> Anant Bhardwaj: Right, because once you have files, then many things that you want to do you cannot do, because either the tools do not understand them or it just becomes hard. Then you lose all those guarantees that sometimes you need from the data. So suppose somebody is building some survey app; they can just build an app, but [indiscernible] which surveys are written to DataHub.
So your app looks like a [indiscernible] management system as a compliant app, and now somebody who wants to analyze that survey will basically open the Excel view and drag and drop some sentiment analyzer app or some machine learning app or some clustering app, which will do things and write it back to some other place that they find easy to collaborate with. >>: Right, and I think you are saying that this is made possible by the fact that you created bridges to this standard serialized format that everybody can go to and from. >> Anant Bhardwaj: That is true, because it is language independent and schema independent. Yeah, it's true. >> Vivek Narasayya: So do you have anything else to show? >> Anant Bhardwaj: Just two or three more minutes. I just have the research challenges. >> Vivek Narasayya: So maybe we can wrap up in a couple of minutes. >> Anant Bhardwaj: Yeah. Okay, so basically the focus for us has been accessibility, meaning a very low barrier to entry, and we wanted to make sure that we provide that very low barrier to entry for all classes of users. That has been the goal throughout this DataHub system. So this was one of the workflows that we saw: you have raw files, you import and refine them, and put them in the database. I did not go into the technical details of how to do forking, merging and all that, because I just wanted to give the high-level picture of what the system means and why we want to build it. The underlying architecture that supports it is this, which you have already seen before. So these are the challenges that we have while building these kinds of systems, because we have to build a new architecture for storing and managing a large number of diverse data sets. Some data sets are going to be in some document store back end or some other back end. Often if a query has to link between these data sets you can create [indiscernible], but it gets complicated if you have data in so many different formats. The second challenge is the infrastructure for hosting a large number of data-processing apps, like how do you make sure that all the privacy and policy guidelines are properly managed, and that the apps are not leaking the data sets and other things? So how do you manage those policies with the apps, because those are third party apps and they can do bad stuff? And automating large parts of the data science pipeline by letting users connect these apps is another challenge, because often apps don't provide good interfaces and they might not fit well with the expertise that users have. And these are the data science and design challenges, like how do you design a good user experience so that all classes of users feel comfortable using these systems and can complete their tasks effectively? And of course expert users are important, because they are the ones who do things efficiently. So they should be able to quickly do machine learning and analysis through the tools that they have been using. Like if somebody uses [indiscernible] or some C# machine learning algorithm, then that person should be able to use that. And of course enabling [indiscernible] to ingest, manage, clean and visualize a large number of potentially unstructured data sets is a challenge. So thank you so much; I can take questions. >> Vivek Narasayya: Let's thank Anant. [Applause]