>> Dennis Gannon: So I'm going to get this thing started. I have some announcements to make. And, first of all, welcome. Thank you all for coming. I know people probably -- since we have a lot of local folks, people will be streaming in as they fight the traffic. But I hope -- we still have a good crowd already this morning. That's great. First of all, critical announcements. We've got -- the breakfast, as it is, is back there. Those of you that are unfamiliar with this room, the facilities are out the door around the corner. So just on the other side of this wall. We have -- we'll have a demo fest this afternoon. There are about five demos I think we've got scheduled, and that's going to be right out in this lobby out here. That will be at 4:30. And then tonight there's a reception. After the demo fest is done, we will go over to the reception area, which is called the Spitfire Grill at our Commons. It's a short walk from here. So if you're parked, you don't need to move your car or anything like that. And I have a map to show you later how to get there. First of all, if there are any questions about anything going on here, you can ask one of our team. Dan is here. There's Dan. Vani. Is Vani here? There's Vani in the back. Wenming is here. There's Wenming. And let's see. Oh, we got Harold. I don't know if Harold is in here yet. No. >>: He's clocking in. >> Dennis Gannon: He's clocking in. Okay. Kenji, is -- Kenji's there. There's Kenji in the back. Okay. And Kristin is right there. And me. I'm Dennis, by the way. So this is being live streamed, and so I wanted to let anybody know -- those of you that are compulsive tweeters, and especially people that are viewing this live -- if you want to tweet comments, feel free to do so. We have people here that will be monitoring the tweet stream, and #eScience14 is the hashtag. There's also -- we have this LinkedIn group, Microsoft Azure for Research.
And so if you want to join that, all you have to do is go on. Now, it's my distinct pleasure to introduce our keynote speaker. Raghu Ramakrishnan is an important pioneer -- yeah, some of you have noticed. Bing is wonderful for getting pictures. And so we've got pictures of Raghu through his career, from being a young assistant professor to being a vice president at Yahoo! to being a thought leader at Microsoft. So these are the stages that we've seen in his career so far. But he's a pioneer in the area of deductive databases, data mining, exploratory data analysis, data privacy, Web-scale data stuff. Lots of data stuff. He's a Ph.D. from Texas. He's an ACM and Packard Fellow. He was a professor at the University of Wisconsin-Madison, Vice President and Research Fellow at Yahoo!, and now he's a Technical Fellow here at Microsoft. And so, Raghu, I'll let you take it away. And someone's going to automatically change this. There it is. Great. >> Raghu Ramakrishnan: Thank you, Dennis. So as I told Dennis, I'm actually grateful they chose not to use the photo with the jacket. All righty. So today I'll tell you about big data in general, but this won't be so much a talk about Azure; it will be more a talk about the field as I see it and how it's evolving. Here at Microsoft, I actually now wear two distinct hats. One, I head an applied research group called CISL, the Cloud Information Services Lab, and I also now run the engineering for the big data teams. Right? So one leg on both sides, I guess. Let's get going here. So when thinking about this buzzword big data, what is it really? I think the best way to wrap your head around it is by thinking of what it lets you do that you couldn't previously do. Okay. So I'll begin by giving you a quick overview of a few applications. And I'm just going to choose some examples I worked with at Yahoo!. Time permitting, I'll say a little bit about other things.
But I'll then segue from that into the tools that enable these kinds of applications. And that will take me to where I believe the frontier is shifting. So if you take the tools that have taken us thus far, MapReduce and things that build on MapReduce, what next? Right? And I'll try and give you a glimpse of some of these. The last -- the very last part of the talk, time permitting, will be about a project called REEF that we have been working on here that we just contributed to open source. It's designed to work with things like the YARN resource manager and the Hadoop ecosystem. So Microsoft is now pretty heavily involved in the Hadoop open source world. Okay. Great. By the way, if you have questions, just put up your hand; we'll talk. Don't hold your questions till the very end. I'd rather talk with you than at you. So feel free. Let's look at the applications. So one of the first things I worked on at Yahoo! was this project called Web of Concepts. And over time at Yahoo! it formed the basis for a product effort called Web of Concepts, or WOO. And similar things are there at pretty much all the Web companies. Here we call it Web of Things. At Google it is the Knowledge Graph. Same thing at the end of the day. The Web used to be this collection of pages, and search was defined as finding those pages that were closest to your search term by some measure such as TF/IDF modified in various ways. Today the game has shifted. Today the expectation is that when you ask for Mumbai, you want information about the city. Right? So users think in terms of concepts or tasks, and the game is to understand their intent at roughly the level that they think about it and to present Web content that is suitably packaged. So conceptually you want to organize information about the Web, information found on the Web, around the concepts that you think people care about, right?
So everything about Mumbai is together, not organized by the page where it was found but pivoted rather on the concept. That's basically the idea. And an index here will be not so much a token-based index but something that's concept-based. Okay? As examples of what this would give you, if you go search for Julia Roberts or pretty much any celebrity, all the search engines will give you some variation of this page. The most important thing to note is that the vast majority of the real estate is based on content that is somehow in a database, that's been put there by retrieving information from all over the Web, scraping it, massaging it, getting feeds, and then splicing all this together. The organic blue links that you're used to, those are the first of the blue links. The rest of it is down there below the fold. Okay? So whenever possible, you'll try to present this kind of result. Why? The click-through rates are much higher. Okay. The information here is pretty much on the Web. What's different about this search result is you have synthesized, you have aggregated information about the concept in question from many different places and presented it all at once. Of course, if they're talking about Julia Roberts, their sister-in-law, there's a bit of a problem here. Right? So figuring out what they mean when they type in the keyword, the true concept, this is tricky. So when you're not sure, you present the ten blue links. Right? You only present something like this when you are really, really sure you know what they're looking for. And part of the way you get to be sure, you use your understanding of the content to present all of these refinements of what they just typed in. Okay? And if they click, then it's a reinforcement that you indeed guessed at what they were looking for. Okay?
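As a toy sketch of the idea just described, you can imagine facts about a concept, scraped from several sources, being spliced into one entity card, with a fallback to plain links when confidence in the intent is low. Everything here -- the knowledge base entries, the intent model, the confidence threshold -- is invented for illustration, not how any production engine works.

```python
# Toy sketch: assemble an entity card for a concept by splicing together
# facts scraped from several sources, but only when we are confident the
# query maps to that concept; otherwise fall back to the ten blue links.
# All sources, scores, and the threshold below are made up.

knowledge_base = {
    "julia roberts (actress)": {
        "born": "October 28, 1967",                    # from a scraped bio page
        "films": ["Pretty Woman", "Erin Brockovich"],  # from a movie feed
    },
}

# Invented intent model: query -> (concept, confidence).
intent = {
    "julia roberts": ("julia roberts (actress)", 0.95),
    "julia": ("julia roberts (actress)", 0.30),        # too ambiguous
}

def render(query, threshold=0.8):
    concept, confidence = intent.get(query, (None, 0.0))
    if concept is None or confidence < threshold:
        return {"type": "blue_links"}     # not sure: present the ten blue links
    card = dict(knowledge_base[concept])  # aggregated from many sources
    card["type"] = "entity_card"
    return card

print(render("julia roberts")["type"])
print(render("julia")["type"])
```

The point of the sketch is the gating: the aggregated card is only shown past a confidence threshold, which is exactly the "only when you are really, really sure" behavior described above.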
So the work on organizing the information on the Web pays off in two distinct ways: Helping the user navigate, or helping you understand their intent, and then delivering the payload, the actual aggregated search result. All right. As another example, if you're looking for Idli restaurants near Ann Arbor -- it's, [inaudible] whatever, a puffed rice cake. The way this is put together, you crawl the Web, you classify pages as restaurant pages, within there you classify pages as menu pages, within there you extract terms that refer to, you know, various kinds of food, and you build an index with each restaurant, recognizing that you already know that this is a restaurant, a concept, and then within that you're looking for entities that are menu items. And now you're building an index on a combination of restaurant and menu item. Right? So now when a user clicks and says Idli near Ann Arbor, they're actually going to get restaurants that have Idlis in their actual menu, the menu contains the string. Okay. All right. Especially as you shift to form factors like mobile where space is at a premium, using the real estate to deliver highly dense information is critical because people don't want to scroll. Let me give you another example. Content optimization. If you take the front page of Yahoo! traditionally it's been curated by editors. Every link you see there is manually placed by an editor. I should say was. Today virtually every link you see is placed there by some algorithm at the point that you click on it. On demand. This shift -- now, there's something I -- there's another project I worked on while I was there. Let's talk about this set of four links in particular. That's called the Today module on the Yahoo! front page. The choice of which is the first link and which are the four links you show, this is the game. Underneath there is a pool of about a hundred links that you could show. Editors curate that pool, but the algorithm selects what to actually show.
The difference in doing this is enormous. The click-through lift by doing this is over 300 percent. Okay? And financially that translates into many, many zeros with a significant leading digit. Okay? So today, over -- I don't know, high-90s -- percent of all Yahoo! traffic goes through pages that are fully optimized in this sense. What does this really take? Right? On this page itself, most of the links are algorithmically selected. What does it take to do things like this? Let me go into a little bit more detail. You take all the articles; the first time you see them you have no clue who's likely to click on them, so you go by doing content analysis. You analyze the articles offline. You look at the historical performance of similar articles. You fit machine-learned models, and so you have an a priori estimate of click-through for a brand-new article. Then you use this to have prior probabilities in estimating the click-through rate when you show this to a certain kind of visitor. You refine these estimates using online statistical explore-exploit algorithms. These are so-called bandit algorithms, named after slot machines. But the idea is, when you have a new article, you show it to someone, and they either click or they don't click. If you show it enough times, you have a pretty good estimate of the probability that the next person is likely to click. However, there's a delicate tradeoff. The articles decay. They are time sensitive. They have a few hours. Right? In those few hours, if you spend all your time estimating the probability, by the time you have a really good sense, it's irrelevant. So you want to spend the vast majority of your time exploiting the most popular articles while at the same time parsimoniously exploring to figure out which articles are growing in popularity, which are shrinking, which are indeed the most popular at any given instant in time.
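The explore/exploit blend just described can be sketched with the simplest bandit strategy, epsilon-greedy: mostly show the article with the best estimated click-through rate, but keep showing the others a small fraction of the time so fresh articles get measured. This is a minimal toy, not the algorithm Yahoo! used; the article names, true click rates, and prior counts (standing in for the offline content-analysis estimate) are all invented.

```python
# Minimal explore/exploit sketch: epsilon-greedy over a pool of articles.
# The prior pseudo-counts stand in for the offline model's a priori
# click-through estimate; the "true" rates are invented for simulation.
import random

random.seed(42)

true_ctr = {"article_a": 0.05, "article_b": 0.12, "article_c": 0.08}

# Priors from the (hypothetical) offline model: pseudo-clicks / pseudo-shows.
stats = {a: {"clicks": 1.0, "shows": 20.0} for a in true_ctr}

def estimated_ctr(article):
    s = stats[article]
    return s["clicks"] / s["shows"]

def choose(epsilon=0.1):
    if random.random() < epsilon:            # explore: give everyone a chance
        return random.choice(list(stats))
    return max(stats, key=estimated_ctr)     # exploit: best current estimate

for _ in range(20000):
    article = choose()
    clicked = random.random() < true_ctr[article]
    stats[article]["shows"] += 1
    stats[article]["clicks"] += clicked

best = max(stats, key=estimated_ctr)
print(best)  # converges on the article with the highest true click rate
```

Real systems replace the fixed epsilon with schemes that explore more aggressively on fresh, short-lived articles and taper off as estimates firm up, which is exactly the tradeoff described above.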
And the mathematics of how to blend explore and exploit is what the statistics behind this is all about. The engineering behind this is a different kettle of fish. You have to show articles to people around the globe on tens of thousands of servers, take the results of something that happened in Shanghai, feed it into your modeling servers in Sunnyvale, and the next person in Singapore, right, is influenced by what the person in Shanghai did. Another butterfly in Brazil? It's real. Okay. So if you take what it requires to do all of this, I'm just going to put up names of some systems that are underneath this. Hadoop. We use Hadoop for everything from the actual queries to extract and analyze the summaries through the data pipelines, and getting enormous amounts of data gathered in one location on the globe and then moved quickly to other locations on the globe. End to end, from the person in Shanghai to the person in Singapore, it's about ten minutes. Okay? Data streams, pipelines, NoSQL stores -- all of these things that you think about as, quote/unquote, big data technology boil down to one thing. You have databases operating, doing the things that databases traditionally do, large-scale queries, large-scale [inaudible] workloads, right, but now at Web scale. Right? And some of the criteria: How important is it to have fully serializable transactions? What kind of availability demands do you have? These [inaudible] have been changed in significant ways, to where a brand-new generation of architectures has come into being. So let's look at that for a moment. What do I mean by Web scale? By the way, the first example, on the Web of Concepts, hopefully the point there was clear. Although I didn't put up a slide showing the underlying infrastructure, you simply cannot build that kind of information extraction at Web scale without, in effect, using technology like Hadoop extensively. Okay. What do I mean by Web scale? Here are some numbers.
I'm not going to talk to them in any detail. But basically a hundred billion e-mails a month, okay, geo-replicated across a dozen zones. I'm just going to say very, very, very conservatively, 10,000-plus servers, right, millions of reads per second of a given page, visits on the order of tens of billions, right, ad serves, ad impressions, same scale. This is orders of magnitude larger than the largest enterprise databases have historically been. Well, it's going to be dwarfed within a few years by what we are going to see from the [inaudible] class of applications, the Internet of Things. And this encompasses not just things like Web sites, it's going to seep into everything you do, right? Biology, environmental science. It's all potentially going to be transformed by the ability to embed sensors of every stripe in everything you do. Your ability to observe is going to be unparalleled. The fundamental difference between a Web site and a traditional application like, say, Word, is observability. If someone sneezes on a Web page, I know how long and how loud. Right? And that's what I can react to. That observability is lacking in a packaged application running on your desktop where no one can view it. Of course some of you are thinking Office 365, no doubt. Yes, we can watch you sneeze. And that means many of these techniques are now going to find their way into traditional enterprise applications. Different story. Internet of Things means, you know, this story of your refrigerator talking to your shopping list app on your phone saying we're out of eggs. Now that you're in the grocery store, get some. This is not science fiction. Your thermostat can sense -- these are commercially available today -- oh, there's no one in the house, drop the temperature. And mostly people come back around five o'clock, so let's crank it up again around four o'clock so it's warm and toasty by the time they're back in. These are all very doable. Right?
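The smart-home temperature scenario above really amounts to a few rules over sensed occupancy and time of day. A made-up sketch, with all setpoints and times invented:

```python
# Made-up sketch of the smart-thermostat logic described above: drop the
# temperature when the house is empty, and start warming it back up an
# hour before the usual five o'clock return. Numbers are invented.

def target_temperature(occupied, hour):
    """Return a target setpoint in degrees Celsius."""
    if occupied:
        return 21      # someone is home: keep it comfortable
    if hour >= 16:     # people mostly come back around five o'clock,
        return 21      # so crank it up from four so it's warm by then
    return 15          # empty house during the day: save energy

print(target_temperature(occupied=False, hour=11))
print(target_temperature(occupied=False, hour=16))
print(target_temperature(occupied=True, hour=11))
```

The hard part, as the talk goes on to say, is not this logic; it is standardizing how thousands of such devices sense, report, and accept commands at scale.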
It's only a matter of standardization, right, so that it can be done at scale and cost-effectively enough for all of us to use, as opposed to just the geeks in Silicon Valley. Right? It's simply a matter of the price curves coming down, and that's a matter of standardization. The numbers here are mind-boggling. In case you didn't notice, by 2020 they're predicting there will be 50 billion devices. Right? The amount of traffic on the Internet is going to be 275 exabytes per day. These are eye-poppingly large numbers. And all this data, what does it mean? We are able to observe data automatically, capture this data in ways we never could before. Prior to the Internet generation, data entry was largely through keystrokes. The volume of useful data you can enter that way is minuscule. Right? Now data capture has become trivial and ubiquitous. Everything is observable, for better or worse. The cost of hardware has plummeted. And our technical ability to build scale-out analytic systems -- which is what I'll talk about for the rest of the talk -- means we can actually do useful things with all this data in domain after domain after domain. And it's that confluence of things, I think, that's made this whole field explosive. Lastly, don't underestimate the power of the cloud. Right? You can have all the technology in the world that lets you operate a farm of 10,000 servers. But if you need to buy those 10,000 servers and install them yourselves, you're going to be a long while doing it. The cloud takes away that last barrier. You can rent on demand. Right? Someone else will run that farm for you. So with that setting of the stage, let's move on. I'll skip this. Here's an example from Microsoft, actually: what does it mean to have an operating system for all the sensors that go into your house? How do we standardize this? As I said, standardization is the next frontier. Already lots of people are thinking about this. It's going to happen. Okay?
So what kind of an architecture do I envision all this will require? First, I think in contrast to traditional database systems, you can't have things siloed: your analytics stores, your transactional stores, and for analytics, many different kinds of back ends. All this increases the barrier to use for a user. You're going to have a digital shoebox. Or some people call it a pool or a lake or whatever metaphor you use. One place where you can put a diverse range of data, be it documents, be it multimedia, be it traditional relations, be it streams, be it graphs, right, and you can put any amount of data in the same place. It will just expand and hold it. And not only hold it, but allow you to retrieve it efficiently. Okay? Second, that efficient part is going to push the frontier in terms of the technology. A lot of the scale-out to date owes to work on parallel databases and to work on parallel distributed file systems like GFS. And the intuition essentially is: spray your data uniformly across a bazillion machines and then shard your computation accordingly, break up your query, for example, into little tiny pieces on the shards of data that that larger query touches. So if you scan a file and the file is on a thousand machines, then your query becomes a thousand little baby queries, each running on a shard of your data. Okay? That simple principle carried us a long way. But now in addition to this spread, there is a depth. If you look at data that's on local disk, all the file systems we talk about essentially stop there. There are three copies on the local disks of three different machines. That's as far as the metadata in HDFS gets you. But now, whether something is on local disk or in main memory or SSD or NVRAM makes a huge difference. Not only performance-wise; these different forms of memory are becoming cheap enough that you can have scads of main memory, for example, loaded onto a machine.
And if you don't make use of it wisely, you will not meet the performance criteria that you need on many of these applications. Furthermore, in terms of the expanse of storage, you don't stop with a single machine, of course. You have remote storage. You have tape. You have all kinds of storage that are farther away, slower, but much cheaper. Right? This means, if you really want to store everything without worrying about the cost, your buffer management becomes crucial. You're now going to see buffer management on steroids. You know, very far or very near, and across tens of thousands of machines. This is the challenge of tiered storage. Right? And, oh, one last little detail. Historically database buffer managers have been tied crucially to one class of workloads, SQL queries. Now, as I'll describe in a moment, you're going to have a plethora of analytics, from graph processing to streaming to machine learning to SQL queries, and you need a buffer manager that can take you a long distance towards all of these workloads. So as a database person, you know, my cup floweth over. This is lifetime job security. Okay. Of course, however beautiful your store, people will always have good reasons for keeping their data in other places as well. Maybe on-prem. Maybe in some specialized store that they're locked into for whatever legacy reasons. It's up to us to make access -- maybe not every bit as efficient, but at least functional access -- available to all data no matter where it is. Now, on top of it, from the end users' point of view, they care about many different programming paradigms. SQL is ubiquitous; if you take things like Hive and Pig, they're just variants of SQL, in my opinion. MapReduce has gained a lot of mind share. But let's talk for a moment and think about MapReduce. How many of you know about MapReduce already? Okay. Great. So I can be really brief. MapReduce is just SQL's group by. At a very surface level, that is indeed true. But there's a bit more to it.
First, when you group the underlying rows in the group, if you preserve them as opposed to distilling them through an aggregate, that is the map step. And, subsequently, within each group, rather than applying one of a predefined set of aggregates, if you can write arbitrary user code to run against a partition, that's reduce. So MapReduce and SQL group bys indeed are very similar. The language construct owes, of course, to lambda calculus. But if you take all of this, you're still missing the essential contributions of MapReduce, which I think are twofold. First, an enormous range of user code can be effectively parallelized in this manner. That's a deep insight. Right? For the class of things people did at the Web search companies -- AltaVista, Bing, Inktomi, right -- this was well known. The genius of the MapReduce abstraction was to make it broadly available. Second, when you parallelize at the scale of thousands of servers -- traditional database systems didn't go above tens, right? -- the scheme of restarting whenever any one part fails is not good enough. You have to plan for failure. So failure-centric architectures were the other big contribution of MapReduce systems. But language-wise today, MapReduce is folded into SQL extensions like Hive or Pig or, here at Microsoft, something called Scope. Okay? And if you look at the statistics, 95, 98 percent of all MapReduce queries at Facebook, at Yahoo!, they're really Hive or Pig queries that have been translated into MapReduce. So end users using MapReduce directly is a vanishingly small case. Right? The revolution has been somewhat different. But those 2 or 3 percent where you still have to default to MapReduce do turn out to be significant, which is why all of these languages support essentially MapReduce in slightly different syntax. Okay. Coming back. Stream processing. Realtime analytics is now again becoming mainstream. Things like roll-up, drill-down BI. It's a $2 billion business for Microsoft alone. Right?
As a market, it's much bigger. Machine learning is growing explosively. Right? When you have so much data, everyone kind of understands, hey, human analysis, manual analysis can't cover all of it. What can we teach machines to look for on their own? If you take this diversity of analytics over all the variety of data that you expect to store here, the underlying computational substrate, how do you build it? Are you going to build a different stack for SQL, for stream processing, for machine learning, or can you share some steps in between? Right? This compute fabric is an area where there's a lot of research going on today. Okay? And I'll say a little bit more about it. And this is the new look at virtually any big data player today. You'll see some variant of this slide in how they think about the space. Okay? It's not unique to us. So here's a link, actually, where you can go look at what many, many people are doing in this space. All right. In the second half of this talk, I'll talk a little bit more about that compute fabric. And essentially here's the question I'll try and address. Given that there are enormous amounts of very diverse data being analyzed in a single application, mind you, right -- at different stages in the pipeline, I use SQL, I use machine learning, I use streaming systems, I use graph analytics -- what's the system underneath? Am I going to see a single SQL box that all of a sudden does streaming, does machine learning, does this, does that, everything, or am I going to see a whole federation of systems built completely standalone which I mix and match on my own dime? I think either of these is unrealistic. The first will simply not allow for the agility that we see here in this space. And the systems will simply not be usable. Right? SQL by itself was being criticized for having everything but the kitchen sink. When you take all of these other things and slap them all together, it's unmanageable.
The second alternative, go ahead and build your stacks by yourself in each domain -- I mean in each type of analytics -- and let the end users figure out how to mix and match. End users will balk. They cannot. Right? You need to make interoperability across, say, SQL and machine learning a first-class design consideration and make the edges seamless. That means the underlying computation fabric needs to satisfy some criteria. The intermediate computation in a SQL query must be something you can pipeline to a MapReduce step, for example. The iterations in machine learning, you must be able to seamlessly pass along. So these kinds of considerations, where do they lead us? What I think is going to happen is an evolution of what we are already seeing in Hadoop with YARN -- how many of you have heard of YARN or Mesos or -- okay. Fewer people. So let me speak briefly about this. If you take MapReduce, the original implementation of MapReduce in Hadoop, which was an open source implementation of the GFS MapReduce stack from Google -- Hadoop was largely done at Yahoo! -- the original implementation of MapReduce was monolithic. Right? The programming paradigm of MapReduce, the implementation, all the way down to the bare metal, it was one composite thing. Then along came Pig, which was Yahoo!'s variant of SQL. In parallel there was Hive, which was Facebook's version of the same thing. These higher level languages, they had a choice: Do we implement them from scratch? Oh, boy, it's going to take a lot of effort. Right? What if we just translate Pig queries into MapReduce programs? That was the original implementation approach. They simply translated it and then executed the translated program. Now, there's a price here. If you take a Pig program and translate it into MapReduce, versus taking the same Pig program and, as a human being, rewriting it in MapReduce from scratch, the difference between these was in the original days easily over a factor of 10 in performance.
Today it's still a nontrivial tax you pay. Not as big. It didn't matter. Right? As I told you, from the very earliest days, people started using these higher level languages more than MapReduce, to the point where today the usage is in the high-90s percent of MapReduce programs being translated. Which led to the obvious question. Maybe we should think about implementing languages like Pig and Hive natively. Then again, do we need to start from scratch? Are there things that are common? Yes, there are. So think about how these programs work. A user submits a job. You know that the job is going to be broken up into baby jobs running against parts of the data. To do so, you're going to have to go to the underlying cluster and say, hey, give me compute containers with some constraints. So let's take a very simple example. I want to scan a very large file and compute the average of some column in that file. Right? So my data is broken up across a thousand machines, and what I'm going to go and say is, hey, give me a slot on each of these thousand machines, so each slot I can use to compute the local aggregate and then I'll sum them all up. So first you go negotiate. You get the resources. Then you install the particular executable for your query in each of these thousand slots. And then off you go. What you put into those thousand slots is specific to this particular query, more generally to SQL. The SQL engine is something that does this. But the actual negotiation with some owner of the underlying resources, this is common. This realization led to resource managers like YARN, which was the Yahoo! version; Mesos, the Berkeley version; Corona, the Facebook version; Omega, the Google version -- I mean, there's a pattern here, right? Everyone saw the same writing on the wall. And the net of it is now the resource managers form a common substrate. No one has to do resource management separately. They sit on top, get their resources.
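That negotiate-then-compute pattern, using the scan-and-average example from above, might be sketched like this. The ResourceManager class here is a made-up stand-in for something like YARN, and the shard data is invented; the point is only the shape: ask for one slot per machine, run a little "baby query" in each slot to get a local partial aggregate, then combine the partials.

```python
# Toy sketch of the pattern described above: negotiate for a slot on each
# machine holding a shard, run a local partial aggregate (sum, count) in
# each slot, then combine the partials into the global average.
# The ResourceManager here is an invented stand-in for YARN/Mesos.
import random

random.seed(0)
NUM_MACHINES = 1000
shards = {m: [random.uniform(0, 100) for _ in range(50)]
          for m in range(NUM_MACHINES)}       # the column, spread around

class ResourceManager:
    """Stand-in for a YARN-like resource manager handing out slots."""
    def allocate(self, machines):
        return list(machines)                 # pretend every request succeeds

def baby_query(shard):
    """The per-slot executable: a local partial aggregate."""
    return sum(shard), len(shard)

rm = ResourceManager()
slots = rm.allocate(shards)                               # 1. negotiate
partials = [baby_query(shards[m]) for m in slots]         # 2. run per shard
total = sum(s for s, _ in partials)                       # 3. combine
count = sum(c for _, c in partials)
average = total / count
print(round(average, 1))   # global average of the column
```

Note that only `baby_query` is specific to this query; the allocate-install-run machinery is the common substrate that the resource managers factored out.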
From there on, they do their own thing. Okay? This also provides some other advantages. If you have a multi-tenanted cluster, priority can be set by whose needs are most important regardless of whether they're issuing a SQL query or a machine learning query. You can interleave very different types of computation. You can have heterogeneous [inaudible] pipelines all executing on the same cluster. There is no notion of these machines do only SQL; those machines do only machine learning. Right? Lots of things followed. Okay. So that's resource management. Let's talk about YARN very, very briefly. I said some of this already. Right? Multiple kinds of engines can share the same pool. And I use the word container or slot interchangeably. Right? If you take this, the underlying algorithms for scheduling, deciding whether Danny's request trumps Raghu's request -- it should, of course -- this is a lot of research. Okay? Now -- I need to have a technical slide here. Come on. I'm not wearing a coat. So here's an example of some work that we actually did on preemption. And I'm just going to give you a brief feel for this. It will also help me illustrate some of what goes on in MapReduce. So MapReduce works in waves. First you take your data, you partition it in a way that's appropriate to your problem using the map step. The mappers all execute as independent tasks. When they are done, the reducers come along and take each partition and do something else, and what you are left with is a collection of partial results which you sum together and you get your final result. Let's look at this graphically. Here are the mappers; each line here is a mapper that starts here and ends here; this mapper begins, ends; this one begins, ends; and so on. These are all mappers. Each row here is a particular machine and what's going on on that machine. Okay? This of course is time. These are stragglers.
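The map-then-reduce waves just described, and the earlier point that map preserves rows into groups while reduce runs arbitrary user code per partition, can be sketched as the classic word count. This is pure-Python pseudocode for the data flow, not Hadoop itself, and the documents are invented.

```python
# Sketch of the map -> shuffle -> reduce waves described above, as the
# classic word count. Map emits (key, value) rows; the shuffle groups
# them like a SQL GROUP BY; reduce runs arbitrary user code per group.
from collections import defaultdict

documents = ["big data big ideas", "data pipelines", "big pipelines"]

# Wave 1: mappers run independently, one per input split.
def mapper(doc):
    return [(word, 1) for word in doc.split()]

mapped = []
for doc in documents:              # each document stands in for one split
    mapped.extend(mapper(doc))

# Shuffle: group the emitted rows by key, preserving them rather than
# aggregating yet -- this is the "group by" half of the story.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Wave 2: reducers run user code against each complete partition. A
# reducer can only emit once its group is complete, which is why one
# straggling mapper stalls everything downstream of it.
def reducer(key, values):
    return key, sum(values)

counts = dict(reducer(k, v) for k, v in groups.items())
print(counts["big"])
```

The comment on the reduce wave is the crux of the straggler story that follows: the hard wall the reducers hit is the requirement that their group be complete before they can produce output.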
It's well known that the [inaudible] of map jobs in a map phase have a huge impact on overall performance. Let's see why. In standard MapReduce -- in all of these systems, by the way, to begin with, preemption was not part of the story. So once you allocate a slot to a task, you simply wait till the bitter end until that task completes and returns a result to you. So in that regime, let's see what happens. You allocate all these slots, the map steps of a job. They're all going to go on. Some of these finish before others, so you can reuse some of those slots, right, so some of these rows are overlaid on some of these rows. Okay? In fact, on some of these machines. The reduce step starts. But now somewhere here let's say those reducers get blocked because they're waiting for the straggler to complete. Remember, in a group by, if you're consuming the result, you need to be sure your group is complete. When you get there, you hit a hard wall. In sorting, the exact same story. You guys know what I'm talking about. So here these machines lie fallow. Why? The choice is to nuke all these jobs and start over -- which they do sometimes; it has its own consequences -- or you just wait to be unblocked. When finally the straggler completes, you can resume, and then the remaining reducers are scheduled as well and the job eventually completes. Now, this red area starkly illustrates the impact of a straggler in the absence of preemption. Okay. So you can begin to see the subtleties under the covers. Okay. Now let's look at this option instead. If you could preempt, if you could say at this point, you know, save the state efficiently and use these machines to do something else because you're blocked on these particular tasks, what could you do, right? If you were to do this, you could continue, you could use -- you could get some of the other reducers going on the same physical machines here, right, those -- you know, I wish I [inaudible]. In that case I will have to use this.
But here. If you look at these reducers, they're running on these machines which have now been paused on these reduce tasks. You get what I'm saying. And therefore, instead of sitting on your fanny, you could actually be doing some useful stuff, which means the whole thing completes sooner. Okay? There is another assumption here: that we can efficiently checkpoint the work done by these guys. So you need a checkpoint mechanism. But if you had that, you get a significant improvement in performance. Right? This is the kind of stuff that's going on as we speak. These systems, for all their impressive characteristics, there's a lot of traditional stuff that they need to incorporate, let alone break some new ground, in many ways. So this particular scenario we showed, it had a one-third improvement. All I want to say in the next few slides is, you know what, all of this work we actually contributed to Hadoop open source, Apache open source. Here are the [inaudible] for those of you who are interested. This is just to kind of make the point that Microsoft is serious about giving back to open source. So we are very much involved in Hadoop both as a consumer -- we offer a product called HDInsight, which is similar to EMR, right? So we consume from open source. And, as you saw in this example, we also give back. In fact, several of the people who are Hadoop committers work at Microsoft. All right. Switching gears now. I've taken you all the way through YARN. What next? Well, let's follow the trend. I now want to use YARN or a resource manager like YARN to build SQL, to build machine learning. Have I gotten everything in common that I could? From here on, do I have to build custom versions of each stack? Hopefully not. So the REEF project was an attempt to push the common substrate one level higher. There are a lot of things: for example, if I have a thousand tasks, which of them died? Just monitoring them can be a chore. Restarting automatically can be a chore.
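[Editor's note: the checkpoint-and-preempt idea described above can be illustrated with a toy sketch. This is hypothetical code, not the actual Hadoop implementation; a real system would persist state durably rather than in a dict.]

```python
# Hypothetical sketch of checkpoint-based preemption: a blocked reducer
# saves its partial state, frees its slot for other work, and later
# resumes from the checkpoint instead of being killed and restarted.
checkpoints = {}

def checkpoint(task_id, state):
    # Persist partial work so the slot can be reused (a real system
    # would write this to durable storage, not an in-memory dict).
    checkpoints[task_id] = dict(state)

def resume(task_id):
    return checkpoints.pop(task_id, {})

def reduce_with_preemption(task_id, values, blocked_after):
    # First run: process what we can, then get preempted while
    # waiting on a straggler mapper.
    state = {"sum": 0, "seen": 0}
    for v in values[:blocked_after]:
        state["sum"] += v
        state["seen"] += 1
    checkpoint(task_id, state)      # slot is now free for other tasks
    # ...later, the straggler finishes and we resume where we left off.
    state = resume(task_id)
    for v in values[blocked_after:]:
        state["sum"] += v
        state["seen"] += 1
    return state["sum"]

total = reduce_with_preemption("r1", [3, 1, 4, 1, 5], blocked_after=2)
# total == 14; no work before the preemption point was redone
```

The alternative without checkpointing is exactly the "nuke and start over" option from the previous slide: all the work before the block is thrown away.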
Checkpointing can be a chore. How many of these things can you package into a collection of libraries that are common? What are the further, deeper benefits of doing this? That's what the REEF story is about, and I'll try and walk you through that. Okay. Again, this is a Microsoft project done in CISL, now going on in the big data team as well, and again we have open sourced it through Apache. So let's take a concrete example of machine learning. User activity modeling. You want to find out what a particular user likes. Okay. And you infer this by looking at the pages they browsed, the queries they asked, and the ads they ignored or clicked. Mostly people ignore ads. If you look at the number of pages per user, a few tens. The number of possible pages there are in a typical Web site, millions. Right? The number of queries that users collectively ask, hundreds of millions; what a given user actually asks, relatively few. The point being, these are highly sparse. But these are the observations you have to work with. How do you learn from this? Imagine this time window in which the user's activity is overlaid, and you gather this from various kinds of logs, Web logs or search logs. And in any given window, you look at the things that you can observe. Oh, the user just issued this kind of query or visited this kind of Web site. Based on that, what are they going to do next? That is the crucial question. This is what user activity modeling is going to provide for you. Features that, given this, will allow you to predict this. Okay? So if someone visited finance and issues a query about the stock market, chances are good they'll sign up for E*TRADE. Okay? I am oversimplifying, but you get the idea. How do you build such models? You take your logs and slide this window step by step, and each slide of the window, right, gives you a case to learn from.
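[Editor's note: the sliding-window example formation described above can be sketched concretely. This is an illustrative toy, not the actual Yahoo! pipeline; the event strings are invented for the example.]

```python
# Illustrative sketch of sliding-window example formation: each slide
# over a user's event log yields one training case consisting of the
# features observed in the window and the action that followed it.
def extract_cases(events, window=3):
    cases = []
    for i in range(len(events) - window):
        features = events[i:i + window]   # what we observed
        target = events[i + window]       # what the user did next
        cases.append((tuple(features), target))
    return cases

# A toy event log matching the finance/E*TRADE example in the talk.
log = ["visit:finance", "query:stock market", "visit:finance",
       "click:etrade-ad", "visit:sports"]
cases = extract_cases(log)
# cases[0] pairs the three observed events with "click:etrade-ad"
```

At Web scale this loop runs over terabytes of cleaned logs per window, which is why the plumbing dwarfs the modeling.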
So the very first step is to gather all these logs, clean them up, deal with things like robot clicks and winnow them out of the way. And then slide, slide, slide, each time extracting a row in your training database. This is a ginormous task, and it is preparatory to any kind of modeling. Okay. And if you look at the times involved, acquiring the data takes several hours. These are somewhat old Yahoo! numbers, but you'll see very similar numbers from Bing or from Google. Feature and target generation, right: roughly, each feature window has about a terabyte. This work, extracting the cases, takes about four to six hours. The actual model training, what we write papers about, takes one to two hours. And not just for a single model; while you're at it, you may as well build hundreds of models, okay, because you're going to evaluate them all through bucket tests. Just think about this for a moment. When you think about the nexus between big data and analytics or machine learning, this slide is something that should be burned into your soul. Okay? Machine learning academically is all about the algorithms. Machine learning for real is all about the plumbing. Right? And this is why having your scalable data management systems be seamlessly integrated with your machine learning frameworks is crucial. Okay. Let's look at this in a bit more detail. One of the things I looked at at Yahoo! was phishing, mail spam, and the like. So here's an example about a spam filter. You have your inbox, and the algorithms, the spam filters, try to take a given user's mail and fork it into real mail and spam. Okay? The user actually sees this through the mail front end. Okay? Occasionally you screw up. You take spam and put it in the inbox. Even more occasionally someone will complain. They'll mark that as spam. And that's your signal to learn from. Okay. Now let's fast-forward. What does it take to learn from that?
You need example formation, modeling, evaluation, the same cycle you saw earlier. And if you look at this -- I'll start moving a little bit faster here -- the example formation can really be done reasonably efficiently using systems like Hadoop. It's really a gigantic join between the mailbox and the spam signals. You can do this work. Hadoop helps you do this work. Okay? If you take the modeling step, different story. Modeling is not so easy. Modeling is massively iterative, and MapReduce doesn't really support iteration. In between iterations you need to write to disk. So nowadays you have seen things like Spark and Shark that try to keep things in memory to avoid writing to disk in between iterations. Okay? The good news is MapReduce supports one of the basic models for machine learning, the Statistical Query Model of Michael Kearns, pretty well. Other models don't map so well to MapReduce. Net of it, this iterative cycle -- even a single model is iterative, but there's an outer loop: a prior model, observe, update, try again; do different kinds of features, try again. This whole iterative cycle is not really supported in MapReduce. So going a little bit faster, the net of it is what I just told you. It can be used, but not easily, not efficiently. What has this led to? People cheat. People are infinitely creative. They use map-only jobs, no reduce step. What the heck is going on here? They're essentially getting threads from a ginormous thread pool and having a party. Okay? You can do anything you want: just give me the resources and get out of the way. But if you do this -- I hate animations. Okay. These are examples. For those of you familiar with machine learning, there are things like (All)reduce and decision trees which also can impose a complex communication structure across many tasks. You could design your own. Each of these is a map-only task. You establish your own communication channels. If one of those boxes dies, you're in doo-doo. You deal with it.
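[Editor's note: a toy in-memory training loop makes the "massively iterative" point concrete. This is gradient descent on one weight, an illustration only; it is not the talk's actual modeling code.]

```python
# Toy iterative model training (gradient descent on a single weight):
# each pass refines the model using the previous pass's state, which is
# exactly the state that plain MapReduce forces to disk between
# iterations and that Spark-style systems keep in memory.
def train(examples, iterations=100, lr=0.1):
    w = 0.0                      # the prior model
    for _ in range(iterations):  # the iterative cycle
        # Observe errors on the data, then update the model.
        grad = sum((w * x - y) * x for x, y in examples) / len(examples)
        w -= lr * grad           # state stays in main memory
    return w

# Learn y = 2x from a few noiseless examples.
w = train([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
# w converges toward 2.0
```

A hundred iterations of a loop like this is trivial in memory; as a hundred chained MapReduce jobs, each writing its model to disk and reading it back, it is painful, and that is what drives the map-only abuse described next.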
All of these things that MapReduce supposedly gave you as part of the framework you do all over again, because you're abusing the system to make it jump through hoops it wasn't designed for. Okay? So all said, let me cut to the chase here. Yeah. I could go on through more and more of these examples. All this leads to unhappiness on both sides. The people who do this kind of abuse -- not because they want to be abusers, but because they don't have options -- have a fault tolerance mismatch, so they have to redo fault tolerance or live without it. The resource model: they're just building their own; there's no notion of a tree, for example; their integration with MapReduce essentially treats them as map-only jobs. They have to deal with networking, cluster membership, bulk data transfers. Pretty much all of the work. For the cluster, life is not so great. For example, the network usage patterns in these kinds of applications are very different. This leads to problems. [inaudible] consider what happens when you need, say, gang scheduling. If you need a collection of nodes to be given to you in order to proceed -- this happens, for example, in Giraph, the graph processing system -- what does Giraph do? It gets resources one by one from the underlying resource manager until it has enough. Meantime, it squats on the resources it has. This means the utilization of the system [inaudible]. Everyone else is affected because you have given someone resources piecemeal when their requirement is all or nothing: give me at least a hundred. Right? So how do you deal with all of this? What we really need is this intermediate stage that complements YARN, right, to facilitate development of applications on top of YARN and lets you build pipelines of these different kinds of computations. So let me illustrate this with a concrete example, and then I'll conclude. In this example, the job driver is really the controller in the example program. The activity is the user code.
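[Editor's note: the gang-scheduling mismatch described above can be sketched as an all-or-nothing grant. This is a toy policy sketch, not YARN's or Giraph's actual scheduler.]

```python
# Sketch of the gang-scheduling problem: handing out containers
# piecemeal lets a job squat on resources it cannot yet use, while an
# all-or-nothing grant keeps the pool free until the whole gang fits.
def grant_gang(free_containers, needed):
    # All-or-nothing: either the full gang is granted or nothing is,
    # so partially held containers never sit idle.
    if len(free_containers) >= needed:
        return free_containers[:needed], free_containers[needed:]
    return [], free_containers    # decline; nothing is squatted on

gang, remaining = grant_gang(["c1", "c2", "c3"], needed=2)
# gang == ["c1", "c2"]; "c3" stays in the pool for everyone else
gang2, remaining2 = grant_gang(remaining, needed=2)
# gang2 == [] -- the request is declined rather than squatting on "c3"
```

Giraph's one-by-one acquisition is the opposite policy: it holds "c3" while waiting for a fourth container that may never come, and cluster utilization suffers.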
This is what you would execute in a MapReduce task, your actual code running there. The evaluator is simply the container; it's a REEF term just saying it's a REEF-controlled evaluator. Okay? With this, let's consider the simplest of tasks. Let's say you want to do distributed shell. Run the same command on two different machines. Here's what would happen in a scenario with REEF. You want to run this command. You'd first come to a cluster that knows about Hadoop, YARN, REEF, all these things. And when you come in, your client is executing, trying to do distributed shell. You submit your job. The first thing that's going to happen is you will launch a REEF container on one of the machines. The very first thing that REEF container will do is launch this special program called the job driver, which includes your code to orchestrate the logic of your program. The job driver now negotiates on behalf of the job, right, with the resource manager. The negotiation results in tokens it can use to create tasks. It will get containers through the resource manager, and on these it will install some libraries, it will install the actual user code, right, your task, and it's able to run this task on this machine. Imagine the same thing happening on all the other machines in parallel. Okay? Now your command is running on these two machines. And instead of running dir, you could be running anything at all you wanted across these two machines. Or 20,000 machines, for that matter. These are going to now communicate with the job driver through a regular heartbeating mechanism, which primarily allows the job driver to take over the mundane task of babysitting these bazillion tasks, restarting them should one of them fail. And the restart may happen on a different machine. Because if a task fails, the job driver will detect this, go back to the resource manager, get another container, restart the activity.
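[Editor's note: the heartbeat-based babysitting described above can be sketched as follows. This is a hypothetical illustration, not REEF's actual API; the class and method names are invented for the sketch.]

```python
# Hypothetical sketch of a job driver babysitting tasks via heartbeats:
# a task that misses its heartbeat is presumed dead, and the driver
# restarts it in a fresh container, possibly on another machine.
class JobDriver:
    def __init__(self, tasks):
        self.last_beat = {t: 0 for t in tasks}
        self.restarts = []

    def heartbeat(self, task, now):
        # Each running task reports in regularly.
        self.last_beat[task] = now

    def check(self, now, timeout=3):
        for task, beat in self.last_beat.items():
            if now - beat > timeout:       # task presumed dead
                # In a real system: ask the resource manager for a new
                # container and reinstall the task there. Here we just
                # record the restart.
                self.restarts.append(task)
                self.last_beat[task] = now

driver = JobDriver(["shell-1", "shell-2"])
driver.heartbeat("shell-1", now=5)   # shell-2 never reports in
driver.check(now=6)
# driver.restarts == ["shell-2"]
```

The point is that this bookkeeping lives in the driver once, rather than being rebuilt by every map-only application that abuses the cluster.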
If you want the state that was lost to be available, you will explicitly use one of the checkpoint commands to save the state durably across machines, and that will then be available when the job is restarted. So in a nutshell, these are all capabilities that make it possible to write, say, a SQL engine or a new implementation of Giraph or a new implementation of a machine learning algorithm at scale. One last thing, and this is a really, really, really important thing. Okay. Let me get past the -- ah. Here. So let's say this particular task completes. You still have the evaluator. So REEF essentially acts as the middleman between you and the cluster manager. And when you complete one of the tasks you have installed in a container, you have the opportunity to install a follow-up task. So think iterations in machine learning. After one iteration completes, your container is available with metadata about what state you have left over from that iteration. You can install the code for the next iteration. The data never goes to disk and comes back. You're iterating in main memory. This by itself, either doing this or running in something like Spark, gives you a 30X improvement in performance. Okay? Now, when you pull all this together, what you effectively have is the REEF system. In the interest of time, speaking of checkpoints, I'm going to checkpoint right here. Okay? For those of you familiar with Rx, there's a lot of similarity between Rx and the underlying APIs. Net-net [phonetic] -- if you're interested, all of this is available through GitHub. Send me or Dennis mail and we'll send you a link. Let me pause here. Today the main message I want you to take away: over the next five years, everything is going to revolve around data. The kind of data is diverse. Right? It could be data that's very specific to what you're doing. But what is universal is our world is becoming data driven.
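[Editor's note: the container-reuse idea above, installing a follow-up task into a still-live evaluator, can be sketched like this. A hypothetical illustration of the concept, not REEF's actual evaluator API.]

```python
# Sketch of container reuse across iterations: when a task finishes,
# the evaluator (container) stays alive with its state in main memory,
# and the driver installs the next iteration's task into the same
# container instead of spilling state to disk and starting over.
class Evaluator:
    def __init__(self):
        self.state = None            # leftover state lives in memory

    def run(self, task):
        # Each installed task receives the leftover state of the last.
        self.state = task(self.state)
        return self.state

def iteration(prev):
    # Toy follow-up task: count iterations run in this container.
    return (prev or 0) + 1

ev = Evaluator()                     # container allocated once
for _ in range(3):
    ev.run(iteration)                # install follow-up tasks
# ev.state == 3, with no disk round-trip between iterations
```

Avoiding the per-iteration disk round-trip is the same mechanism behind the roughly 30X improvement mentioned for in-memory systems like Spark.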
And this is going to require us to develop systems to manage data of a diversity and a scale that's unprecedented and to provide analytic tools -- here's one other point. If everything is data driven, a corollary is that domain scientists -- who care about the domain science, not about learning ever more complicated versions of SQL and this and that -- are going to be using these systems. That means you need to build domain-specific languages. You're going to see a plethora of analytics, is my guess. So having the kitchen sink is not an option. Having many tailored systems, if that's where you're going, these tailored systems inevitably will have to do some part of their work in concert with other systems. Right? Everyone, for example, will have to live with SQL. Right? How do you support these kinds of diverse analytic engines on top of a common compute fabric, a common caching capability, a common distribution, checkpointing, fault tolerance, right? All the major players are thinking about this in the cloud -- Google, Amazon, Microsoft, they're all in the game. It's a fun time to be working on this kind of stuff. So I'll pause there. >> Dennis Gannon: Let's thank Raghu. [applause]