>> Dennis Gannon: So I'm going to get this thing started. I have some announcements
to make. And, first of all, welcome. Thank you all for coming. I know people
probably -- since we have a lot of local folks, people will be streaming in as they fight
the traffic. But I hope -- we still have a good crowd already this morning. That's great.
First of all, critical announcements. We've got -- the breakfast, as it is, is back there.
Those of you that are unfamiliar with this room, the facilities are out the door around the
corner. So just on the other side of this wall.
We have -- we'll have a demo fest this afternoon. There's about five demos I think we've
got scheduled, and that's going to be right out in this lobby out here. That will be at 4:30.
And then tonight there's a reception. After the demo fest is done, we will go over to the
reception area, which is called the Spitfire Grill at our Commons. It's a short walk from
here. So if you're parked, you don't need to move your car or anything like that.
And I have a map to show you later how to get there.
First of all, if there's any questions about anything going on here, you can ask one of our
team. Dan is here. There's Dan. Vani. Is Vani here? There's Vani in the back.
Wenming is here. There's Wenming. And let's see. Oh, we got Harold. I don't know if
Harold is in here yet. No.
>>: He's clocking in.
>> Dennis Gannon: He's clocking in. Okay. Kenji, is -- Kenji's there. There's Kenji in
the back. Okay. And Kristin is right there. And me. I'm Dennis, by the way.
So those -- this is being live streamed, and so I wanted to let anybody know that those of
you that are compulsive tweeters, and especially people that are viewing this live, if you
want to tweet comments, feel free to do so. We have people here that will be monitoring
the tweet stream, and so it's #eScience14 is the hashtag. There's also -- we have this
LinkedIn group, Microsoft Azure for Research. And so all you have to do, if you want
to join that, is go on.
Now, it's my distinct pleasure to introduce our keynote speaker. Raghu Ramakrishnan is
an important pioneer -- yeah, some of you have noticed. Bing is wonderful for getting
pictures. And so we've got pictures of Raghu through his career from being a young
assistant professor to being a vice president at Yahoo! to being a thought leader at
Microsoft. So these are the stages that we've seen in his career so far.
But he's a pioneer in the area of deductive databases, data mining, exploratory data
analysis, data privacy, Web-scale data stuff. Lots of data stuff.
He's a Ph.D. from Texas. He's an ACM and Packard Fellow. He was a professor at
University of Wisconsin-Madison, Vice President and Research Fellow at Yahoo!, and
now he's a Technical Fellow here at Microsoft.
And so, Raghu, I'll let you take it away. And someone's going to automatically change
this. There it is. Great.
>> Raghu Ramakrishnan: Thank you, Dennis. So as I told Dennis, I'm actually grateful
they chose not to use the photo with the jacket.
All righty. So today I'll tell you about big data in general, but this won't be so much a
talk about Azure, it will be more a talk about the field as I see it and how it's evolving.
Here at Microsoft, I actually now wear two distinct hats. One, I head an applied research
group called CISL, Cloud Information Services Lab, and I also now run the engineering
for the big data teams. Right? So one leg on both sides, I guess.
Let's get going here.
So when thinking about this buzzword big data, what is it really? I think the best way to
wrap your head around it is by thinking of what it lets you do that you couldn't previously
do. Okay.
So I'll begin by giving you a quick overview of a few applications. And I'm just going to
choose some examples I worked with at Yahoo!. Time permitting, I'll say a little bit
about other things. But I'll then segue from that into the tools that enable these kinds of
applications. And that will take me to where I believe the frontier is shifting.
So if you take the tools that have taken us thus far, MapReduce and things that build on
MapReduce, what next? Right? And I'll try and give you a glimpse for some of these.
The last -- the very last part of the talk, time permitting, will be about a project called
REEF that we have been working on here that we just contributed to open source. It's
designed to work with things like the YARN resource manager and the Hadoop
ecosystem. So Microsoft is now pretty heavily involved in the Hadoop open source
world. Okay. Great.
By the way, if you have questions, just put up your hand; we'll talk. Don't hold your
questions till the very end. I'd rather talk with you than at you. So feel free.
Let's look at the applications. So one of the first things I worked on at Yahoo! was this
project called Web of Concepts. And over time at Yahoo! it formed the basis for a
product effort called Web of Objects, or WOO. And similar things are there at pretty
much all the Web companies. Here we call it Web of Things. At Google it is the
Knowledge Graph. Same thing at the end of the day.
The Web used to be this collection of pages, and search was defined as finding those
pages that were closest to your search term by some measure such as TF/IDF modified in
various ways.
Today the game has shifted. Today the expectation is that when you ask for Mumbai,
you want information about the city. Right? So users think in terms of concepts or tasks,
and the game is to understand their intent at roughly the level that they think about it and
to present Web content that is suitably packaged.
So conceptually you want to organize information about the Web, information found on
the Web, around the concepts that you think people care about, right?
So everything about Mumbai is together, not organized by the page where it was found
but pivoted rather on the concept. That's basically the idea. And an index here will be
not so much a token-based index but something that's concept-based. Okay?
As examples of what this would give you, if you go search for Julia Roberts or pretty
much any celebrity, all the search engines will give you some variation of this page. The
most important thing to note is the vast majority of the real estate is based on content that
is somehow in a database that's been put there by retrieving information from all over the
Web, scraping it, massaging it, getting feeds, and then splicing all this together.
The organic blue links that you're used to, those are the first of the blue links. The rest of
it is down there below the fold. Okay?
So whenever possible, you'll try to present this kind of result. Why? The click-through
rates are much higher. Okay. The information here is pretty much on the Web. What's
different about this search result is you have synthesized, you have aggregated
information about the concept in question from many different places and presented it
all at once.
Of course, if they're talking about Julia Roberts, the sister-in-law, there's a bit of a
problem here. Right? So figuring out what they mean when they type in the keyword,
the true concept, this is tricky. So when you're not sure, you present the ten blue links.
Right? You only present something like this when you are really, really sure you know
what they're looking for.
And part of the way you get to be sure, you use your understanding of the content to
present all of these refinements of what they just typed in. Okay? And if they click, then
it's a reinforcement that you indeed guessed at what they were looking for. Okay?
So the work on organizing the information on the Web pays off in two distinct ways:
Helping the user navigate or helping you understand their intent and then delivering the
payload, the actual aggregated search result.
All right. As another example, if you're looking for Idli restaurants near Ann Arbor, it
[inaudible] whatever, a puffed rice cake.
The way this is put together: you crawl the Web, you classify pages as restaurant pages,
within those you classify pages as menu pages, within those you extract terms that refer
to, you know, various kinds of food, and you build an index with each restaurant --
recognizing that you already know this is a restaurant, a concept -- and then within that
you're looking for entities that are menu items. And now you're building an index on a
combination of restaurant and menu item. Right?
So now when a user clicks and says Idli near Ann Arbor, they're actually going to get
restaurants that have Idlis in their actual menu; the menu contains the string. Okay. All
right.
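The indexing idea described above can be sketched in a few lines. This is a toy, single-machine illustration, not the production pipeline: the page data, the `build_concept_index` name, and the (menu item, city) key structure are all hypothetical stand-ins for the real extraction and indexing stack.

```python
from collections import defaultdict

def build_concept_index(pages):
    """Toy concept-based index: pages already classified as restaurant
    pages, with extracted menu items. Keys are (menu_item, city) concept
    pairs rather than page tokens."""
    index = defaultdict(list)
    for page in pages:
        for item in page["menu"]:
            # Pivot on the extracted concept, not on the page it came from.
            index[(item.lower(), page["city"].lower())].append(page["restaurant"])
    return index

# Hypothetical extracted data.
pages = [
    {"restaurant": "Saravana", "city": "Ann Arbor", "menu": ["Idli", "Dosa"]},
    {"restaurant": "Blue Nile", "city": "Ann Arbor", "menu": ["Injera"]},
]

index = build_concept_index(pages)
print(index[("idli", "ann arbor")])  # ['Saravana']
```

A query for "Idli near Ann Arbor" then hits the concept pair directly, which is why only restaurants that actually have Idli on their menu come back.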
Especially as you shift to form factors like mobile where space is at a premium, using
the real estate to deliver highly dense information is critical because people don't want to
scroll.
Let me give you another example. Content optimization. If you take the front page of
Yahoo! traditionally it's been curated by editors. Every link you see there is manually
placed by an editor. I should say was. Today virtually every link you see is placed there
by some algorithm at the point that you click on it. On demand.
This shift -- now, there's another project I worked on while I was there. Let's talk
about this set of four links in particular. That's called the Today module on the
Yahoo! front page. This is the same. The choice of which is the first link and which
are the four links you show is the game. Underneath there is a pool of about a hundred
links that you could show. Editors curate that pool, but the algorithm selects what to
actually show.
The difference in doing this is enormous. The click-through lift by doing this is over 300
percent. Okay? And financially that translates into many, many zeros with a significant
leading digit. Okay?
So today, I don't know, high 90s percent of all Yahoo! traffic goes through
pages that are fully optimized in this sense. What does this really take? Right? On this
page itself, most of the links are algorithmically selected. What does it take to do things
like this? Let me go into a little bit more detail.
You take all the articles, the first time you see them you have no clue who's likely to click
on them, so you go by doing content analysis. You analyze the articles offline. You look
at the historical performance of similar articles. You fit machine-learned models and so
you have an a priori estimate of click-through for a brand-new article.
Then you use this to have prior probabilities in estimating the click-through rate when
you show this to a certain kind of visitor.
You refine these estimates using online statistical explore-exploit algorithms. These are
so-called bandit algorithms, named after slot machines. But the idea is, when you have
an article, you show it to someone, and they either click or they don't click. If you
show it enough times, you have a pretty good estimate of the probability that the next
person is likely to click.
However, there's a delicate tradeoff. The articles decay. They are time sensitive. They
have a lifetime of a few hours. Right? In those few hours, if you spend all your time
estimating the probability, by the time you have a really good sense, it's irrelevant.
So you want to spend the vast majority of your time exploiting the most popular articles
while at the same time parsimoniously exploring to figure out which articles are
growing in popularity, which are shrinking, which are indeed the most popular at any
given instant in time. And the mathematics of how to blend explore and exploit is what
the statistics behind this is all about.
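A minimal way to see the explore/exploit blend is an epsilon-greedy sketch: mostly show the article with the best current click-through estimate, occasionally show a random one, and refine the estimate after every impression. This is a deliberate simplification; the production systems described here use more sophisticated Bayesian bandits with content-based priors and time decay, and the function names below are hypothetical.

```python
import random

def epsilon_greedy(estimates, epsilon=0.1):
    """Pick an article index: explore a random article with probability
    epsilon, otherwise exploit the one with the best current estimate."""
    if random.random() < epsilon:
        return random.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda i: estimates[i])

def update(estimates, counts, article, clicked):
    """Running-mean update of one article's click-through estimate
    after it was shown and either clicked (1) or not (0)."""
    counts[article] += 1
    estimates[article] += (clicked - estimates[article]) / counts[article]

# One impression: article 0 is shown and clicked.
estimates, counts = [0.5, 0.2], [1, 1]
update(estimates, counts, 0, clicked=1)
print(estimates[0])  # 0.75
```

The time sensitivity of articles is exactly why epsilon cannot be driven to zero: with only a few hours of life per article, the system has to keep spending a small, carefully chosen fraction of traffic on exploration.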
The engineering behind this is a different kettle of fish. You have to show articles to
people around the globe on tens of thousands of servers, take the results of something that
happened in Shanghai, feed it into your modeling servers in Sunnyvale, and the next
person in Singapore, right, is influenced by what the person in Shanghai did. The
butterfly in Brazil? It's real. Okay.
So if you take what it requires to do all of this, I'm just going to put up names of some
systems that are underneath this. Hadoop. We use Hadoop for everything from the
actual queries to extract and analyze the summaries through the data pipelines,
getting enormous amounts of data gathered in one location on the globe and then moved
quickly to other locations on the globe. End to end, from the person in Shanghai to the
person in Singapore, it's about ten minutes. Okay?
For data streams, pipelines, NoSQL stores, all of these things that you think about as,
quote/unquote, big data technology boils down to one thing. You have databases
operating, doing the things that databases traditionally do, large-scale queries, large-scale
[inaudible] workloads, right, but now at Web scale. Right?
And some of the criteria -- how important is it to have fully serializable transactions?
What kind of availability demands do you have? These [inaudible] have been changed in
significant ways, to where a brand-new generation of architectures has come into play.
So let's look at that for a moment. What do I mean by Web scale? By the way, the first
example on the Web of Concepts, hopefully the point there was clear. Although I didn't
put up a slide showing the underlying infrastructure, you simply cannot build that kind of
information extraction at Web scale without using technology like Hadoop
extensively. Okay.
What do I mean by Web scale? Here are some numbers. I'm not going to talk to them in
any detail. But basically a hundred billion e-mails a month, okay, geo replicated across a
dozen zones. I'm just going to say very, very, very conservatively, 10,000-plus servers,
right, millions of reads per second of a given page, visits on the order of tens of billions,
right, ad serves, ad impressions, same scale.
This is orders of magnitude larger than the largest enterprise databases have historically
been. Well, it's going to be dwarfed within a few years by what we are going to see from
[inaudible] class of applications, the Internet of Things.
And this encompasses not just things like Web sites, it's going to seep into everything
you do, right? Biology, environmental science. It's all potentially going to be
transformed by the ability to embed sensors of every stripe in everything you do. Your
ability to observe is going to be unparalleled.
The fundamental difference between a Web site and a traditional application like, say,
Word, is observability. If someone sneezes on a Web page, I know how long and how
loud. Right? And that's what I can react to.
That observability is lacking in a package application running on your desktop where no
one can view it. Of course some of you are thinking Office 365, no doubt. Yes, we can
watch you sneeze. And that means many of these techniques are now going to find their
way into traditional enterprise applications. Different story.
Internet of Things means, you know, this story of your refrigerator talking to your
shopping list app on your phone saying we're out of eggs. Now that you're in the grocery
store, get some. This is not science fiction. Your thermostat can sense -- these are
commercially available today -- oh, there's no one in the house, drop the temperature.
And mostly people come back around five o'clock, so let's crank it up again around four
o'clock so it's warm and toasty by the time they're back in.
These are all very doable. Right? It's only a matter of standardization, right, so that it
can be done at scale and cost-effectively enough for all of us to use as opposed to just the
geeks in Silicon Valley. Right? It's simply a matter of the price curves coming down,
and that's a matter of standardization.
The numbers here are mind-boggling. In case you didn't notice, by 2020 they're predicting
there will be 50 billion devices. Right? The amount of traffic on the Internet is going to
be 275 exabytes per day. These are eye-poppingly large numbers.
And all this data, what does it mean? We are able to observe data automatically, capture
this data in ways we never could before. Prior to the Internet generation, data entry was
largely through keystrokes. The volume of useful data you can enter that way is
minuscule. Right?
Now data capture has become trivial and ubiquitous. Everything is observable, for better
or worse. The cost of hardware has plummeted. And our technical ability to build
scale-out analytic systems, which is what I'll talk about for the rest of the talk, means we
can actually do useful things with all this data in domain after domain after domain. And
it's that confluence of things, I think, that's made this whole field explosive.
Lastly, don't underestimate the power of the cloud. Right? You can have all the
technology in the world that lets you operate a farm of 10,000 servers. But if you need to
buy those 10,000 servers and install them yourselves, you're going to be a long while
doing it. The cloud takes away that last barrier. You can rent on demand. Right?
Someone else will run that farm for you.
So with that setting the stage, let's move on. I'll skip this. Here's an example from
Microsoft, actually, what does it mean to have an operating system for all the sensors that
go into your house, how do we standardize this. As I said, standardization is the next
frontier. Already lots of people are thinking about this. It's going to happen. Okay?
So what kind of an architecture do I envision all this will require? First, I think, in
contrast to traditional database systems, you can't have things siloed: your analytics
stores, your transactional stores, and for analytics, many different kinds of back ends.
All this raises the barrier to use.
You're going to have a digital shoebox. Or some people call it a pool or a lake or
whatever metaphor you use. One place where you can put a diverse range of data, be
it documents, be it multimedia, be it traditional relations, be it streams, be it graphs, right,
and you can put any amount of data in the same place. It will just expand and hold it. And
not only hold it, allow you to retrieve efficiently. Okay?
Second, that efficiency part is going to push the frontier in terms of the technology. A lot
of the scale-out to date owes to work in parallel databases and to work on parallel
distributed file systems like GFS.
And the intuition essentially is spray your data uniformly across a bazillion machines and
then shard your computation accordingly, break up your query, for example, into little
tiny pieces on the shards of data that that larger query touches. So if you scan a file and
the file is on a thousand machines, then your query becomes a thousand little baby
queries, each running on a shard of your data.
Okay? That simple principle carried us a long way. But now in addition to this spread,
there is a depth. If you look at data that's on local disk, all the file systems we talk about
essentially stop there. There are three copies on the local disks of three different
machines. That's as far as the metadata in HDFS gets you.
But now, whether something is on local disk or in main memory or SSD or NVRAM
makes a huge difference. It's not only performance-wise; these different forms of memory
are becoming cheap enough that you can have scads of main memory, for example, loaded
onto a machine. And if you don't make use of it wisely, you will not meet the
performance criteria that you need on many of these applications.
Furthermore, on the storage side, you don't stop with a single machine, of course.
You have remote storage. You have tape. You have all kinds of storage that are farther
away, slower, but much cheaper. Right? This means if you really want to store
everything without worrying about the cost, your buffer management becomes
crucial. You're now going to see buffer management on steroids. You know, very far or
very near and across tens of thousands of machines. This is the challenge of tiered
storage. Right?
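The core mechanism behind tiered storage can be illustrated with a toy two-tier buffer: a small fast tier (think DRAM) in front of a large slow tier (think SSD or remote storage), where LRU eviction demotes pages instead of discarding them, and access promotes them back. This is only a single-machine sketch of the idea, with a hypothetical class name; a real tiered buffer manager spans many storage classes and thousands of machines.

```python
from collections import OrderedDict

class TieredBuffer:
    """Toy two-tier buffer manager: LRU in the fast tier, with demotion
    to an unbounded slow tier on eviction and promotion on access."""

    def __init__(self, fast_capacity):
        self.fast = OrderedDict()   # page_id -> data, kept in LRU order
        self.slow = {}              # "farther away, slower, but much cheaper"
        self.fast_capacity = fast_capacity

    def get(self, page_id):
        if page_id in self.fast:
            self.fast.move_to_end(page_id)          # refresh LRU position
            return self.fast[page_id]
        if page_id in self.slow:                    # promote on access
            self.put(page_id, self.slow.pop(page_id))
            return self.fast[page_id]
        return None

    def put(self, page_id, data):
        self.fast[page_id] = data
        self.fast.move_to_end(page_id)
        while len(self.fast) > self.fast_capacity:
            old_id, old_data = self.fast.popitem(last=False)
            self.slow[old_id] = old_data            # demote, don't drop

buf = TieredBuffer(fast_capacity=2)
buf.put("a", 1); buf.put("b", 2); buf.put("c", 3)   # "a" demoted to slow tier
print("a" in buf.slow)  # True
```

The interesting design question raised next in the talk is what replaces the LRU policy here when the workloads are not just SQL scans but graph processing, streaming, and machine learning.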
And, oh, one last little detail. Historically database buffer managers have been tied
crucially to one class of workloads, SQL queries. Now, as I'll describe in a moment,
you're going to have a plethora of analytics from graph processing to streaming to
machine learning to SQL queries, and you need a buffer manager that can take you a long
distance towards all of these workloads.
So as a database person, you know, my cup floweth over. This is lifetime job security.
Okay. Of course, however beautiful your store, people will always have good reasons for
keeping their data in other places as well. Maybe on-prem. Maybe in some specialized
store that they're locked into for whatever legacy reasons. It's up to us to make access,
maybe not every bit as efficient, but at least functional access, available to all data
no matter where it is.
Now, on top of it, from the end users' point of view, they care about many different
programming paradigms. SQL is ubiquitous; if you take things like Hive and Pig, they're
just variants of SQL, in my opinion. MapReduce has gained a lot of mind share. But let's
talk for a moment and think about MapReduce.
How many of you know about MapReduce already? Okay. Great. So I can be really
brief. MapReduce is just SQL's group by. At a very surface level, that is indeed true.
But there's a bit more to it. First, when you group the underlying rows in the group, if
you preserve them as opposed to distilling them through an aggregate, that is the map
step. And, subsequently, within each group, rather than applying one of a predefined set
of aggregates, if you could write arbitrary user code to run against a partition, that's
reduce.
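The group-by analogy above can be made concrete in a few lines: map emits keyed values, the shuffle groups them by key (that's the GROUP BY), and reduce is arbitrary user code over each group. This is a single-machine sketch of the programming model only, not the distributed runtime; the function names are illustrative.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Minimal in-process MapReduce: mapper emits (key, value) pairs,
    the shuffle groups values by key, and the reducer runs arbitrary
    user code over each group's values."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):       # map step
            groups[key].append(value)           # shuffle / group step
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count, the canonical example: the reducer happens to be an
# aggregate here, but it could be any user code over the group.
lines = ["big data", "big deal"]
counts = map_reduce(
    lines,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda word, ones: sum(ones),
)
print(counts)  # {'big': 2, 'data': 1, 'deal': 1}
```

Note what is not in this sketch: partitioning across thousands of machines and recovery from failed tasks, which, as the talk argues next, are the real contributions of MapReduce systems.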
So MapReduce and SQL group bys indeed are very similar. The language construct
owes, of course, to lambda calculus. But if you take all of this, you're still missing the
essential contribution in MapReduce, which I think are twofold. First, an enormous
range of user code can be effectively parallelized in this manner. That's a deep insight.
Right?
For the class of things people did at the Web search companies -- Alta Vista, Bing,
Inktomi, right -- this was well known. The genius of the MapReduce abstraction was to
make it broadly available.
The second: when you parallelize at the scale of thousands of servers -- traditional
database systems didn't go above tens, right? -- the scheme of restarting everything
when any one part fails is not good enough. You have to plan for failure. So
failure-centric architectures were the other big contribution of MapReduce systems.
But language-wise today, MapReduce is folded into SQL extensions like Hive or Pig or
here in Microsoft something called Scope. Okay? And if you look at the statistics, 95,
98 percent of all MapReduce queries at Facebook, at Yahoo! they're really Hive or Pig
queries that have been translated into MapReduce. So end users using MapReduce is a
vanishingly small case. Right? The revolution has been somewhat different. But those 2
or 3 percent where you still have to default to MapReduce do turn out to be significant,
which is why all of these languages support essentially MapReduce in slightly different
syntax.
Okay. Coming back. Stream processing. Realtime analytics is now again becoming
mainstream. Things like roll-up, drill-down BI. It's a $2 billion business for Microsoft
alone. Right? As a market, it's much bigger.
Machine learning is growing explosively. Right? When you have so much data,
everyone kind of understands, hey, human analysis, manual analysis can't cover all of it.
What can we teach machines to look for on their own?
If you take this diversity of analytics over all the variety of data that you expect to store
here, the underlying computational substrate -- how do you build it? Are you going to
build a different stack for SQL, for stream processing, for machine learning, or can you
share some steps in between? Right?
This compute fabric is an area where there's a lot of research going on today. Okay?
And I'll say a little bit more about it. And this is the new look at virtually any big data
player today. You'll see some variant of this slide in how they think about the space.
Okay? It's not unique to us.
So here's a link, actually, where you can go look at what many, many people are doing
in this space.
All right. In the second half of this talk, I'll talk a little bit more about that compute
fabric. And essentially here's the question I'll try and address. Given that there are
enormous amounts of very diverse data being analyzed in a single application -- mind
you, right, at different stages in the pipeline I use SQL, I use machine learning, I use
streaming systems, I use graph analytics -- what's the system underneath? Am I going to
see a single SQL box that all of a sudden does streaming, does machine learning, does
this, does that, everything, or am I going to see a whole federation of systems built
completely standalone which I mix and match on my own dime?
I think either of these is unrealistic. The first will simply not allow for the agility that
we see in this space, and the systems will simply not be usable. Right? SQL by
itself was being criticized for having everything but the kitchen sink. When you take all
of these other things and slap them all together, it's unmanageable.
The second alternative: go ahead and build your stacks by yourself in each domain -- I
mean, in each type of analytics -- and let the end users figure out how to mix and match.
End users will balk. They cannot. Right? You need to make interoperability across, say,
SQL and machine learning a first-class design consideration and make the edges
seamless.
That means the underlying computation fabric needs to satisfy some criteria. The
intermediate computation in a SQL query must be something you can pipeline to a
MapReduce step, for example. The iterations in machine learning, you must be able to
seamlessly pass along.
So these kinds of considerations, where do they lead us? What I think is going to happen
is an evolution of what we are already seeing in Hadoop with YARN -- how many of you
have heard of YARN or Mesos or -- okay. Fewer people. So let me speak briefly about
this.
If you take the original implementation of MapReduce in Hadoop -- Hadoop was an
open source implementation of the GFS/MapReduce stack from Google, and was
largely done at Yahoo! -- that original implementation was monolithic.
Right? The programming paradigm of MapReduce, the implementation, all the way
down to the bare metal, it was one composite thing.
Then along came Pig which was Yahoo!'s variant of SQL. In parallel there was Hive
which was Facebook's version of the same thing.
These higher level languages, they had a choice: Do we implement them from scratch?
Oh, boy, it's going to take a lot of effort. Right? What if we just translate Pig queries
into MapReduce programs? That was the original implementation approach. They
simply translated it and then executed the translated program.
Now, there's a price here. If you take a Pig program and translate it into MapReduce,
versus taking the same Pig program and, as a human being, rewriting it in MapReduce
from scratch, the difference between these was in the early days easily over a factor of
10 in performance. Today it's still a nontrivial tax you pay. Not as big.
It didn't matter. Right? As I told you, from the very earliest days, people started using
these higher level languages more than MapReduce, to the point where today well over
90 percent of MapReduce programs are translated from them. Which led to the
obvious question. Maybe we should think about implementing languages like Pig and
Hive natively. Then again, do we need to start from scratch? Are there things that are
common? Yes, there are.
So think about how these programs work. A user submits a job. You know that the job
is going to be broken up into baby jobs running against parts of the data. To do so, you're
going to have to go to the underlying cluster and say, hey, give me compute containers
with some constraints.
So let's take a very simple example. I want to scan a very large file and compute the
average of some column in that file. Right? So my data is broken up across a thousand
machines, and what I'm going to go and say is, hey, give me a thousand slots, one on
each of these thousand machines, so I can use each slot to compute the local aggregate
and then sum them all up.
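The reason this sharding works for AVG is that the average decomposes into (sum, count) partial aggregates that can be combined later. Here is a minimal single-process sketch of that decomposition, with lists standing in for the shards on a thousand machines; the function names are illustrative.

```python
def local_aggregate(shard):
    """Runs in each slot, against that machine's shard of the data:
    a partial (sum, count) pair, not a partial average."""
    return sum(shard), len(shard)

def global_average(shards):
    """Combine the per-shard partial aggregates into the final average.
    Averaging the per-shard averages would be wrong when shards differ
    in size; summing (sum, count) pairs is what makes AVG decomposable."""
    partials = [local_aggregate(s) for s in shards]  # one baby query per shard
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

shards = [[1, 2, 3], [4, 5], [6]]   # stand-ins for per-machine data
print(global_average(shards))       # 3.5
```

In the real system, `local_aggregate` is the executable installed into each negotiated container, and `global_average` is the final combine step.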
So first you go negotiate. You get the resources. Then you install the particular
executable for your query in each of these thousand slots. And then off you go.
What you put into those thousand slots is specific to this particular query, more generally
to SQL. The SQL engine is something that does this. But the actual negotiation with
some owner of the underlying resources, this is common. This realization led to resource
managers like YARN, which was the Yahoo! version; Mesos, the Berkeley version;
Corona, the Facebook version; Omega, the Google version -- I mean, there's a pattern
here, right? Everyone saw the same writing on the wall.
And the net of it is now the resource managers form a common substrate. No one has to
do resource management separately. They sit on top, get their resources. From there on,
they do their own thing. Okay?
This also provides some other advantages. If you have a multi-tenanted cluster, priority
can be set by whose needs are most important regardless of whether they're issuing a
SQL query or a machine learning query. You can interleave very different types of
computation.
You can have heterogeneous [inaudible] pipelines all executing on the same cluster.
There is no notion of these machines do only SQL; those machines do only machine
learning. Right? Lots of things followed. Okay. So that's resource management.
Let's talk about YARN very, very briefly. I said some of this already. Right? Multiple
kinds of engines can share the same pool. And I use the word container or slot
interchangeably. Right? If you take this, the underlying algorithms for scheduling --
deciding whether Dennis's request trumps Raghu's request; it should, of course -- this
is a lot of research. Okay?
Now -- I need to have a technical slide here. Come on. I'm not wearing a coat. So here's
an example of some work that we actually did on preemption. And I'm just going to give
you a brief feel for this.
It will also help me illustrate some of what goes on in MapReduce. So in MapReduce, it
works in waves. First you take your data, you partition it in a way that's appropriate to
your problem using the map step. The mappers all execute as independent tasks. When
they are done, the reducers come along and take each partition and do something else,
and what you are left with is a collection of partial results which you sum together and
you get your final result.
Let's look at this graphically. Here are the mappers: each line is a mapper that
starts here and ends here; this mapper begins and ends, that one begins and ends, and so
on. These are all mappers. Each row here is a particular machine and what's going on on
that machine. Okay?
This of course is time. These are stragglers. It's well known that the [inaudible] of map
jobs in a map phase have a huge impact on overall performance. Let's see why.
In standard MapReduce -- in all of these systems, by the way, to begin with, preemption
was not part of the story. So once you allocate a slot to a task, you simply wait till the
bitter end until that task completes and returns a result to you.
So in that regime, let's see what happens. You allocate all these slots, the map steps of a
job. They're all going to go on. Some of these finish before others, so you can reuse
some of those slots, right, so some of these rows are overlaid on some of these rows.
Okay? In fact, on some of these machines. The reduce step starts. But now somewhere
here let's say those reducers get blocked because they're waiting for the straggler to
compute.
Remember, in a group by, if you're consuming the result, you need to be sure your group
is complete. When you get there, you hit a hard wall. In sorting, the exact same story.
You guys know what I'm talking about.
So here these machines lie fallow. Why? The choice is to nuke all these jobs and start
over -- which they do sometimes; it has its own consequences -- or you just wait to be
unblocked. When finally the straggler completes, you can resume and then the remaining
reducers are scheduled as well and the job eventually completes.
Now, this red area starkly illustrates the impact of a straggler in the absence of
preemption. Okay. So you can begin to see the subtleties under the covers.
Okay. Now let's look at this option instead. If you could preempt, if you could say at
this point, you know, save the state efficiently and use these machines to do something
else because you're blocked on these particular tasks, what could you do, right?
If you were to do this, you could continue -- you could get some of the
other reducers going on the same physical machines here, right, the ones whose reduce
tasks have now been paused.
You get what I'm saying. And therefore instead of sitting on your fanny, you could
actually be doing some useful stuff, which means the whole thing completes sooner.
Okay?
There is another assumption here: that we can efficiently checkpoint the work done by these
guys. So you need a checkpoint mechanism. But if you had that, you get a significant
improvement in performance. Right?
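As a sketch of what such a checkpoint mechanism needs -- save a task's partial state, free the slot, resume later, possibly elsewhere -- here is a minimal toy in Python (hypothetical names; a real system would serialize to durable distributed storage, not an in-memory blob):

```python
import pickle

class CheckpointableTask:
    """Toy preemptible task: it can save partial state, give up its
    slot, and resume later on another machine."""

    def __init__(self, items):
        self.items = items
        self.index = 0    # how far we've gotten
        self.partial = 0  # partial aggregate so far

    def run(self, budget):
        """Process up to `budget` items; return True when finished."""
        end = min(self.index + budget, len(self.items))
        while self.index < end:
            self.partial += self.items[self.index]
            self.index += 1
        return self.index == len(self.items)

    def checkpoint(self):
        # Serialize just enough state to resume from where we stopped.
        return pickle.dumps((self.index, self.partial))

    @classmethod
    def restore(cls, items, blob):
        task = cls(items)
        task.index, task.partial = pickle.loads(blob)
        return task

# Run part way, get preempted, checkpoint, resume on a "new machine".
task = CheckpointableTask(list(range(10)))
task.run(budget=4)          # preempted after 4 items
blob = task.checkpoint()    # slot is now free for other work
resumed = CheckpointableTask.restore(list(range(10)), blob)
resumed.run(budget=100)
print(resumed.partial)      # 45, same as an uninterrupted run
```

The point is that the resumed task produces exactly what an uninterrupted run would, so the scheduler is free to reclaim the slot in between.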
This is the kind of stuff that's going on as we speak. These systems, for all their
impressive characteristics, have a lot of traditional stuff they need to incorporate, let
alone breaking new ground in many ways.
So this particular scenario we showed, it had a one-third improvement.
All I want to say in the next few slides is, you know what, all of this work, we actually
contributed to Hadoop open source, Apache open source. Here are the [inaudible] for
those of you who are interested. This is just to kind of make the point Microsoft is
serious about giving back to open source. So we are very much involved in Hadoop both
as a consumer -- we offer a product called HDInsight, which is similar to EMR, right?
So we consume from open source. And, as you saw in this example, we also give back.
In fact, several of the Hadoop committers work at Microsoft.
All right. Switching gears now. I've taken you all the way through YARN. What next?
Well, let's follow the trend. I now want to use YARN or a resource manager like YARN
to build SQL, to build machine learning. Have I gotten everything in common that I
could? From here on, do I have to build custom versions of each stack?
Hopefully not.
So the REEF project was an attempt to push the common substrate one level higher. There
are a lot of things -- for example, if I have a thousand tasks, which of them died? Just
monitoring them can be a chore. Restarting them automatically can be a chore. Checkpointing
can be a chore. How many of these things can you package into a collection of libraries
that are common? What are the further deeper benefits of doing this? That's what the
REEF story is about, and I'll try and walk you through that.
Okay. Again, this is a Microsoft project, done in CISL and now going on in the big data team
as well, and again we have open sourced it through Apache.
So let's take a concrete example of machine learning. User activity modeling. You want
to find out what a particular user likes. Okay. And you infer this by looking at the pages
they browsed, the queries they asked, and the ads they ignored or clicked. Mostly people
ignore ads.
If you look at the number of pages per user, a few tens. The number of possible pages
there are in a typical Web site, millions. Right? The number of queries that users
collectively ask, hundreds of millions, whereas what a given user actually asks, relatively
few.
The point being these are highly sparse. But these are the observations you have to work
with.
How do you learn from this? Imagine this time window in which the users' activity is
overlaid and you gather this from various kinds of logs, Web logs or search logs. And in
any given window, you look at the things that you can observe. Oh, the user just issued
this kind of query or visited this kind of Web site. Based on that, what are they going to
do next? That is the crucial question. This is what user activity modeling is going to
provide for you. Features that given this will allow you to predict this. Okay?
So if someone visited finance and issues a query about the stock market, chances are
good they'll sign up for E*TRADE. Okay? I am oversimplifying, but you get the idea.
How do you build such models? You take your logs and slide this window step by step,
and each slide of the window, right, gives you a case to learn from.
So the very first step is to gather all these logs, clean them up, deal with things like robot
clicks and winnow them out of the way. And then slide, slide, slide, each time extract a
row in your training database.
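That slide-and-extract step can be sketched like this (toy Python with hypothetical names; real pipelines do this over terabytes of cleaned logs):

```python
def training_rows(events, window):
    """Slide a window over a user's event log; each position yields
    (observed events, next event) -- i.e., features and the target
    we want to learn to predict."""
    rows = []
    for i in range(len(events) - window):
        features = tuple(events[i:i + window])  # what we observed
        target = events[i + window]             # what happened next
        rows.append((features, target))
    return rows

log = ["visit:finance", "query:stocks", "click:broker-ad", "visit:news"]
for row in training_rows(log, window=2):
    print(row)
```

Each slide of the window contributes one row to the training database.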
This is a ginormous task, and it is preparatory to any kind of modeling. Okay.
And if you look at the times involved, acquiring the data takes several hours. These are
somewhat old Yahoo! numbers, but you'll see very similar numbers from Bing or from
Google. Feature and target generation, right -- roughly each feature window has
about a terabyte. This work, extracting the cases, takes about four to six hours. The
actual model training, what we write papers about, this takes one to two hours. And not
just for a single model; while you're at it, you may as well build hundreds of models,
okay, because you're going to evaluate them all through bucket tests.
Just think about this for a moment. When you think about the nexus between big data
and analytics or machine learning, this slide is something that should be burned into your
soul. Okay? Machine learning academically is all about the algorithms. Machine
learning for real is all about the plumbing. Right?
And this is why having your scalable data management systems be seamlessly integrated
with your machine learning frameworks is crucial.
Okay. Let's look at this in a bit more detail. One of the things I looked at at Yahoo! was
phishing, mail spam, and the like. So here's an example about a spam filter. You have
your inbox, and the algorithms, the spam filters, try to take a given user's mail and fork
it into real mail and spam. Okay?
The user actually sees this through the mail front end. Okay? Occasionally you screw
up. You take spam and put it in the inbox. Even more occasionally someone will
complain. They'll mark that as spam. And that's your signal to learn from.
Okay. Now let's fast-forward. What does it take to learn from that? You need example
formation, modeling, evaluation, the same cycle you saw earlier. And if you look at
this -- I'll start moving a little bit faster here -- the example formation can really be done
reasonably efficiently using systems like Hadoop. It's really a gigantic join between the
mailbox and the spam signals.
You can do this work. Hadoop helps you do this work. Okay?
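That join can be sketched in miniature (toy Python with hypothetical names; at scale, this is exactly the kind of work Hadoop does well):

```python
def label_examples(mailbox, spam_reports):
    """Join the mailbox against user spam reports to produce labeled
    training examples: label 1 if the user reported it, else 0."""
    reported = set(spam_reports)
    return [(msg_id, body, 1 if msg_id in reported else 0)
            for msg_id, body in mailbox]

mailbox = [(1, "meeting at 3"), (2, "cheap pills"), (3, "lunch?")]
examples = label_examples(mailbox, spam_reports=[2])
print(examples)  # message 2 labeled spam (1), the rest ham (0)
```

One pass, no iteration: a map-reduce-shaped join over two big inputs.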
If you take a modeling step, different story. Modeling is not so easy. Modeling is
massively iterative, and MapReduce doesn't really support iteration. In between
iterations you need to write to disk. So nowadays you have seen things like Spark and
Shark that try to keep things in memory to avoid writing to disk in between iterations.
Okay?
The good news is MapReduce supports one of the basic models for machine learning, the
Statistical Query Model of Michael Kearns, pretty well. Other models, they don't map so
well to MapReduce. Net of it: even a single model is iterative, but there's an outer loop
as well -- a prior model gets updated, you try again, you try different kinds of
features, try again. This whole iterative cycle is not really supported in MapReduce.
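Here is a minimal illustration of why: even the simplest training is an iterative loop that carries the model in memory across repeated passes over the data -- the state MapReduce would force to disk between iterations (toy Python, a 1-D least-squares fit with hypothetical names):

```python
def train(data, lr=0.1, iters=50):
    """Gradient descent for 1-D least squares. The model `w` lives in
    memory across iterations; each iteration is itself a full pass
    over the data (a map-reduce-shaped aggregation of gradients)."""
    w = 0.0
    for _ in range(iters):
        # "Map": per-example gradient; "reduce": average them.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

# Data lying on y = 2x, so training should converge to w = 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = train(data)
print(round(w, 3))
```

Each pass fits MapReduce fine in isolation; it's the loop around the passes, plus the outer loop over feature sets, that the framework doesn't support.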
So going a little bit faster, the net of it is what I just told you. It can be used, but not easily,
not efficiently. What has this led to? People cheat. People are infinitely creative. They use
map-only jobs, no reduce step. What the heck is going on here? They're essentially
getting threads from a ginormous thread pool and having a party. Okay? You can do
anything you want, just give me the resources and get out of the way.
But if you do this -- I hate animations. Okay.
These are examples. For those of you familiar with machine learning, there are things
like (All)reduce and decision trees which also could impose a complex communication
structure across many tasks.
You could design your own. Each of these is a map-only task. You establish your own
communication channels. If one of those boxes dies, you're in doo-doo. You deal with
it. All of these things that MapReduce supposedly gave you as part of the framework you
do all over again because you're abusing the system to make it jump through hoops it
wasn't designed to. Okay?
So, all said, let me cut to the chase here. I could go on through more and more
of these examples. All this leads to unhappiness on both sides. The people who do this
kind of abuse -- not because they want to be abusers, but because they don't have options --
have a fault tolerance mismatch, so they have to redo fault tolerance or live without it.
There's a resource-model mismatch, so they're just building their own; there's no notion of,
for example, a tree of tasks, and MapReduce essentially treats them as map-only jobs.
They have to deal with networking, cluster memberships, bulk data transfers. Pretty
much all of the work.
For the cluster, life is not so great. For example, the network usage patterns in these
kinds of applications are very different. This leads to problems. Consider
what happens when you need, say, gang scheduling. If you need a collection of nodes to
be given to you in order to proceed -- this happens, for example, in Giraph, the graph
processing system -- what does Giraph do? It gets resources one by one from the
underlying resource manager until it has enough.
Meantime, it squats on the resources it has. This means the utilization of the system
suffers. Everyone else is affected because you have given someone resources
piecemeal when their requirement is all or nothing: give me at least a hundred.
Right?
So how do you deal with all of this? What we really need is this intermediate stage that
complements YARN, right, to facilitate development of applications on top of YARN
and lets you build pipelines of these different kinds of computations.
So let me illustrate this with a concrete example, and then I'll conclude.
In this example, the job driver is really the control logic of the example program. The
activity is the user code -- this is what you would execute in a MapReduce task, your actual
code running there. The evaluator is simply the container; it's a REEF term just saying it's a
REEF-controlled evaluator. Okay?
Now let's consider the simplest of tasks. Let's say you want to do distributed shell.
Run the same command on two different machines. Here's what would happen in a
scenario with REEF. You want to run this command. You'd first come to a cluster that
knows about Hadoop, YARN, REEF, all these things. And when you come in, your client
is executing, trying to do distributed shell.
You submit your job. The first thing that's going to happen is you will launch a REEF
container on one of the machines. The very first thing that REEF container will do is to
launch this special program called the job driver which includes your code to orchestrate
the logic of your program.
The job driver now negotiates on behalf of the job, right, with the resource manager. The
negotiation results in tokens it can use to create tasks. It will get containers through the
resource manager, and on these it will install some libraries, it will install the actual user
code, right, your task, and it's able to run this task on this machine.
Imagine the same thing happening on all the other machines in parallel. Okay? Now
your command is running on these two machines. And instead of a shell command, you could
be running anything at all you wanted across these two machines. Or 20,000 machines,
for that matter.
These tasks are now going to communicate with the job driver through a regular heartbeating
mechanism, which primarily allows the job driver to take over the mundane task of
babysitting these bazillion tasks, restarting them should one of them fail.
And the restart may happen on a different machine. Because if a task fails,
the job driver will detect this, go back to the resource manager, get another container,
and restart the activity.
If you want the state that was lost to be available, you will explicitly use one of the
checkpoint commands to save the state durably across machines and that will then be
available when the job is restarted.
So in a nutshell, these are all capabilities that make it possible to write, say, a SQL engine
or a new implementation of Giraph or a new implementation of a machine learning
algorithm at scale.
One last thing, and this is a really, really, really important thing. Okay. Let me get past
the -- ah. Here. So let's say this particular task completes. You still have the evaluator.
So REEF essentially acts as the middleman between you and the cluster manager. And
when you complete one of the tasks you have installed in a container, you have the
opportunity to install a follow-up task.
So think iterations in machine learning. After one iteration completes, your container is
available with metadata about what state you have left over from that iteration. You can
install the code for the next iteration. The data never goes to disk and comes back.
You're iterating in main memory.
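A toy sketch of that pattern -- a long-lived evaluator that keeps state in memory and accepts follow-up tasks (hypothetical names, not REEF's actual API):

```python
class Evaluator:
    """Toy REEF-style evaluator: a long-lived container that retains
    in-memory state between tasks, so iterations never touch disk."""

    def __init__(self):
        self.state = None  # survives across task submissions

    def submit(self, task):
        # Each follow-up task receives the previous task's state.
        self.state = task(self.state)
        return self.state

ev = Evaluator()
ev.submit(lambda state: 1)                 # first iteration: initialize
result = ev.submit(lambda state: state + 1)  # follow-up reuses state
print(result)  # 2
```

The second task picks up exactly where the first left off, without the write-to-disk/read-from-disk round trip between iterations.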
This by itself, either doing this or running in something like Spark, gives you a 30X
improvement in performance. Okay?
Now, when you pull all this together, what you effectively have is the REEF system. In
the interest of time, speaking of checkpoints, I'm going to checkpoint right here. Okay?
For those of you familiar with Rx, there's a lot of similarity between Rx and the underlying
APIs. Net-net -- if you're interested, all of this is available through GitHub.
Send me or Dennis mail and we'll send you a link.
Let me pause here. Today the main message I want you to take away: over the next five
years everything is going to revolve around data. The kind of data is diverse. Right?
could be data that's very specific to what you're doing. But what is universal is our world
is becoming data driven. And this is going to require us to develop systems to manage
data of a diversity and a scale that's unprecedented and to provide analytic tools. Here's
one other point. If everything is data driven, a corollary is that domain scientists -- who
care about the domain science, not about learning ever more complicated versions of SQL and
this and that -- are going to be using these systems.
That means you need to build domain-specific languages. You're going to see a plethora
of analytics, is my guess. So having the kitchen sink is not an option. Having many
tailored systems, if that's where you're going, these tailored systems inevitably will have
to do some part of their work in concert with other systems. Right?
Everyone, for example, will have to live with SQL. Right? How do you support these
kinds of diverse analytic engines on top of a common compute fabric, a common caching
capability, a common distribution, checkpointing, fault tolerance, right? All the major
players are thinking about this in the cloud, Google, Amazon, Microsoft, they're all in the
game. It's a fun time to be working on this kind of stuff. So I'll pause there.
>> Dennis Gannon: Let's thank Raghu.
[applause]