>> Wenming Ye: Well good morning. Welcome back.
>> Wenming Ye: My name is Wenming Ye. I am a senior research program manager, and I am part of
the team here at the Azure for Research program. So the first session here is on analytics—analysis and
simulation—and I have a personal interest in HPC, as well as big data. So let me introduce three great
speakers here today, and the first is Doctor Roger Barga, and he is a senior manager here
on one of the machine learning teams here, and he was formerly a Microsoft Research member.
>> Roger Barga: ‘Kay, thank you. Alright, so hopefully the microphone is working now?
>> Wenming Ye: Yeah, awesome.
>> Roger Barga: And if the screen will come up? So I think everybody here has a strong interest in e-science, and it's something I share as well. And in fact, it's kind of interesting to look at, and what I'll be
talking about, really, is the intersection of e-science, data science, and machine learning. And it’s
interesting, if you look at these companies that call themselves data science companies, to say, “Hey,
where do these people come from? Who were their first hires?” DJ Patil—who basically built the
LinkedIn data science team and has written several books on that—he came from National Labs, and so
many of the actual, very first people in data companies—at Facebook, Microsoft, and other
companies—come from National Labs, because what we teach in e-science—and what you teach in
science, which is how to get insight out of data, how to manage large amounts of data, make sense of it,
wrangle it into some shape or form—is exactly what you need for data science. And here, you see
something interesting happen in the data science community: we’re formalizing curriculum. I helped
build a curriculum for data science in collaboration with Bill Howe back there for University of
Washington. I’ve since had a chance to talk to other universities about how to build data science
curriculums, and you look at the rigor—and in fact, at University of Dundee, I talked with them as well
about how to build a data science curriculum, and they’re actually spawning it off of their e-science
group—and it’s interesting here to say, “What can the e-science community—look back—and what
trends can you tap into and tools can you tap into that are rising in data science and pull that back into
the sciences so maybe there’s a virtual cycle going on here?”
So my talk's a little bit about that. What I wanted to share is a little bit about a Microsoft perspective,
and in particular, about machine learning, 'cause there is a silent—it depends on what size of
company you're at and where you're working—but there's a silent revolution going on in computer science
around machine learning. And in fact, we'll argue—and Gartner, Forrester, and others who are tracking
this—to say it’s going to be one of the most impactful technologies we have; it’s going to … I mean, it’s
going to have the most impact on IT and systems over the next ten years—machines that actually learn
from data. And we’ve got lots of that. Just to frame it—just to step back a little bit and say, “What
motivates this?”
You know, machine learning: one of the first, most successful applications of machine learning—it’s kind
of a preamble we put in a lot of our talks—goes back to, basically, handwriting recognition and the post
office. And the very first systems that would do this—or optical character recognition—were
basically heuristic, rule-based systems: imperative, declarative code that would actually look for
the letter, look at a rule base, look at the features. And of course, every variation that was
thrown at the system and broke it meant a bug report and a feature request: a developer
had to write more code—clearly, this was not sustainable in the long term, but that's how, basically, we
knew how to build systems, to the best of our knowledge.
And it wasn't until, basically—I would say—the late eighties, early nineties that they started to treat this
as a machine learning problem. And in fact, DARPA actually had some funding to fund
people to do work in this area. They said, “Hey, let’s actually take a machine learning approach. Let’s
just take every instance of the letter—we’ll build a training set for it.” Because again, machine learning
systems can start out real crappy, but they get better with experience. And they … this creates the
training set or the experience that this machine learning system is going to learn from. So we’ll take
every known example of this letter that we can find; we’ll do various techniques to rotate and add noise;
and of course, the labels will stay the same—they’re all just variations of a two—we’ll then feed this into
a machine learning system; and voila, out comes a pretty okay—you know—system. We didn't have to
write a line of code; we treated this as a data problem; and we used machine learning to understand the
correlations—the signals—inside the data.
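[Editor's aside: a minimal Python sketch of the training-set augmentation just described—rotate each labeled example and add noise, keeping the label unchanged. The data and function names are hypothetical, not the postal system's actual code.]

    # Hypothetical sketch: grow a digit training set with rotated and noisy variants.
    import numpy as np
    from scipy.ndimage import rotate

    def augment(images, labels, angles=(-15, -5, 5, 15), noise_sigma=0.05):
        """Return the original examples plus rotated and noisy copies."""
        out_images, out_labels = list(images), list(labels)
        rng = np.random.default_rng(0)
        for img, lab in zip(images, labels):
            for angle in angles:
                out_images.append(rotate(img, angle, reshape=False))
                out_labels.append(lab)  # still the same digit, e.g. a "2"
            noisy = np.clip(img + rng.normal(0, noise_sigma, img.shape), 0, 1)
            out_images.append(noisy)
            out_labels.append(lab)
        return np.array(out_images), np.array(out_labels)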
And these are the systems that are deployed today. And what happens if it fails—if a letter gets kicked
out? Well, somebody actually labels it, puts it back in the system, and turns the crank again,
and the system just gets smarter. So mistakes actually make the system smarter and improve its
performance. In fact, a modern-day implementation of this is the Bing language translator that you can
download off the Windows Azure store. And no matter where you’re travelling, you basically … you can
take a snapshot of a menu, and it’s going to translate it into the language that you choose—all through
machine learning. Again, no lines of declarative or imperative code written; it was treated as a machine
learning problem.
Now a more modern implementation—something you just could not have done by writing code—is,
basically, the Microsoft Kinect. Now obviously, this has been a great commercial success, but if you
step back and looked at the technology advance that Microsoft Research had to contribute to this to
make it happen: basically, think about all the different body postures—all the different body shapes—
that have to be recognized—I mean, this is a really, really diverse dataset—different ages, different
obstructions, pulling the background from the foreground, people overlapping each other. This whole
implementation for the Microsoft Kinect sensor on the Xbox was treated as a machine learning
problem, where they captured training samples from individuals wearing suits like this. Close to eight
hundred million were actually captured, and even that wasn’t enough to get the performance out of the
machine learning system that was needed for the Kinect sensor. So they actually used simulation and
other machine learning extensions to create a larger training set—just like we did with letters and
handwriting recognition—until we had close to two billion training samples which were fed into a
machine learning service to then actually create the Kinect sensor. And that's actually just a subset of
the information that comes off the Kinect sensor. It's really striking how much information it's pulling:
how hot you are, whether your blood pressure has gone up, whether you're perspiring—so they
can actually predict how many calories you might be burning.
Again, it was all treated as a machine learning exercise.
So as we think about some of the problems we face in science, where we're awash in data, and the
chance to perhaps label that data and say, "This person has cancer; this person is developing this"—if we
get enough data, could we treat these as machine learning problems? That's a very compelling value
prop. So I think—you know—machine learning allows us to solve extremely hard problems better,
extract more value from big data, and really drive a shift in engineering culture. You don’t have to get it
right once, but how do you set up a capture cycle, retrain cycle, and a loop—a feedback loop for your
system—so that you can actually start building that corpus of training data that you need?
But let me talk a little bit more and transition a little bit into data science, and talk about a more recent
problem we had to face. And this was, basically, hosted exchange for Office 365 in our Dublin
datacenter. Once again, even Microsoft waded in with, basically, a handwritten rule set—
about five hundred rules—that monitored the servers, the disk drives, the CPU,
the network. And the operators promptly turned the alert system off, 'cause it was wrong most of the
time, raising false positives, and actually missing the really bad stuff that brought us down for
periods of time. And so what they turned it into is basically a data capture, where they captured time
series data coming from each and every one of the sensors, and when an error occurred, all the
operator had to do was basically—since they knew exactly what went pear-shaped—label it, press a
button, tell us what happened. We'll capture all this data; we'll send it off to a machine learning group;
and basically, from hundreds of thousands of machines and hundreds of metrics, figure out which signals
actually correlated. And we actually built an early warning system out of it; the previous system's been
completely discarded, and the machine learning system provides the dashboard, and when new events
that were not anticipated occur, we actually build a new training set for them.
But that’s really not the whole story. There’s really something going on here behind the scenes, and
that’s about data science. If you look at all the applications that I’ve just described to you and some of
the more modern applications across the company, there hasn’t been really—with the exception of
deep neural networks—any real advancements in machine learning. They’re the same algorithms. Yes,
we have lots of data; yes, we have scale; but what's really coming into play here—'cause again, these are
kind of the usual suspects—is this: there's a survey of twenty-five hundred predictive and
machine learning applications that have been deployed in the industry, and the usual suspects of the
top eight algorithms are surfaced in most of them. So it’s not like the machine learning technology’s
changing, but what is happening: because of more data and our ability to pull better features, in the
right hands, a very simple algorithm can actually have incredible predictive power. This is where the data
scientist comes in; this is where somebody who can actually make sense of the data comes in. Figure out what to
do—we didn't actually capture each and every point; we used time series smoothing; we created
features from this: the high and low for a ten-minute window—to create a good set of features, so even the
simplest linear algorithms have a chance at actually giving a prediction that the system is
going pear-shaped. That's the role of the data scientist—that's what I'm trying to get at.
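[Editor's aside: a minimal pandas sketch of the feature engineering just described—smoothing plus the high and low over a ten-minute window—so a simple linear model has a chance. The sensor column names are made up.]

    # Hypothetical sketch: derive window features from raw server telemetry.
    import pandas as pd

    def make_features(telemetry: pd.DataFrame) -> pd.DataFrame:
        """telemetry: one column per sensor, indexed by timestamp."""
        window = telemetry.rolling("10min")
        return pd.DataFrame({
            "cpu_smooth": telemetry["cpu"].ewm(span=20).mean(),  # smoothed signal
            "cpu_high": window["cpu"].max(),    # ten-minute high
            "cpu_low": window["cpu"].min(),     # ten-minute low
            "disk_high": window["disk_queue"].max(),
        }).dropna()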
This is not something new, by the way. Again, I don’t know how many of you have tracked this, but
basically, data science: probably the first real mention of it was Bill Cleveland back in Bell Labs—he
actually wrote a data science action plan back in 2001—and he was proposing a curriculum program to
get this woven into academic institutions. JISC did a report in 2008, and NSF had a report as well in 2005,
all talking about data science, primarily in the context of the sciences, not business. It’s just business has
been the one that’s picked it up and embraced it. And there are a couple of journals, so this is not … this
data science thing is not really new. It’s just been coined in the last few years—the term and the buzz—
but this has actually been talked about, thought about, and in fact, Bill Cleveland’s original plan is
actually still one to follow in terms of how to build a curriculum program for data science.
But there’s something going on here. I mean, I want to talk about the complexities that these people
face and think about tools we should build to support this. You know, it's a very different
workflow—it’s not BI—you have to kind of define an objective function. Am I trying to predict failures?
Am I trying to predict churn? And what’s a proxy in the data for that prediction? ‘Cause guess what?
There's probably not going to be a column saying, "I'm going to fail" coming out of your machine.
There are probably a number of columns that you're looking at which are indicators of the health of the
machine—similar for patients or similar for anything you’re analyzing. You have to get smart at looking
at your data, saying, “What’s the objective? What’s a proxy in my data that I’m really trying to predict?”
Then get lots of data, and you have to realize you have to throw some data out; you have to
massage it and create features; you have to sample, create moving windows—I mean, you have to bring
that data down to something that the machine learning algorithm can make sense of. We spend a fair
bit of time talking about that in our data science program: what are the common patterns that
one sees? And then finally, frame it as an ML problem. Are we gonna treat it as a classification—
binary or multi-class—a regression, a ranking? Are we just gonna do time series forecasting, or maybe
some combination of it? This is what data scientists have to do; they have to be able to shape a
problem from raw data into something which is tractable.
Finally, you pick a machine learning algorithm and build a modelling dataset—which may be wrong—and it's a
highly iterative process. So once you’ve got the data, and you build a training set—be it for the
handwriting recognition or a server about to fail—you do some feature selection, which means you’ve
gone from a big number of columns down to something relatively small, and you train a model. And
then you say, "Okay, well how'd that model do?" You hold data out; show the model some data it's never seen before
and see if its predictive accuracy is high. Evaluate that, and—you know—maybe you have to go all the
way back and redefine the objective function or go get more data. This is a highly iterative process
today—you know, a lot of experimentation. In sum, if you step right back, what we’re really teaching
people is how to weave experimentation and data together, and get very used to running experiments
and having a hypothesis over data.
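[Editor's aside: a minimal scikit-learn sketch of the loop just described—hold out data the model has never seen, train, evaluate, and go back if the accuracy disappoints. Synthetic data stands in for a real training set.]

    # Hypothetical sketch: the train / hold out / evaluate cycle.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)  # the "pop quiz" holdout

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"holdout AUC: {auc:.2f}")  # low? redefine the objective or get more data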
And again, as we’ve talked to data scientists, they have their favorite algorithms; they like to choose
between them; no single one is good enough; and in fact, a single algorithm in the right hands can
probably do it all for an individual who knows how to make that algorithm jump through hoops, but you
talk to a different group of data scientists, and they’re gonna want different techniques. So you really
have to give them a variety of ML algorithms. And then finally, it's about experimentation. Just because
you have a good dataset that predicts well—well, you know what? If you chop the dataset slightly
differently into a different training and testing set, the results might go pear-shaped, or they might be just
as good. And so this is really about running experiments and trying out new ideas really fast,
over and over again. These are the elements of data science: a very distinct workflow, having an
intuition about what carries signal in the data and pulling that out, trying it out, trying different ML
algorithms, and running an experiment very fast—to actually see if it has predictive capability.
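[Editor's aside: a minimal sketch of that fast-experimentation loop—same split, several candidate algorithms, compare the holdout AUC side by side. The candidates are illustrative choices, not a prescribed list.]

    # Hypothetical sketch: try a variety of ML algorithms within minutes.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=25, random_state=1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

    candidates = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "random forest": RandomForestClassifier(n_estimators=200),
        "boosted trees": GradientBoostingClassifier(),
    }
    for name, clf in candidates.items():
        clf.fit(X_tr, y_tr)
        auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
        print(f"{name}: AUC = {auc:.3f}")  # same data, different learner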
And for those enterprises who are looking at this—and I’d even argue that the labs who are looking at
this—are going, “Holy cow. First off, we have a shortage of people who know how to do this.” And we
as Microsoft are looking at this and saying, “Hey, this is just too complex today. We need to think about
how to make this simpler.” I know some of you in the research—I’ve seen a few research proposals—
are thinking about how to make this simpler and build tooling. So what I’d like to show you today is
actually something we’ve been working on—because again, I’d really like to get my e-scientists or my
data scientists out of software development; if I can minimize how much code they have to write, I can
increase their productivity. I need to give them tools for data exploration—that’s how you get the
insight: “Oh, that feature: that’s an interesting feature; it’s actually correlated with the target I’m trying
to predict. That's good; I'm gonna try it." You know, I want to make that available, but you shouldn't have
to learn five different stats packages, 'kay? Machine learning: if you talk to people who really do this for
a living, it takes two or three packages to have the right algorithms to have that breadth—I’d really like
to make that all in one place—and then, good experiment and data management—if we could actually
think about services to provide that, then I think we could actually put this in the reach of mere mortals
who understand the science in the problem and can actually go through and do that.
So let me show you something we’re working on, just to give you a glimpse and maybe some inspiration.
Make this really quick. So if you have a web browser, one of the first things we’re trying to say is you
shouldn’t have to install any software; you should be able to go to a cloud service, where your data can
be uploaded, and you can have scalable storage—and I could have used Chrome or anything else. And
the first place I land in is a repository for all my models; it's the project space that I'm
working on. And I actually may be working on multiple project spaces, and I can choose between them
and go over to my colleague's, who's invited me to help him solve a problem. So if I've got two or
three projects I’m working on, I should be able to actually snap between them fairly seamlessly. And in
fact, if I decided that there’s another PI I want to invite, I should be able to go over here and say, “You
know, I’d like to actually take someone and make an invitation,” and actually take your e-mail address,
type it in here, and say, “Okay, join me.” And you’ll get an e-mail in your box, and with your browser,
you could join me here, and together we could start working together on the modelling. Because really,
data science and e-science is really about collaboration and pulling the right people in. But let me go
through, and I realize I’m somewhat time-constrained, so in the interest of time, let me show you just
how fast we can build a model together. If I can get a dataset up there, I can say new
dataset, and I can pull it in from my local file system—'cause maybe someone's just given me a sample of
data—or I can say I want to do a new model, start one from scratch—we call them experiments, 'cause
again, data science is about experimentation. If I wanted to read data from Azure storage, I don’t have
to know about how to talk to Azure storage; I can just write my credentials in here and write the query.
If I wanted to read from Hive, I can type in … just select Hive, type my Hive query in here, and read data
from Hive. If I wanted to do it from SQL, I could just cut and paste my SQL query here, and this thing will
handshake and run a protocol with SQL to pull the data into it. That just takes a lot of programming off my
plate. Or I could do something like just—here’s a dataset that I’ve got up here that I’ve already
uploaded—do something really simple. And if I wanted to look at that dataset …
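[Editor's aside: a minimal sketch of what "cut and paste my SQL query here" replaces if you do it by hand—connection plumbing plus the query. The server, database, and table names are placeholders, and the driver string is one common ODBC choice, not the service's actual mechanism.]

    # Hypothetical sketch: pulling a dataset out of SQL into a data frame.
    import pandas as pd
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver.example.com;DATABASE=sales;UID=user;PWD=secret")
    df = pd.read_sql("SELECT * FROM bike_buyers", conn)  # paste your query here
    print(df.shape, list(df.columns))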
>>: Question …
>> Roger Barga: Yes, please?
>>: So are you downloading data into this environment from the cloud or …?
>> Roger Barga: Yes, yeah.
>>: Okay.
>> Roger Barga: So I can pull … I can even pull it from on-prem, and if it's a big dataset, I can run samples
and pull a sample up here. As for the data scientist, you want to give them access to all the data
that they need for their job. And if I double-click on that—I've just pushed a dataset here—there's this
beautiful client-cloud connection; I can actually open a local client. And what we have here—we’re
gonna keep this in the e-commerce business space, not the science space—I've got two years of
whether somebody's purchased a bike from me or not. And here are all the attributes—and it could be
whether or not somebody got sick, or whether they churned from my company. It all comes down to
something I'm basically trying to predict, and then features I have over that individual. And the real
question is: can I build a model that has predictive capability?
So let’s just go ahead and close the Excel sheet, and—you know—I might be really curious about: well,
what's in that dataset? Any missing values? Do I have any mess to clean up? And notice
that, basically, it just kind of tells me what I can connect up, and if I hit run, this is the first time I’m
actually doing any work on the cloud. I send this up—I send that DAG up to the cloud service, which has
a VM pool standing behind it—it's done; it's created HTML5; I can click on this … come on; and there
are my columns that I was showing you: bike buyer, the columns, how many unique values they have. I
can see that there are no missing values … oop, there are a few missing values; I'm gonna have to clean
some of those up. And so I can basically pull in a missing-values cleaner. So again, these are the kinds of things
that you learn that you need to clean up. Go ahead and just run the dataset through this, and I can
actually say replace with the median. That's good enough cleaning. And now, if I wanted to actually train a
model, I’ve got to build a training set. And I can tell it what percentage I actually want to use for my
training set, and if … for those in data science or machine learning, they know stratification’s sometimes
a big deal; we’re actually gonna stratify based on some point in time or a column. So all my stratification
procedures can be driven through a drop-down window as well, which I won't bother with at this point—I'll
actually keep that false. But I will train a predictive model; I will score it; I'll send the training set to the
trainer; I'll choose an ML algorithm; I'll try a very early one—a simple neural network; we'll see if we
can’t improve upon that in a second—and the trained model comes down here to the score; and now,
there was thirty percent that this model’s never seen, and we’re gonna give it a pop-quiz—we’ll just run
that down here. And this guy needs help—he needs to know, basically, what column
we're trying to predict. So basically I type bike buyer, and assuming the demo gods are kind and our service
team refreshed the service overnight, I click run, and this should go through and actually train a
predictive model for me. One thing I'm gonna try to do here really
quick … I didn't add a metrics module; I didn't build the nice graphics, but let this run for a second. It's training
right now, splitting up the data. Now the thing about this DAG is: what I don’t know is maybe I pulled a
Hadoop job in; maybe I’ve pulled a database query in. The runtime behind the scenes actually moves
data across these different runtimes … there’s a question.
>>: Is this all coming from a single VM or across several nodes?
>> Roger Barga: Pool of virtual machines. So we actually look for any parallelism we can exploit, and we
exploit it. And in fact, we've got some scale-out Hadoop algorithms. I need to add a binary-classification metrics
evaluator—something that actually makes some nice graphics for me. So assuming this model … and if I
hit run, a couple things to note: only the things that changed get re-executed,
'cause it's my experiment platform—I may just be working on training one or two
modules. And in fact, I won't get into it now, but I could go all the way back with lineage tracking and
track from the very first run. If I click on this, that’s my ROC curve, and so for those of you in machine
learning, you know that that diagonal is what pure guessing would look like. So I'm doing a wee bit
better than guessing, and this is my so-called confusion matrix; it shows me where I'm getting
things right and my type one and type two errors. And if I was really curious, I could actually click these
and see the actual tuples that I'm getting wrong, or I may just want to try another idea—and this
is about the experimentation.
You know, I know that boosted decision trees are actually very popular and can be very
accurate. So let me just copy and paste a subset—I'm gonna replicate the pipeline. I'm not
training a perceptron this time; I'm gonna train a boosted decision tree. And I've just shown you the ability,
basically, within a few mouse clicks, to try another experiment and hit run. This is the
experimentation part; I should be able to try every ML algorithm that I know of within a few minutes on
this dataset to see if I can't improve the performance. And then, I'd start doing feature
engineering—but I can basically run a hundred experiments in a day, and when you can put data
scientists into that zone of immediate feedback, things start to happen. You start to get models and predictions
built. And this is done … almost done … come on, build that nice graphic.
That's much, much better—in fact, a good approximation for how this model's gonna
perform; the area under the curve is eighty-six percent. This model has predictive capability, so if
I’m trying to predict whether someone’s going to get sick, or whether this is a certain anomaly, the
ability to have this … and it's highly interactive: I may want to run this model at different confidence
levels, because maybe I don't want any false positives, or maybe I'm okay with a few false
positives—and see how the confusion matrix changes. If I'm actually sending
fliers out to somebody and wasting a quarter, I'm okay with a few false positives, so I'll go ahead and
operate the model very aggressively. False positives go up, but false negatives go down; I
don’t miss anybody. But if I’m about to tell somebody they’re getting sick, I may want to be very careful
with the false positives, and really, really try to push this model down into a conservative zone and run
these people through another test, just to make sure they’re not really sick.
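[Editor's aside: a minimal sketch of operating one trained model at different confidence thresholds, watching false positives and false negatives trade off—aggressive for cheap fliers, conservative for medical alerts. The data and model are stand-ins.]

    # Hypothetical sketch: same model, different operating points.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, random_state=2)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)
    scores = GradientBoostingClassifier().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

    for threshold in (0.3, 0.5, 0.8):  # aggressive ... conservative
        preds = (scores >= threshold).astype(int)
        tn, fp, fn, tp = confusion_matrix(y_te, preds).ravel()
        print(f"threshold {threshold:.1f}: false positives={fp}, false negatives={fn}")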
This is the kind of agility we’re trying to bring to data science. And what’s more is we … if you look at
what we’re thinking about as a group—together with our friends in Microsoft Research Connections and
elsewhere—what if this was a domain-specific library? What if the modules—the datasets that you can
read naturally—are those that are specific to you domain? So if you want to pull a dataset into you
domain, we have a reader for that, or we have a module for it. Or starter algorithms kind of represent a
starting kit for your domain, whatever that might be. And how do you build templates so people can
come in and have a workspace that’s being shared across an entire community? The datasets are
already there, or there’s readers for those datasets, or readers from the web services that the
community’s already stood up, and you can start building libraries of models that that community can
share as a research community and kind of build up over time. So we've already got a couple of groups
that are doing exactly that. We have an SDK that allows you to basically build some of these
modules yourself—like the ones I'm showing you here—to actually read from data
sources that you're familiar with. Techniques: if you already have analytics algorithms—R packages
you’re using—we allow you to actually wrap up your code and actually create reusable modules that
other people can actually start using. So for those of you in workflow, it’s very much like workflow for
machine learning and data science, all at cloud-scale.
So that's what I wanted to show and share, just for thoughts. And for those of you who are really interested
in predictive analytics—which is kind of something that I’m really interested in right now—I’ve brought
you a reading list that you can have a look at and read some books—I think—which are great starter
books. With that, I’ll stop and field any questions people might have.
>>: What’s the status of that project?
>> Roger Barga: So right now, it's being used internally for a number of our internal projects.
We're actually doing machine analysis for our GFS datacenters and churn analysis for Dynamics, and we're
thinking about how we can bring this forward, both for researchers and also for our customers as well.
Yep. We are in fact thinking very intentionally—I'm being very conservative about this, but
maybe at Faculty Summit, we might have more to share—about how to both have free access to this for
academics along with a curriculum around data science—slides that kind of show how to actually use
this in a data science curriculum—and free access for academics to be able to use it to teach data
science. But like I said, we also want to take the SDK and really drive it home for verticals. 'Cause as a cloud service,
it can be available to researchers anywhere who don’t have any more than a PC or a tablet—sometimes,
I give demos on an iPad and just do the connections and run the models; it’s kind of fun.
>>: So like, if instead of … so suppose I wanted to predict one image …
>> Roger Barga: Yes.
>>: … from another.
>> Roger Barga: Yeah.
>>: I can adapt …
>> Roger Barga: So GE …
>>: Repeat the question.
>>: [indiscernible]
>> Roger Barga: Oh. So to repeat: the first question was on availability, for which I
think the answer was self-explanatory. The second one was basically about images,
and we in fact have a company—one of our inter … our preview companies that’s previewing it for us
externally—is actually taking two-dimensional images of tumors, and they’re trying to classify them.
Now, where did they use our SDK? They actually take those two-dimensional
images and turn them into a feature vector—that was their intellectual property—everything
else, they just wanted bog-standard ML, like most companies do. So with the SDK, they took their code for
actually looking at images—reading them in from blob storage—and converted them into a feature vector that
captured all the features of that image. And now they're running large
volumes of cancer images to get a model that will predict a cancerous tumor versus noncancerous by its
shape, its attachments, and other features that they were able to get off of it. So yeah, very applicable.
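[Editor's aside: a minimal sketch of the division of labor just described—a proprietary featurization step up front, bog-standard ML behind it. The image-to-vector function here is a toy stand-in for that company's intellectual property, and the data is random.]

    # Hypothetical sketch: 2-D images -> feature vectors -> a standard classifier.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def image_to_features(img: np.ndarray) -> np.ndarray:
        """Toy featurizer: reduce an image to coarse intensity statistics."""
        return np.array([img.mean(), img.std(), (img > img.mean()).mean()])

    rng = np.random.default_rng(0)
    images = rng.random((200, 64, 64))   # placeholder tumor images
    labels = rng.integers(0, 2, 200)     # 1 = cancerous (made-up labels)
    X = np.stack([image_to_features(im) for im in images])
    model = LogisticRegression().fit(X, labels)  # standard ML takes over here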
Yes, please?
>>: Are you building collaborative features into that? So like, you share that workflow with me, and I
want to come look at it, and we can sort of chat. I’m like, “Hey, you should really add in this other type
of visualization here.” So that …
>> Roger Barga: Yes.
>>: … you know, this SharePoint web interface becomes sort of this scientific gateway.
>> Roger Barga: We are, and in fact, and we draw a lot of inspiration—for those of you who are familiar
with the myExperiment project, which I think pioneered the way of sharing DAGs, sharing workflows—
combined with, basically: how could I annotate? For example, how could I annotate a model and say,
"I'm stuck here," or "I've cleaned the dataset in this way"? Think sticky notes you could apply to
datasets to keep context. You've logged off for the night and gone to bed; somebody else gets up; they
log into the workspace; they see not only what you changed, but also the context—the notes you're
leaving behind for them. We're even stepping back and thinking about a social site we could add, so people
can actually just blog and ask a question, like “How the heck do you use this dataset or this model?”
And other people who aren't even in your workspace, seeing what you're doing and sharing, can
actually ask questions as well. ‘Cause we think it’s new enough—we think there’s enough people
interested—but it’s new enough that people need a large support network. And so if we could light that
up around this service, we think that would actually change things. Yes, Bill?
>> Bill: The number of boxes that you need, I can imagine, sort of converges
downstream of the pipeline. The number of different algorithms out there doesn't grow all that
fast, so you can get a nice library of 'em and get a nice [indiscernible] provision. But—sort of—the
upstream stuff—I mean, the data cleaning, and feature engineering, and ingest kind of phase …
>> Roger Barga: Yes.
>> Bill: … seem like each problem might need a new box to be …
>> Roger Barga: Yep.
>> Bill: … created. So—you know—an SDK plus C sharp, I guess, is the sort of approach right
now. Is there room to sort of make it easier to create new boxes, or is there different support than just
what Visual Studio offers?
>> Roger Barga: Yeah, that's a good question. So some of the stuff we're working on—again,
we'll see if it's still up—is, basically, apply-R. I can bring up an arbitrary R box at any
point in time, 'cause you're right: feature creation, and transformation, and data munging is something
where you're never gonna put the genie back in the bottle—you've really got to let people use Python, R, and write
their own scripts, but then say, "Now that's reusable by everybody." In fact, we even have a drop-down
menu of things that they can do, and we log them. That becomes the black box: if you run
another dataset through, the redo log applies to it. There's something called Monaco Tools, which is an internal
prototype we’re doing, which is kind of a pop-up C sharp, F sharp—you know—interactive; you can type
your code in, so if you’re more proficient in one of those, we’d like to pop one of those up and allow you
to change your data that way, but then save that module so other people could use it. I'd like to keep
people from having to open Visual Studio—it's scary for a lot of people—I don't even like it
anymore after having played with tools like this for years. Please.
>>: How do you see something like this playing into, like, reproducible research? I mean, can … you can
take snapshots of that, and save the data away, and share that in sort of like a myExperiment kind of
way.
>> Roger Barga: Yup. So one of the things for reproducibility—and again, my service is unhappy right now
for some reason … okay—is that, basically, I can take any one of these models, and if I decide that it's the
gold standard on which my lab is gonna be based, I can lock it and clone it. I can actually go back—in
runs—and actually do lineage tracking for what people did at each step, and guess what? We’ve
persisted the data at any one of these pins right here, so if you wanted to say, “Well, exactly what were
they looking at at that point? What prompted …?” So now, this comes at a cost; so we allow scientists
to turn it on or off, or researchers to turn this saving off or throw the dataset away, but I want to give
you the complete ability to go back for three months of research and say, “Dang it, I’m gonna fork off of
that,” or “Now I can see exactly what they did to the raw data, how they ran their experiments, their
thought process," to the point where you can actually publish it and give that to somebody
else—the whole model. So I see that my handler appears ready for our next speaker. So thank you very
much for your time and attention. [applause]
>> Wenming Ye: So thanks, Roger. Roger's from our cloud
machine learning group, and he's the principal manager there in the cloud. Our next speaker
here is Doctor Marty Humphrey, and we've been working together for quite a long time with the Azure
for Research program. Doctor Humphrey here has been experimenting with one of the tools that we
built here at Microsoft Research, called FetchClimate, and he's gonna talk to us about how you do
large-scale visualization on environmental datasets. And Marty, it's over to you.
>> Marty Humphrey: So I was—thank you, Wenming—I was just checking to make sure that the system
was responding, so I could actually give the summary right now. So in summary, this is what we built.
[laughter] [applause] Okay, fine, I'll give you the talk. [laughter] I'll return to that. We've had
the pleasure of many years of working with Youngryel—he was at Berkeley; he's at Seoul National
University now. He's largely the driver of the science; I'm a computer scientist, and so I tend to not
know what’s going on, but I’m used to that, so … so to give you a little history of the project, we’ve been
doing this since 2009—a fair number of you have seen various people talk about this, so I’ll zip through
the beginning material.
It started with my student going to Microsoft Research for a summer internship; he met various people
at Microsoft Research, notably Catharine—who was driving a lot of the science behind this—and Dennis
Baldocchi from Berkeley as well. What they were trying to do was do large-scale environmental analysis
based on satellite imagery. The satellite was MODIS—you know, if you have any questions about
MODIS, we have the expert in the audience, Geoff, so ask him anything you want. Essentially, these
things are spinning around the Earth producing imagery, and there was a desire to combine the different
products into some sort of analysis of what was going on with regard to the Earth. Youngryel was
particularly interested in evapotranspiration, which is shown here. For years, I referred to it as the
breathing of the Earth, until one of my students actually corrected me and said it’s more like the
sweating of the Earth—which I liked; I don’t know if it’s true. Geoff can correct me later.
>> Geoff: Yeah, that’s good. Yeah.
>> Marty Humphrey: Sweating? Okay, so [laughter] the nature of the problem statement was: can we
do some large-scale analysis based on the satellite imagery, and focus on evapotranspiration among other things
as well? So the nature of the problem statement was: we're gonna have these satellites that are
spinning around the Earth producing a whole bunch of pictures, essentially, and we're going to divide
the Earth into a grid of horizontal and vertical cells, and we're gonna try to focus on analyzing what's
happening in each of the regions of the Earth—and this is very much a lots-of-data, data-parallel problem. And
so the emphasis in 2009 was: can we actually use the new Microsoft cloud as the basis for doing the
large-scale data analysis? So the group, which I was only briefly involved with at the time, created a data
pipeline and analysis system inside Azure—the burgeoning Azure system at that point—and the nature
of the analysis was to download a whole bunch of raw data files, primarily from NASA, combine them
inside Azure in a two-phase analysis. The first phase was sort of data preparation, in which the raw data had to be
reprojected onto a different gridding system, and the second phase was a large-scale analysis
combining the different products from the satellites to create some sort of picture of what was
going on, again, at Earth-scale essentially.
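[Editor's aside: a minimal sketch of the two-phase, data-parallel shape of the pipeline—prepare each grid cell's raw data, then analyze the cells independently. The cell grid and the phase functions are placeholders, not the project's code.]

    # Hypothetical sketch: Earth as a grid of independent work items.
    from concurrent.futures import ProcessPoolExecutor

    def reproject(cell):    # phase 1: put raw swath data on a common grid
        return {"cell": cell, "grid": f"reprojected-{cell}"}  # placeholder

    def analyze(prepared):  # phase 2: combine products, e.g. estimate ET
        return (prepared["cell"], f"ET-for-{prepared['cell']}")  # placeholder

    if __name__ == "__main__":
        cells = [(i, j) for i in range(10) for j in range(20)]
        with ProcessPoolExecutor() as pool:  # data parallel across workers
            results = list(pool.map(analyze, pool.map(reproject, cells)))
        print(len(results), "cells analyzed")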
And so behind the scenes, we architected it to … initially to be running entirely in Azure. We since
changed that a little to have it be able to run within our enterprise as well and also burst onto Azure as
necessary. So we re-architected it at some point—basically our head node was in Azure, and we had the
ability to come back onto our enterprise cluster or also to burst onto Azure itself when we had the need
to do more analysis than we had physical machines within the enterprise to do. And of course, we had a
rudimentary web interface that you could submit your analysis jobs, and they’d land on the various
machines in the back-end. This is the picture from the head node that’s running inside Azure.
So at this point—couple years ago—the system was working; we were doing a reasonable large-scale
analysis—a lot of hand-holding of watching the software, making sure that it ran okay—but basically, we
were producing the data we need. This is from Youngryel in particular; he produced this visualization—
again, this was a couple years ago—and the real problem at this point was we had to suck all the data
out of Azure and bring it down onto the desktop, and then create the visualization via
MATLAB. And this was … the decoupling of the data generation and having to pull it down from the
cloud was kind of a real pain, and so we tried to look at how we could make it more end-to-end. And so
around that time, Youngryel in particular was approached by Microsoft Research, saying they had this
really interesting new tool called FetchClimate, and could Youngryel try to integrate his data generation
with this new tool. And so that was the real challenge that I’m really trying to talk about today: how we
could take that data and, instead of pulling it out of the cloud and doing the visualization on
our laptops ourselves, do it as a service and keep everything within the cloud itself? And there
were a lot of computer science issues in getting this all working.
So this is the—a little bit jumping ahead, and then I’ll come back into the gory details—but this is the
interface that the user is presented with. This is a service that’s running inside the cloud; we stood it up
at MODIS FC2; it’s live now—this is a snapshot of the system I was showing a moment ago. It has a
series of dialogues that you can go through to, in effect, tell it what data you want to visualize, when you
want to … over what period of time, the region of the Earth, and then, essentially, you ask it to go and
produce the data. So this is the “when” dialogue, and at this point, we would specify that we want to
visualize this particular region. I jumped a little too quickly, but essentially, what I want to do is: from
2012, I want to do a monthly average of this particular region. And then, you’d hit the “go” button, and
it produced a visualization that I was showing a moment ago, which again, I’ll return to in a second.
I wanted to jump into some of the nice features of Azure that we exploit to be able to do some of this.
This is the console—I’m sure many have seen this kind of console before if they’ve run stuff in Azure—
this is the console that we use to control the primary service that’s running inside Azure, that’s doing the
data aggregation and visualization. And there's a particular aspect of the UI that we've been
using—which is both good and bad—which is the ability to manually scale it in response to the
anticipated load that we're going to be generating. It's a really nice feature to be able to do this
dynamically. And so for example, this is just showing how we’re setting the number of cores that are
running on the data analysis—and again, we would set this if we had a lot of visualization that we
wanted to do, and then we could crank it down as well when we are in a relatively more idle period.
So I wanted to show you a couple snapshots of it sort of at work. This is the region for all of 2012 … this
is a particular region that we’re looking at—and again, this is the user interface from FetchClimate—and
so we’re really happy that the system is working right now—we have it integrated with our particular
data analysis—and so down here, you can select the different month that you want to see. And as you
can tell, in January, this is trying to show evapotranspiration; the visualization is, in effect, saying in
January, the Earth was not really sweating. And so if you moved it out into June, you can see the Earth is
visually starting to sweat a lot more. This is somewhat obvious to this
particular community, but the real power of this is to easily and visually be able to determine how the
Earth is changing; this is a particular year—this is 2012—but we've done some analysis—the data
has been generated by MODIS satellites since essentially the year 2000. And so we even want to look
at the short time period of the year 2000 to the year 2012, and visually, can you see
the Earth changing? And being able to visually assess this is really particularly good for, say, the less
scientific amongst us. I guess that’s a politically reasonable way to say that.
The system has the ability to deal at different granularities, and I just wanted to show you how you can
dive in with different granularities, controlled by the number of cells in the x and y region, and I just
wanted to show you visually how that would change. And so this would be the very coarse-grained
analysis—the obvious negative of it is how coarse the visualization is, but this would come back fairly
quickly—and as you get to a finer granularity, of course the queries take longer, and the
service would be pulling in more data and doing more data analysis, but I wanted to show you how
deep we could get in with regard to granularity, and I think I have a couple more. So this shows a really
fine-grained analysis of what's going on. Our raw data is at the granularity of one kilometer; I'm not sure
what the granularity here is. And then, a couple more.
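[Editor's aside: a back-of-envelope sketch of why finer granularity costs more—cell count grows with the inverse square of cell size, so the one-kilometer raw data implies hundreds of millions of cells. The surface-area figure is approximate.]

    # Hypothetical sketch: cells to fetch and render at each granularity.
    EARTH_SURFACE_KM2 = 510_000_000  # approximate

    for cell_km in (100, 10, 1):  # coarse ... down to the 1 km raw data
        cells = EARTH_SURFACE_KM2 / cell_km**2
        print(f"{cell_km:>3} km cells: ~{cells:,.0f} cells")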
So some more on the architectural details of what we need to do to run inside Azure to get the
visualization working: the final system that we had running—that’s part of what’s shown here—is partly
development and partly production. So we needed three virtual machines and up to two hundred cores
for dealing with the data processing on the front-end. We needed seven cloud services, three SQL
databases, and ultimately—the data that we’re looking at now is all of 2012 for each of those regions—
and so we have about two point three terabytes in Azure right now, just sort of waiting to be sucked in
and visualized.
I wanted to show, briefly, a little more of the architectural details within FetchClimate. Here, I make the
snide comment about the color, but we’ll just cut some slack to the Moscow State people. But this is
the architectural details of what's going on with regard to FetchClimate, and essentially, the takeaway
here is that it's a fairly complicated system—a lot of moving parts. So the title of my talk was
“Experiences,” so this is where I start getting into the more negative aspects of what we had to
encounter. At first, I decided to call them hiccups in the whole development process, but then I
realized: these weren’t hiccups, these were actually research opportunities. [laughter] So here, they’re
research opportunities, not hiccups.
As you can imagine, the theme of the difficulties in integrating the systems was essentially all because of
scale. Things go wrong. The actual science that we were doing was very data-dependent; we were
running the analysis very, very many times, and each computation required a
different collection of input data. We designed the application to be—you know—resilient to missing
data, but then the negative aspect was that, on each particular invocation,
the application was completing, but it was difficult at times to determine whether
the science output was actually wrong or whether it was in fact performing well given the lack of
input data. And so at scale, determining if a particular computation is scientifically valid can be very
difficult. Arguably less interesting: when you're doing so many things inside the cloud,
you can lose track of what data is being held where and what's being replicated
by mistake, and so we had many, many files, and we weren't sure where they were at particular times.
Small times large equals large, especially economically, but in particular, these last two were really sort
of indicative of what was going on.
We had a particular analysis code that we designed to be correct, and then we ultimately ran it,
for this particular computation, about seventy thousand times, and it was very difficult at times to track
that and figure out what was going on. Alongside that application, we
had a collection of auxiliary programs as well, and so the cluster that we had inside of Azure that was
doing all the data analysis: it was just a normal Windows head node. Ultimately, to do the analysis, we
had five hundred and eighteen thousand jobs just to do that, and sort of keeping track of five hundred
and eighteen thousand jobs was not the easiest thing in the world to do. The head node kept track of
them, but for us humans, that was a little more difficult. And then another issue … this sort of
falls into the category of: be careful what you wish for, because we had a visualization system; we
hooked up all the plumbing; and it was basically working—we were able to visualize the stuff that we
couldn’t do before—and so that was very nice. And so we just said to ourselves, “Hey, it’s working.
Clearly, we can just throw as much data as we want at it, because it’ll just keep working.” And so we got
into this phase of just throwing more data at it, and then we sort of said to ourselves, “What do you
mean? How can we not visualize the entire Earth at one-kilometer in less than five minutes?” Those …
that we just sort of a … it went from prototype, small-scale, working really nicely, and then we just sort
of made the huge leap into: okay, let’s do the Earth, and what’s going wrong?
So we have a number of things that we want to address, and which we're starting to do now. And there's
a theme here: what we need to do next to really get it working is to take this huge system—
which is really working nicely, but perhaps unpredictably with regard to duration—and
figure out what's going on, and be able to predict how long certain things are gonna take. So
there’s a whole category of interactive dialogues we want to create into the system that are essentially
… the theme of it is: do you want to continue? And so for example, we want to facilitate the scientist
being prompted with—you know—you’re about to request us to do something that’ll take two hours;
here’s something that we could generate for you in a minute that’s very close; is this something that
could satisfy you while we go off and do the two-hour computation? Because at scale, sometimes this
really takes that long. So matching the intentions of the query to the system with ones we’ve already
done would be a nice thing to do. We also want to be able to predict how long certain things, in
general, take. That computation I was checking at the start—which I'll put up again at the very end—I
was checking in part because I did not know how long it was gonna take, and so I had to see
whether it was actually gonna come up in time for the talk. And so we want to be able to compute how
long certain things will take prior to their being executed.
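[Editor's aside: a minimal sketch of the "do you want to continue?" idea—estimate a query's duration from the work it implies before running it, and offer a cheaper approximation when the estimate is long. The cost model and its constants are entirely made up.]

    # Hypothetical sketch: predict duration, then prompt before a long query.
    def estimate_minutes(n_cells, blocks_per_cell=4, secs_per_block=0.01):
        """Crude cost model: duration scales with storage blocks retrieved."""
        return n_cells * blocks_per_cell * secs_per_block / 60

    def confirm(n_cells):
        minutes = estimate_minutes(n_cells)
        if minutes > 5:
            print(f"This request will take roughly {minutes:,.0f} minutes.")
            print("A coarser result is ready in under a minute. Continue? [y/n]")
        return minutes

    confirm(510_000_000 // 100**2)  # whole Earth at 100 km granularity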
We also want to connect the visualization back to the data analysis in the first place. Notice we had
these two decoupled systems, where we did a whole bunch of data analysis and then visualized it
separately. Now we've connected everything, and so we want to be able to drive
from the visualization. In effect, the user wants to look at this particular data, and we want to be able
to, at that time, come back to our dataset and say, “Well, we don’t have that, and so if you want to
really do that, it's gonna take us about four hours to go back into the raw data." We also want to do this
because I'm a particularly well-known cheapskate, I guess. If you're willing to spend a
certain amount of money, then we can do it much quicker—we want to fully connect the economics of
the cloud with the scientific aspects. Again, perhaps I’m just a cheapskate, but I’m really interested in
doing that.
And so the summary of the system and the talk is that partnerships between the domain sciences and
computer science I put up here as both challenging and awesome. Youngryel is doing all the science; we
have all the knowledge—or we pretend to have all the knowledge—of what's happening inside the
cloud, and making the connection between the two is particularly interesting. It's a good time for this—Roger was
talking a little bit about this. Things are really cool right now; we can really
hook things together. We tried to do this in both the visualization system and the data analysis system.
And I guess lastly, oddly, we've found that the laws of physics still apply even at seemingly infinite capacity or scale.
I guess we could have predicted that, but we went from very small-scale visualization to very large,
and we sort of thought it was gonna work, and the laws of physics were getting in the way. Trying to get
our heads around that and presenting it as useful information to the end user might be
something that we need to do next.
And so I guess lastly, I’ll just come back to the running system to show you that it’s out there now.
Again, this is a fairly detailed granularity; we can dive in a little more, but this is the live system right
now—this is Internet Explorer, and that’s the URL that it’s sitting at right now. And again, you can sort
of visualize—this is 2012—what’s happening with the Earth in … as a result of the analysis of the MODIS
satellites around this time. So thank you for your attention; I’d be happy to answer any questions you
might have. [applause]
>> Wenming Ye: Any questions for Marty?
>> Marty Humphrey: Geoff … not Geoff, no. [laughter]
>> Geoff: So I … what other datasets go into doing the calculation besides just the MODIS images?
>> Marty Humphrey: I’m anticipating that you don’t want me to say none. [laughter]
>> Wenming Ye: Marty, can you repeat the question please?
>> Marty Humphrey: Geoff is asking a very good question that, unfortunately, I
don't know the answer to. The very good question is: the analysis that is shown visually
here—I've presented it as using only the MODIS satellite data as input—so Geoff
was asking me what other data is used as input, and I don't think any, but unfortunately, I think
Geoff thinks that there must be something else. And he's, of course, probably right.
>> Geoff: Well, but you would …
>> Marty Humphrey: I know what data’s being sucked from what FTP servers, and I know what data’s
on the system when the analysis job runs—I think it’s all from the satellites.
>>: [inaudible] distance, yeah.
>> Kristin Tolle: What I’d like to … is this on? Can you … is this …
>> Roger Barga: It might be a broken one.
>> Kristin Tolle: Oh, okay. Well, here …
>> Roger Barga: Here, use mine.
>> Kristin Tolle: So I’m gonna hop in here and help you answer that question, Marty. [laughs]
>> Marty Humphrey: I appreciate that.
>> Kristin Tolle: So …
>> Wenming Ye: Introduce yourself first.
>> Kristin Tolle: Oh I’m sorry. I’m Kristin Tolle; I’m with Microsoft Research, and actually, I just released
FetchClimate last week to the public at http FetchClimate—all one word—dot o-r-g. So if you want to
test this out yourselves, you can actually use your own instance. The instance that Marty’s showing
here is one that they’ve deployed locally on their own Azure service. You can also go out to our website
and you can download a copy and deploy one—just as Marty has done—because in the release that we
did last week, there’s also a deployment system for you to get from that site. Inside the standard web
service, there are dozens of different datasets that are actually available to you using the web
service version, but if you want to have your own private instance, you have to load up your own data,
which is what, essentially, this project is doing—which is why MODIS is the only one in this instance—
but if you want to use any of the other systems, like NCAR, you can actually go out and use the
FetchClimate dot org one and see what’s available to you there. Thank you. Sorry.
>>: You've downloaded the imagery from the satellite; is that right?
>> Marty Humphrey: The MODIS raw data is stored on a couple different NASA FTP servers. We pull
that into Azure.
>>: Do you have any idea what's the delay from when the data is acquired to when the data is available? [laughter]
>> Geoff: The … there’s a system with the MODIS data that you can get it within about an hour.
>> Marty Humphrey: Is that right?
>> Geoff: Yeah.
>> Marty Humphrey: Interesting.
>>: Though, because of the cloud cover and things like that, you have to … it depends on
what you're looking for.
>> Geoff: Right. But the … but there is a mode of—it’s through what’s called the LANCE system—that if
you’re really interested in real-time …
>> Marty Humphrey: But within an hour …
>> Geoff: … sort of acquisitions, you can get it within about an hour…
>> Marty Humphrey: Wow.
>> Geoff: … acquisition. Now, when you’re getting it from the NASA archive, it’s typically a couple of
days, but … so it really … so the answer really depends on whether you’re trying to do something in real
time, which you’re not. But if you really want to do something—like, if you want to know what the
Earth … what the rate is at which the Earth is sweating now, [laughter] then you can do that.
>> Marty Humphrey: You don’t have to go down to my level of discourse but, you know …
>> Geoff: Right.
>> Marty Humphrey: … I appreciate that.
>> Geoff: Yeah.
>> Marty Humphrey: You don’t have to refer to the Earth sweating if you don’t want.
>> Geoff: It’s fine. It’s a good term. [laughter]
>> Wenming Ye: We have another question from Scott?
>> Scott: Yup. In your workflow, you had switching between the HPC and the cloud … bursting out to
the cloud. Was it a local HPC that you had?
>> Marty Humphrey: It was, it was.
>> Scott: You had the head node in the cloud?
>> Marty Humphrey: Right, so our first creation of it was ad hoc, using new Azure mechanisms—this was
2009. About a year after that, we said, "Hey, a lot of what we're doing manually we could get the
Windows HPC cluster software to do," and so we re-architected it for that. We ran that locally within
our enterprise, and then the fine folks at Microsoft released a new version of that software
that was able to run directly in Azure. And so at that point, we just started running the head
node within Azure, and then we would sometimes call back into our enterprise resources.
>> Scott: So is it separate or is the head node actually calling the local HPC cluster?
>> Marty Humphrey: We have different variations of it …
>> Scott: Or is it parallel?
>> Marty Humphrey: We have different variations of it, and our most recent one is: the head node is
sitting in Azure and it’s just using compute nodes in Azure, so we really have no need to come back to
our rapidly-aging hardware within our enterprise.
>> Scott: Oh yeah.
>> Wenming Ye: We have another question from Tanya?
>> Tanya: Can you put the slide back about query runtime estimation? No, the one before
that. Yeah. So …
>> Marty Humphrey: Right.
>> Tanya: … how generalizable is that to other domains? Because—as from the previous talks—data
science is a big part, and scaling up is a big part of science, and so estimating query runtimes, I
expect, is gonna be a big issue, as is giving that kind of feedback to scientists. So how do you do that?
Can you talk a little bit of how you do that now?
>> Marty Humphrey: I agree, generalizing the duration of scientific experiments at scale is insanely difficult
right now. The only way we can do it, or the only way we can attempt to do it, is:
given a detailed understanding of the domain and the data—the raw data that's being stored—we'll be
able to figure out how many blocks are retrieved from Azure storage, where they are, the resource
selection policy within Azure, that sort of stuff—very detailed, admittedly very finite analysis. I agree,
it’d be very nice to be able to generalize this, but …
>> Tanya: Computational complexity on the cloud, you know?
>> Marty Humphrey: Yeah, again, there's a very general theme that there's so much capacity that
you can sort of do stuff, and now you actually want to predict how long it's gonna take, and
that’s incredibly difficult right now.
>> Wenming Ye: Okay. Thanks, Marty. [applause]