>>: A lot of the [indiscernible]

>> Wenming Ye: Well, good morning. Welcome back. My name is Wenming Ye. I am a senior research program manager, and I am part of the team here at the Azure for Research program. So the first session here is on analytics—analysis and simulation—and I have a personal interest in HPC, as well as big data. So let me introduce three great speakers here today, and the first here is mister … doctor Roger Barga. He is a senior manager on one of the machine learning teams here, and he was formerly a Microsoft Research member.

>> Roger Barga: ’Kay, thank you. Alright, so hopefully the microphone is working now?

>> Wenming Ye: Yeah, awesome.

>> Roger Barga: And if the screen will come up? So I think everybody here has a strong interest in e-science, and it’s something I share as well. And what I’ll be talking about, really, is the intersection of e-science, data science, and machine learning. And it’s interesting, if you look at these companies that call themselves data science companies, to ask, “Where do these people come from? Who were their first hires?” DJ Patil—who basically built the LinkedIn data science team and has written several books on that—came from the National Labs, and many of the very first people at data companies—at Facebook, Microsoft, and other companies—come from the National Labs, because what we teach in e-science—how to get insight out of data, how to manage large amounts of data, make sense of it, wrangle it into some shape or form—is exactly what you need for data science. And here you see something interesting happening in the data science community: we’re formalizing curriculum. I helped build a curriculum for data science in collaboration with Bill Howe back there at the University of Washington. I’ve since had a chance to talk to other universities about how to build data science curriculums, and you look at the rigor—in fact, at the University of Dundee, I talked with them as well about how to build a data science curriculum, and they’re actually spawning it off of their e-science group—and it’s interesting to ask: what can the e-science community look back at? What trends and tools that are rising in data science can you tap into and pull back into the sciences, so maybe there’s a virtuous cycle going on here? So my talk’s a little bit about that.

What I wanted to share is a little bit about a Microsoft perspective, and in particular, about machine learning, ’cause there is a silent revolution—it depends on what size of company you’re at and where you’re working—but there’s a silent revolution going on in computer science around machine learning. And in fact, we’ll argue—and Gartner, Forrester, and others who are tracking this say the same—that it’s going to be one of the most impactful technologies we have; it’s going to have the most impact on IT and systems over the next ten years—machines that actually learn from data. And we’ve got lots of that.

Just to frame it—just to step back a little bit and ask what motivates this: one of the first, most successful applications of machine learning—it’s kind of a preamble we put in a lot of our talks—goes back to handwriting recognition and the post office.
And the very first systems that would do this—optical character recognition—were basically heuristic, rule-based systems: imperative, declarative code that would actually look for the letter, consult a rule base, look at the features, and of course, every variation that was thrown at the system and broke it, they opened up a bug request and a feature request, and a developer had to write more code. Clearly, this was not sustainable in the long term, but that’s how we knew how to build systems, to the best of our knowledge. And it wasn’t until—I would say—the late eighties, early nineties that they started to treat this as a machine learning problem. And in fact, DARPA actually had funding for people to do work in this area. They said, “Hey, let’s actually take a machine learning approach. Let’s just take every instance of the letter—we’ll build a training set for it.” Because again, machine learning systems can start out real crappy, but they get better with experience. And this creates the training set—the experience—that the machine learning system is going to learn from. So we’ll take every known example of this letter that we can find; we’ll use various techniques to rotate it and add noise; and of course, the labels will stay the same—they’re all just variations of a two. We’ll then feed this into a machine learning system, and voila, out comes a pretty okay—you know—system. We didn’t have to write a line of code; we treated this as a data problem; and we used machine learning to understand the correlations and signals inside of the data. And these are the systems that are deployed today. And in fact, if it fails—what happens if a letter gets kicked out? Well, somebody actually labels it, puts it back in the system, and turns the crank again, and the system just gets smarter. So mistakes actually make the system smarter and improve its performance.

In fact, a modern-day implementation of this is the Bing language translator that you can download off the Windows Azure store. No matter where you’re travelling, you can take a snapshot of a menu, and it’s going to translate it into the language that you choose—all through machine learning. Again, no lines of declarative or imperative code written; it was treated as a machine learning problem.

Now a more modern implementation—something you just could not have done by writing code—is the Microsoft Kinect. Now obviously, this has been a great commercial success, but step back and look at the technology advance that Microsoft Research had to contribute to make it happen: think about all the different body postures—all the different body shapes—that have to be recognized—I mean, this is a really, really diverse dataset—different ages, different obstructions, pulling the background from the foreground, people overlapping each other. This whole implementation for the Kinect sensor on the Xbox was treated as a machine learning problem, where they captured training samples from individuals wearing a suit like this. Close to eight hundred million were actually captured, and even that wasn’t enough to get the performance out of the machine learning system that was needed for the Kinect sensor.
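A minimal sketch of that augmentation idea, assuming a labeled digit image stored as a numpy array—rotate it, add noise, keep the label. The function name and parameters are illustrative, not from any production OCR system:

```python
import numpy as np
from scipy.ndimage import rotate

def augment(image, label, n_variants=10, seed=0):
    """Yield noisy, rotated variants of one labeled digit image.
    The label never changes: every variant is still a 'two'."""
    rng = np.random.default_rng(seed)
    image = np.asarray(image, dtype=float)
    for _ in range(n_variants):
        angle = rng.uniform(-15, 15)                      # small random rotation
        variant = rotate(image, angle, reshape=False, mode="nearest")
        variant += rng.normal(0.0, 0.05, image.shape)     # additive pixel noise
        yield np.clip(variant, 0.0, 1.0), label
```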
So they actually used simulation and other machine learning extensions to create a larger training set—just like we did with letters in handwriting recognition—until we had close to two billion training samples, which were fed into a machine learning service to then actually create the Kinect sensor. And that’s actually just a subset of the information that comes off the Kinect sensor. It’s really striking how much information it’s pulling: how hot you are, whether your blood pressure has gone up, whether you’re perspiring—so they can actually predict how many calories you might be burning. Again, it was all treated as a machine learning exercise. So as we think about some of the problems we face in science, where we’re awash in data, and the chance to perhaps label that data and say, “This person has cancer; this person is developing this”—if we get enough data, could we treat these as machine learning problems? Which is a very compelling value prop. So I think—you know—machine learning allows us to solve extremely hard problems better, extract more value from big data, and really drive a shift in engineering culture. You don’t have to get it right once, but how do you set up a capture cycle, a retrain cycle, and a feedback loop for your system, so that you can actually start building the corpus of training data that you need?

But let me transition a little bit into data science and talk about a more recent problem we had to face. And this was hosted Exchange for Office 365 in our Dublin datacenter. Once again, even Microsoft waded in with basically a handwritten rule set—about five hundred rules—that monitored the servers, the disk drives, the CPU, the network, and the operators promptly turned the alert system off, ’cause it was wrong most of the time, raising false positives and actually missing the really bad stuff that brought us down for periods of time. And so what they turned it into is basically a data capture, where they captured time series data coming from each and every one of the sensors, and when an error occurred, all the operator had to do—since they knew exactly what went pear-shaped—was label it, press a button, tell us what happened. We’ll capture all this data; we’ll send it off to a machine learning group; and basically, from hundreds of thousands of machines and hundreds of metrics, figure out which signals actually correlated. And we actually built an early warning system out of it—the previous system’s been completely discarded; the machine learning system provides the dashboard—and when new events that were not anticipated occur, we actually build a new training set for them.

But that’s really not the whole story. There’s really something going on here behind the scenes, and that’s about data science. If you look at all the applications that I’ve just described to you and some of the more modern applications across the company, there haven’t really been—with the exception of deep neural networks—any real advancements in machine learning. They’re the same algorithms. Yes, we have lots of data; yes, we have scale; but what’s really coming into play here—’cause again, it’s kind of the usual suspects—there’s a survey of twenty-five hundred predictive and machine learning applications that have been deployed in industry, and the usual suspects, the top eight algorithms, surface in most of them.
So it’s not like the machine learning technology’s changing, but what is happening: because of more data and our ability to pull better features, in the right hands, a very simple algorithm can actually have incredible predictive power. This is where the data scientist comes in; this is where somebody who can actually make sense of the data comes in. We didn’t actually keep each and every point; we used time-series smoothing; we created features from this—the high and the low for a ten-minute window—to create a good set of features, so even the simplest linear algorithms have a chance at actually predicting that the system is going pear-shaped. This is the role of the data scientist—that’s what I’m trying to get at.

This is not something new, by the way. Again, I don’t know how many of you have tracked this, but the first real mention of data science was probably Bill Cleveland back at Bell Labs—he actually wrote a data science action plan back in 2001—and he was proposing a curriculum program to get this woven into academic institutions. JISC did a report in 2008, and NSF had a report as well in 2005, all talking about data science, primarily in the context of the sciences, not business. It’s just that business has been the one to pick it up and embrace it. And there are a couple of journals, so this data science thing is not really new. The term and the buzz have been coined in the last few years, but this has actually been talked about and thought about for a while, and in fact, Bill Cleveland’s original plan is actually still the one to follow in terms of how to build a curriculum program for data science.

But there’s something going on here. I want to talk about the complexities that these people face and think about the tools we should build to support them. You know, it’s a very different workflow—it’s not BI. You have to define an objective function. Am I trying to predict failures? Am I trying to predict churn? And what’s a proxy in the data for that prediction? ’Cause guess what? There’s probably not going to be a column saying, “I’m going to fail” coming out of your machine. There are probably a number of columns that you’re looking at which are indicators of the health of the machine—similar for patients, or similar for anything you’re analyzing. You have to get smart at looking at your data, saying, “What’s the objective? What’s a proxy in my data for what I’m really trying to predict?” Then you get lots of data, and you have to realize you have to throw some data out; you have to massage it and create features; you have to sample, create moving windows—I mean, you have to bring that data down to something that the machine learning algorithm can make sense of. We spend a fair bit of time talking about that in our data science program: what are the common patterns that one sees? And then finally, frame it as an ML problem. Are we gonna treat it as classification—binary or multi-class—a regression, a ranking? Are we just gonna do time series forecasting, or maybe some combination? This is what data scientists have to do; they have to be able to shape a problem from raw data into something which is tractable with a machine learning algorithm, and build a modelling dataset—which may be wrong—and it’s a highly iterative process.
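To make that feature-creation step concrete: a minimal sketch of the windowing described above for the server telemetry—smooth the raw signal and take the high and low over a ten-minute window. The file and column names are hypothetical:

```python
import pandas as pd

# Hypothetical telemetry: one CPU metric sampled every 30 seconds,
# indexed by timestamp. File and column names are made up.
telemetry = pd.read_csv("cpu_metric.csv", parse_dates=["time"], index_col="time")

features = pd.DataFrame({
    "cpu_smooth":   telemetry["cpu"].rolling("2min").mean(),   # smoothed signal
    "cpu_high_10m": telemetry["cpu"].rolling("10min").max(),   # high over a ten-minute window
    "cpu_low_10m":  telemetry["cpu"].rolling("10min").min(),   # low over a ten-minute window
})
# Features like these give even a simple linear model a chance of
# predicting that the machine is about to go pear-shaped.
```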
So once you’ve got the data and you build a training set—be it for handwriting recognition or a server about to fail—you do some feature selection, which means you’ve gone from a big number of columns down to something relatively small, and you train a model. And then you say, “Okay, well, how’d that model do?” You hold out some data—show it data it’s never seen before—and see if its predictive accuracy is high. Evaluate that, and—you know—maybe you have to go all the way back and redefine the objective function or go get more data. This is a highly iterative process today—a lot of experimentation. In sum, if you step right back, what we’re really teaching people is how to weave experimentation and data together, and get very used to running experiments and having a hypothesis over data. And again, as we’ve talked to data scientists, they have their favorite algorithms; they like to choose between them; no single one is good enough. In fact, a single algorithm in the right hands can probably do it all, for an individual who knows how to make that algorithm jump through hoops, but talk to a different group of data scientists, and they’re gonna want different techniques. So you really have to give them a variety of ML algorithms. And then finally, it’s about experimentation. Just because you have a dataset that predicts well—well, you know what? If you chop the dataset slightly differently, into a different training and testing set, the results might go pear-shaped, or they might be just as good, and so this is really about running fast experiments and trying out new ideas really fast, over and over again. These are the elements of data science: a very distinct workflow, having an intuition about what carries signal in the data and pulling that out, trying it out, trying different ML algorithms, and running an experiment very fast—actually seeing if it has predictive capability.

And the enterprises who are looking at this—and I’d even argue the labs who are looking at this—are going, “Holy cow. First off, we have a shortage of people who know how to do this.” And we at Microsoft are looking at this and saying, “Hey, this is just too complex today. We need to think about how to make this simpler.” I know some of you in research—I’ve seen a few research proposals—are thinking about how to make this simpler and build tooling. So what I’d like to show you today is actually something we’ve been working on—because again, I’d really like to get my e-scientists or my data scientists out of software development; if I can minimize how much code they have to write, I can increase their productivity. I need to give them tools for data exploration—that’s how you get the insight: “Oh, that’s an interesting feature; it’s actually correlated with the target I’m trying to predict. That’s good; I’m gonna try it.” You know, I want to make that available, but you shouldn’t have to learn five different stats packages, ’kay? Machine learning: if you talk to people who really do this for a living, it takes two or three packages to have the right algorithms, to have that breadth—I’d really like to make that all available in one place—and then good experiment and data management. If we could actually provide services for all that, then I think we could put this within reach of mere mortals who understand the science in the problem and can actually go through and do it.
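That iterate-fast loop is easy to picture in code. A minimal sketch, assuming scikit-learn and a stand-in dataset: hold out a test set, then sweep a handful of candidate algorithms and compare held-out AUC:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Try every algorithm you know of within a few minutes: same data, same metric.
for model in (LogisticRegression(max_iter=1000), GaussianNB(), RandomForestClassifier()):
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{type(model).__name__}: held-out AUC = {auc:.3f}")
```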
So let me show you something we’re working on, just to give you a glimpse and maybe some inspiration. I’ll make this really quick. So if you have a web browser—one of the first things we’re trying to say is you shouldn’t have to install any software; you should be able to go to a cloud service, where your data can be uploaded and you can have scalable storage—and I could have used Chrome or anything else. And the first place I land is a repository for all my models; it’s the project space that I’m working on. And I may actually be working on multiple project spaces, and I can choose between them and go over to my colleague’s, who’s invited me to help him solve a problem. So if I’ve got two or three projects I’m working on, I should be able to snap between them fairly seamlessly. And in fact, if I decided there’s another PI I want to invite, I should be able to go over here and say, “You know, I’d like to actually make an invitation,” take your e-mail address, type it in here, and say, “Okay, join me.” And you’ll get an e-mail in your inbox, and with your browser, you could join me here, and together we could start working on the modelling. Because really, data science and e-science are about collaboration and pulling the right people in.

But let me go through—and I realize I’m somewhat time-constrained, so in the interest of time, let me show you just how fast we can build a model together. If I can get a dataset up there, I can say new dataset, and I can put it in from my local file system—’cause maybe someone’s just given me a sample of data—or I can say I want to do a new model, start one from scratch—we call them experiments, ’cause again, data science is about experimentation. If I wanted to read data from Azure storage, I don’t have to know how to talk to Azure storage; I can just write my credentials in here and write the query. If I wanted to read from Hive, I can just select Hive and type my Hive query in here, and read data from Hive. If I wanted to do it from SQL, I could just cut and paste my SQL query here, and this thing will handshake and run a protocol with SQL to pull the data into it. That just takes a lot of programming off my plate. Or I could do something really simple, like take a dataset that I’ve already uploaded up here. And if I wanted to look at that dataset …

>>: Question …

>> Roger Barga: Yes, please?

>>: So are you downloading data into this environment from the cloud, or …?

>> Roger Barga: Yes, yeah.

>>: Okay.

>> Roger Barga: So I can even pull it from on-prem, and if it’s a big dataset, I can run samples and pull a sample up here. As a data scientist, you want access to all the data you need for your job. And if I double-click on that dataset I’ve just pushed here—there’s this beautiful client-cloud connection—I can actually open a local client. And what we have here—we’re gonna keep this in the e-commerce business space, not the science space—is two years of whether somebody’s purchased a bike from me or not. And here are all the attributes, and it could just as well be whether or not somebody got sick, or whether they churned out of my company. It all comes down to something I’m basically trying to predict, and features I have over that individual. And the real question is: can I build a model that has predictive capability?
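For readers following along outside the demo, the reader-module step corresponds to a couple of lines of ordinary Python. A minimal sketch, assuming pandas and SQLAlchemy; the connection string and table name are made up for illustration:

```python
import pandas as pd
import sqlalchemy

# Hypothetical SQL Server connection; in the demo this is a drag-in "reader"
# module where you paste credentials and a query.
engine = sqlalchemy.create_engine(
    "mssql+pyodbc://user:password@myserver/mydb?driver=ODBC+Driver+17+for+SQL+Server"
)
# Pull a sample rather than the whole table when the dataset is big.
sample = pd.read_sql("SELECT TOP 10000 * FROM BikeBuyers", engine)
print(sample.shape, list(sample.columns))
```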
So let’s just go ahead and close the Excel sheet, and—you know—I might be really curious: well, what’s in that dataset? Any missing values? Do I have any mess to clean up? And notice that it basically just tells me what I can connect up, and if I hit run, this is the first time I’m actually doing any work on the cloud. I send that DAG up to the cloud service, which has a VM pool standing behind it—it’s done; it’s created HTML5; I can click on this … come on; and there are my columns that I was showing you: bike buyer, the columns, how many unique values they have. I can see that there are no missing values … oop, there are a few missing values; I’m gonna have to clean some of those up. And so I can basically pull in a missing-values cleaner. So again, these are the kinds of things that you learn you need to clean up. Go ahead and just run the dataset through this, and I can actually say replace with the median. That’s good enough cleaning. And now, if I want to actually train a model, I’ve got to build a training set. And I can tell it what percentage I actually want to use for my training set, and for those in data science or machine learning, they know stratification’s sometimes a big deal; we can actually stratify based on some point in time or a column. So all my stratification procedures can be chosen through a drop-down window as well, which I won’t bother with at this point—I’ll actually keep that false. But I will train a predictive model; I will score it; I’ll send the training set to the trainer; I’ll choose an ML algorithm—I’ll try a very early one, a simple neural network; we’ll see if we can’t improve upon that in a second—and the trained model comes down here to the scorer; and now, there’s the thirty percent that this model’s never seen, and we’re gonna give it a pop quiz—we’ll just run that down here. And this guy needs a little help—he needs to know, basically, what column we’re trying to predict. So I basically type bike buyer, and assuming the demo gods are kind and our service team refreshed the service well overnight, I click run, and this should go through and actually train a predictive model for me … in fact, one thing I’m gonna try to do here really quick … I didn’t add a metrics module; I didn’t build the nice graphics, but let this run for a second. It’s training right now, splitting up the data. Now the thing about this DAG is: maybe I pulled a Hadoop job in; maybe I’ve pulled a database query in. The runtime behind the scenes actually moves data across these different runtimes … there’s a question.

>>: Is this all coming from a single VM or across several nodes?

>> Roger Barga: A pool of virtual machines. So we actually look for any parallelism we can exploit, and we exploit it. And in fact, we’ve got some scale-out Hadoop algorithms. I need to add a binary-class metrics evaluator—something that actually makes some nice graphics for me. So assuming this model runs … and if I hit run, a couple things to note: only the things that changed get re-executed, ’cause it’s my experimentation platform—I may just be working on trying to train one or two modules. And in fact, I won’t get into it now, but I could go all the way back and do lineage tracking from the very first run. If I click on this, that’s my ROC curve, and for those of you in machine learning, you know that on that diagonal, this thing would have been guessing.
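The same pipeline—clean missing values with the median, split seventy/thirty, train a simple neural network, score the held-out thirty percent—looks roughly like this in scikit-learn. A sketch only: the file name and label column are assumptions standing in for the demo’s bike-buyer dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

df = pd.read_csv("bike_buyers.csv")                    # hypothetical export of the demo dataset
df = df.fillna(df.median(numeric_only=True))           # the "replace with the median" cleaner
y = df.pop("BikeBuyer")                                # the column we told the scorer about
X = df.select_dtypes("number")                         # keep it simple: numeric features only

# 70/30 split; stratify=y is the drop-down stratification option.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0)

model = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]             # pop quiz on data it has never seen
```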
So I’m doing a wee bit better than guessing, and this is my so-called confusion matrix; it shows me where I’m getting things right, and my type one and type two errors. And if I was really curious, I could actually click these and see the actual tuples that I’m getting wrong. Or I may just want to try another idea—and this is about the experimentation. You know, I know that boosted decision trees are actually very popular and can be very accurate. So let me just copy and paste a subset—I’m gonna replicate the pipeline. I’m not training a perceptron this time; I’m gonna train a boosted decision tree, and I’ve just shown you the ability, within a few mouse clicks, to try another experiment, and hit run. This is the experimentation part; I should be able to try every ML algorithm that I know of within a few minutes on this dataset to see if I can’t improve the performance. And then I’d start doing feature engineering—but I can basically run a hundred experiments in a day, and when you can put data scientists into that immediate-feedback zone, things start to happen. You start to get models and predictions built. And this is done. And like I … and we’ll see … almost done … come on, build that nice graphic. That’s a much, much better result—in fact, a good approximation of how this model’s gonna perform is the area under the curve—it’s eighty-six percent. This model has predictive capability. So if I’m trying to predict whether someone’s going to get sick, or whether this is a certain anomaly, I have that ability—and it’s highly interactive: I may want to run this model at different confidence levels, because maybe I don’t want any false positives, or maybe I’m okay with a few false positives—and see how the confusion matrix changes. If I’m sending fliers out to somebody and wasting a quarter, I’m okay with a few false positives, so I’ll go ahead and operate the model very aggressively. False positives go up, but false negatives go down; I don’t miss anybody. But if I’m about to tell somebody they’re getting sick, I may want to be very careful with the false positives, and really, really try to push this model down into a conservative zone and run those people through another test, just to make sure they’re not really sick. This is the kind of agility we’re trying to bring to data science.

And what’s more, if you look at what we’re thinking about as a group—together with our friends in Microsoft Research Connections and elsewhere—what if this was a domain-specific library? What if the modules—the datasets that you can read naturally—were those that are specific to your domain? So if you want to pull in a dataset for your domain, we have a reader for it, or we have a module for it. Or starter algorithms that represent a starting kit for your domain, whatever that might be. And how do you build templates so people can come in and have a workspace that’s being shared across an entire community? The datasets are already there, or there are readers for those datasets, or readers for the web services that the community’s already stood up, and you can start building libraries of models that that community can share as a research community and build up over time.
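Both moves in that demo—swapping in a boosted tree and sliding the operating point—are a few lines each. A hedged sketch using scikit-learn’s gradient-boosted trees on stand-in data; the thresholds are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=0)       # stand-in data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# The "boosted decision tree" module from the demo, roughly.
model = GradientBoostingClassifier().fit(X_tr, y_tr)
p = model.predict_proba(X_te)[:, 1]
print("area under the ROC curve:", round(roc_auc_score(y_te, p), 3))

# Operating aggressively vs. conservatively is just moving the threshold
# and watching the false positives and false negatives trade off.
for threshold in (0.3, 0.5, 0.7):
    print(threshold, confusion_matrix(y_te, p >= threshold), sep="\n")
```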
So we’ve already got a couple of groups doing exactly that. We have an SDK that allows you to build some of these modules yourself—the ones I’m showing you here—to read from data sources that you’re familiar with. Techniques, too: if you already have analytics algorithms—R packages you’re using—we allow you to wrap up your code and create reusable modules that other people can start using. So for those of you in workflow, it’s very much like workflow for machine learning and data science, all at cloud scale. So that’s what I wanted to show and share, just for thoughts. And for those of you who are really interested in predictive analytics—which is something I’m really interested in right now—I’ve brought you a reading list that you can have a look at, with some books which—I think—are great starter books. With that, I’ll stop and field any questions people might have.

>>: What’s the status of that project?

>> Roger Barga: So right now, it’s being used internally for a number of our internal projects. We’re actually doing machine analysis for our GFS datacenters, churn analysis for Dynamics, and we’re thinking about how we can bring this forward, both for researchers and also for our customers as well. Yep. If we do in fact—and we are thinking very intention … I’m being very conservative about this, but maybe at the faculty summit we might have more to share—on how to have free access to this for academics, along with a curriculum around data science—slides that show how to actually use this in a data science curriculum—and free access for academics to be able to use it to teach data science. But like I said, we’d also like people to take the SDK and really drive it home for a vertical. ’Cause as a cloud service, it can be available to researchers anywhere who don’t have any more than a PC or a tablet—sometimes, I give demos on an iPad and just do the connections and run the models; it’s kind of fun.

>>: So like, if instead of … so suppose I wanted to predict one image …

>> Roger Barga: Yes.

>>: … from another.

>> Roger Barga: Yeah.

>>: I can adapt …

>> Roger Barga: So GE …

>>: Repeat the question.

>>: [indiscernible]

>> Roger Barga: Oh. So to repeat the question: the first question was on availability, which I think the answer was self-explanatory. The second one was basically about images, and we in fact have a company—one of our preview companies that’s previewing it for us externally—that’s actually taking two-dimensional images of tumors and trying to classify them. Now, what was the code? Where did they use our SDK? They actually take those two-dimensional images and turn them into a feature vector—that was their intellectual property—and everything else, they just wanted bog-standard ML, like most companies do. So they took the SDK and their code for actually looking at images—reading them in from blob storage—and converted each one into a feature vector that captured all the features of that image, and now they’re running large volumes of cancer images to get a model that will predict a cancerous tumor versus non-cancerous by its shape, its attachments, and other features that they were able to get off of it. So yeah, very applicable. Yes, please?

>>: Are you building collaborative features into that? So like, you share that workflow with me, and I want to come look at it, and we can sort of chat.
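The image-to-feature-vector step was that company’s intellectual property, but the general shape of it is easy to illustrate. A toy stand-in, assuming a 2-D grayscale tumor image as a numpy array; every feature below is invented for illustration:

```python
import numpy as np

def tumor_features(image, threshold=0.5):
    """Toy stand-in for turning a 2-D grayscale image into a feature
    vector. The real features were the company's IP; these are invented."""
    mask = np.asarray(image) > threshold                      # crude segmentation
    area = mask.sum()
    edges = (np.abs(np.diff(mask.astype(int), axis=0)).sum()
             + np.abs(np.diff(mask.astype(int), axis=1)).sum())  # rough perimeter
    compactness = edges ** 2 / max(area, 1)                   # shape-irregularity proxy
    return np.array([area, edges, compactness, np.mean(image), np.std(image)])
```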
I’m like, “Hey, you should really add in this other type of visualization here.” So that …

>> Roger Barga: Yes.

>>: … you know, this SharePoint web interface becomes sort of this scientific gateway.

>> Roger Barga: We are, and in fact, we draw a lot of inspiration—for those of you who are familiar with the myExperiment project, which I think pioneered the way of sharing DAGs, sharing workflows—combined with, basically, how could I annotate? For example, how could I annotate a model and say, “I’m stuck here,” or “I’ve cleaned the dataset in this way”? Think sticky notes you could apply to datasets to keep context. You’ve logged off for the night and gone to bed; somebody else gets up; they log into the workspace; they see not only what you changed, but also the context, the notes you’ve left behind for them. We’re even stepping back and thinking about a social site we could add, so people can actually just blog and ask a question, like, “How the heck do you use this dataset or this model?” And other people that aren’t even in your workspace, seeing what you’re doing and sharing, can actually ask questions as well. ’Cause we think it’s new enough—we think there are enough people interested—but it’s new enough that people need a large support network. And if we could light that up around this service, we think that would actually change things. Yes, Bill?

>> Bill: The number of boxes that you need, I can imagine sort of converging downstream of the pipeline. The number of different algorithms out there doesn’t move all that fast, so you can get a nice library of ’em and get a nice [indiscernible] provision. But sort of the upstream stuff—I mean, the data cleaning, and feature engineering, and ingest kind of phase …

>> Roger Barga: Yes.

>> Bill: … seem like each problem might need a new box to be …

>> Roger Barga: Yep.

>> Bill: … created. So—you know—an SDK plus C sharp, I guess, is sort of the approach right now. Is there room to sort of make it easier to create new boxes, or is there different support than just what Visual Studio offers?

>> Roger Barga: Yeah, that’s a good question. So some of the stuff we’re working on—again, we’ll see if it’s still up—is basically apply-R. So I can bring up an arbitrary R box at any point in time, ’cause you’re right: feature creation, and transformation, and data munging is something where you’re never gonna put the genie back in the bottle—you’ve really got to let people use Python, R, and write their own scripts, but then say, “Now that’s reusable by everybody.” In fact, we even have a drop-down menu of things that they can do, and we log them. Now that becomes a black box: if you run another dataset through, a redo log applies to it. There’s something called Monaco Tools, which is an internal prototype we’re doing, which is kind of a pop-up C sharp, F sharp—you know—interactive; you can type your code in, so if you’re more proficient in one of those, we’d like to pop one of those up and allow you to change your data that way, but then save that module so other people can use it. I’d like to keep people from having to open Visual Studio—it’s scary for a lot of people—I don’t even like it anymore after having played with tools like this for years. Please.

>>: How do you see something like this playing into, like, reproducible research? I mean, can you take snapshots of that, and save the data away, and share that in sort of a myExperiment kind of way?

>> Roger Barga: Yup.
So one of the things for reproducibility—and again, my service is unhappy right now for some reason … okay—is that basically, I can take any one of these models, and if I decide that it’s the gold standard on which my lab is gonna be based, I can lock it, clone it. I can actually go back through the runs and do lineage tracking of what people did at each step, and guess what? We’ve persisted the data at any one of these pins right here, so if you wanted to say, “Well, exactly what were they looking at at that point? What prompted that?”—you can. Now, this comes at a cost, so we allow scientists or researchers to turn this saving on or off, or throw the dataset away, but I want to give you the complete ability to go back over three months of research and say, “Dang it, I’m gonna fork off of that,” or “Now I can see exactly what they did to the raw data, how they ran their experiments, their thought process,” to the point where you can actually publish it and give the whole model to somebody else. So I see that my handler appears ready for our next speaker. So thank you very much for your time and attention.

[applause]

>> Wenming Ye: So thanks, Roger. Roger’s from our cloud machine learning group, and he’s a principal manager over there. Our next speaker here is doctor Marty Humphrey, and we’ve been working together for quite a long time with the Azure for Research program. Doctor Humphrey has been experimenting with one of the tools that we built here at Microsoft Research, called FetchClimate, and he’s gonna talk to us about how you do large-scale visualization on an environmental dataset. And Marty, it’s over to you.

>> Marty Humphrey: So I was—thank you, Wenming—I was just checking to make sure that the system was responding, so I could actually give the summary right now. So in summary, this is what we built.

[laughter] [applause]

Okay, fine, I’ll give you the talk. [laughter] I’ll return to that. We’ve had the pleasure of many years of working with Youngryel—he was at Berkeley; he’s at Seoul National University now. He’s largely the driver of the science; I’m a computer scientist, and so I tend to not know what’s going on, but I’m used to that. So to give you a little history of the project: we’ve been doing this since 2009—a fair number of you have seen various people talk about this, so I’ll zip through the beginning material. It started with my student going to Microsoft Research for a summer internship; he met various people at Microsoft Research, notably Catharine—who was driving a lot of the science behind this—and Dennis Baldocchi from Berkeley as well. What they were trying to do was large-scale environmental analysis based on satellite imagery. The satellite was MODIS—you know, if you have any questions about MODIS, we have the expert in the audience, Geoff, so ask him anything you want. Essentially, these things are spinning around the Earth producing imagery, and there was a desire to combine the different products into some sort of analysis of what was going on with regard to the Earth. Youngryel was particularly interested in evapotranspiration, which is shown here. For years, I referred to it as the breathing of the Earth, until one of my students actually corrected me and said it’s more like the sweating of the Earth—which I liked; I don’t know if it’s true. Geoff can correct me later.

>> Geoff: Yeah, that’s good. Yeah.

>> Marty Humphrey: Sweating?
Okay, so [laughter] the nature of the problem statement was: can we do some large-scale analysis based on the satellite imagery, and focus on evapotranspiration among other things as well? So the nature of the problem statement was: we’re gonna have these satellites that are spinning around the Earth producing a whole bunch of pictures, essentially, and we’re going to divide the Earth into a horizontal and vertical grid, and we’re gonna focus on analyzing what’s happening in each of the regions of the Earth—and this is very much lots of data, data parallel. And so the emphasis in 2009 was: can we actually use the new Microsoft cloud as the basis for doing the large-scale data analysis? So the group, which I was only briefly involved with at the time, created a data pipeline and analysis system inside the burgeoning Azure system at that point, and the nature of the analysis was to download a whole bunch of raw data files, primarily from NASA, and combine them inside Azure in a two-phase analysis. The first phase was sort of data preparation, where the data had to be reprojected onto a different gridding system, and the second phase was a large-scale analysis combining the different products from the satellites to create some sort of analysis of what was going on, again, at Earth scale, essentially. And behind the scenes, we architected it initially to run entirely in Azure. We have since changed that a little, to have it be able to run within our enterprise as well and also burst onto Azure as necessary. So we re-architected it at some point—basically, our head node was in Azure, and we had the ability to come back onto our enterprise cluster or to burst onto Azure itself when we needed to do more analysis than we had physical machines within the enterprise to do. And of course, we had a rudimentary web interface through which you could submit your analysis jobs, and they’d land on the various machines in the back end. This is the picture from the head node that’s running inside Azure.

So at this point—a couple years ago—the system was working; we were doing reasonably large-scale analysis—a lot of hand-holding, watching the software, making sure that it ran okay—but basically, we were producing the data we needed. This is from Youngryel in particular; he produced this visualization—again, this was a couple years ago—and the real problem at this point was that we had to suck all the data out of Azure and bring it down onto the desktop, and then create the visualization via MATLAB. And this decoupling of the data generation and having to pull it down from the cloud was kind of a real pain, and so we tried to look at how we could make it more end-to-end. And so around that time, Youngryel in particular was approached by Microsoft Research, saying they had this really interesting new tool called FetchClimate, and could Youngryel try to integrate his data generation with this new tool. And so that was the real challenge that I’m really trying to talk about today: how could we take that data, and instead of pulling it out of the cloud and doing the visualization on our laptop ourselves, do it as a service and keep everything within the cloud itself? And there were a lot of computer science issues in getting this all working. So this is—jumping ahead a little bit, and then I’ll come back to the gory details—this is the interface that the user is presented with.
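Structurally, that pipeline is an embarrassingly parallel map over grid cells: prepare (reproject) each cell’s raw files, then run the science over each prepared cell. A sketch of that shape only—every function body is a stand-in, and the real system ran on Azure worker VMs and HPC nodes rather than local processes:

```python
from multiprocessing import Pool

# Hypothetical tiling: the Earth divided into a horizontal/vertical grid.
GRID = [(x, y) for x in range(36) for y in range(18)]

def prepare(cell):
    """Phase 1 (stand-in): fetch the raw NASA files for one grid cell and
    reproject them onto the common gridding system."""
    return cell, f"reprojected-{cell}"

def analyze(prepared):
    """Phase 2 (stand-in): combine the satellite products for one cell
    into an evapotranspiration estimate."""
    cell, data = prepared
    return cell, 0.0  # the real science goes here

if __name__ == "__main__":
    with Pool() as pool:  # every cell is independent: classic data parallelism
        results = pool.map(analyze, pool.map(prepare, GRID))
```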
This is a service that’s running inside the cloud; we stood it up at MODIS FC2; it’s live now—this is a snapshot of the system I was showing a moment ago. It has a series of dialogues that you go through to, in effect, tell it what data you want to visualize, over what period of time, and for what region of the Earth, and then, essentially, you ask it to go and produce the data. So this is the “when” dialogue, and at this point, we would specify that we want to visualize this particular region. I jumped a little too quickly, but essentially, what I want to do is: for 2012, I want to do a monthly average of this particular region. And then you’d hit the “go” button, and it produces the visualization that I was showing a moment ago, which again, I’ll return to in a second.

I wanted to jump into some of the nice features of Azure that we exploit to be able to do some of this. This is the console—I’m sure many of you have seen this kind of console before if you’ve run stuff in Azure—this is the console that we use to control the primary service that’s running inside Azure, doing the data aggregation and visualization. And there’s a particular aspect of the UI that we’ve been using which is really nice—and both good and bad—which is the ability to manually scale the service in response to the anticipated load that we’re going to be generating. It’s a really nice feature to be able to do this dynamically. And so for example, this is just showing how we set the number of cores that are running the data analysis—we would set this up if we had a lot of visualization that we wanted to do, and then we could crank it down as well when we’re in a relatively more idle period.

So I wanted to show you a couple snapshots of it sort of at work. This is the region for all of 2012 … this is a particular region that we’re looking at—and again, this is the user interface from FetchClimate—and we’re really happy that the system is working right now—we have it integrated with our particular data analysis—and down here, you can select the different month that you want to see. And as you can tell, this is trying to show evapotranspiration; the visualization is, in effect, saying that in January, the Earth was not really sweating. And if you moved out to June, you can see the Earth is visually starting to sweat a lot more. We think—and this is somewhat obvious to this particular community—the real power of this is to easily and visually be able to determine how the Earth is changing. This is a particular year—2012—but the data has been generated by the MODIS satellites since essentially the year 2000, and so we even want to look at the short time period from the year 2000 to the year 2012 and ask: visually, can you see the Earth changing? And being able to visually assess this is really particularly good for, say, the less scientific amongst us. I guess that’s a politically reasonable way to say that. The system has the ability to deal at different granularities, controlled by the number of cells in the x and y region, and I just wanted to show you visually how diving in at different granularities changes things.
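The “monthly average over a region” query reduces to a small array operation once the gridded data is in hand. A minimal sketch with stand-in data—the shapes, dates, and random values are all assumptions, not FetchClimate’s API:

```python
import numpy as np
import pandas as pd

# Stand-in data: a year of daily evapotranspiration grids for one region,
# shaped (day, ny, nx), with a matching daily date index.
days = pd.date_range("2012-01-01", "2012-12-31", freq="D")
et = np.random.rand(len(days), 90, 180)

june_mean = et[days.month == 6].mean(axis=0)   # the "monthly average" the dialogue asks for
print(june_mean.shape)                         # one (ny, nx) grid for the month
```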
And so this would be the very coarse-grained analysis—the obvious negative is how coarse the visualization is, but it comes back fairly quickly—and as you want finer granularity, of course the queries take longer, and the service is pulling in more data and doing more data analysis, but I wanted to show you how deep we could go with regard to granularity, and I think I have a couple more. So this is showing a really fine-grained analysis of what’s going on. Our raw data is at the granularity of one kilometer; I’m not sure what the granularity here is. And then, a couple more.

So some more on the architectural details of what we needed to run inside Azure to get the visualization working: the final system that we had running—that’s part of what’s shown here—is partly development and partly production. We needed three virtual machines and up to two hundred cores for dealing with the data processing on the front end. We needed seven cloud services, three SQL databases, and ultimately—the data that we’re looking at now is all of 2012 for each of those regions—we have about two point three terabytes in Azure right now, just sort of waiting to be sucked in and visualized. I wanted to show, briefly, a little more of the architectural details within FetchClimate. Here, I make the snide comment about the color, but we’ll just cut some slack to the Moscow State people. But these are the architectural details of what’s going on with regard to FetchClimate, and essentially, the takeaway here is that it’s a fairly complicated system—a lot of moving parts.

So the title of my talk was “Experiences,” and this is where I start getting into the more negative aspects of what we had to encounter. At first, I decided to call them hiccups in the whole development process, but then I realized: these weren’t hiccups, these were actually research opportunities. [laughter] So here, they’re research opportunities, not hiccups. As you can imagine, the theme of the difficulties in integrating the systems was essentially all about scale. Things go wrong. The actual science that we were doing was very data-dependent; we were running the analysis very, very many times; and each computation required a different collection of input data. We designed the application to be—you know—resilient to missing data, but then on each particular invocation—and this was a negative aspect of what was going on—the application was completing, but it was difficult at times to determine whether the science output was actually wrong or whether the application was in fact performing correctly given the lack of input data. And so at scale, determining if a particular computation is scientifically valid can be very difficult. A little less interesting—arguably—is: when you’re doing so many things inside the cloud, with all the data movement, you can lose track of what’s being held where and what’s being replicated by mistake, and so we had many, many files, and we were not sure where they were at particular times. Small times large equals large, especially economically, but in particular, these last two were really indicative of what was going on. We had a particular analysis code that we designed to be correct, and then we ultimately ran it, for this particular computation, about seventy thousand times, and it was very difficult at times to track that and figure out what was going on.
This particular application was run seventy thousand times—we had a collection of auxiliary programs as well—and the cluster that we had inside of Azure doing all the data analysis was just a normal Windows head node. Ultimately, to do the analysis, we had five hundred and eighteen thousand jobs, and keeping track of five hundred and eighteen thousand jobs was not the easiest thing in the world to do. The head node kept track of them, but for us—the humans—that was a little more difficult. And then another thing—this sort of falls into the category of: be careful what you wish for—because we had a visualization system; we hooked up all the plumbing; and it was basically working—we were able to visualize the stuff that we couldn’t before—and so that was very nice. And so we just said to ourselves, “Hey, it’s working. Clearly, we can just throw as much data as we want at it, because it’ll just keep working.” And so we got into this phase of just throwing more data at it, and then we sort of said to ourselves, “What do you mean? How can we not visualize the entire Earth at one kilometer in less than five minutes?” It went from a prototype—small-scale, working really nicely—and then we just sort of made the huge leap into: okay, let’s do the Earth—and what’s going wrong?

So we have a number of things that we want to address, and that we’re starting to do now. And there’s a theme here: what we need to do next to really get this huge system working—it’s really working nicely, but perhaps unpredictably with regard to duration—is to figure out what’s going on and be able to predict how long certain things are gonna take. So there’s a whole category of interactive dialogues we want to build into the system, and the theme of them is essentially: do you want to continue? So for example, we want to facilitate the scientist being prompted with: you’re about to request something that’ll take two hours; here’s something that we could generate for you in a minute that’s very close; is this something that could satisfy you while we go off and do the two-hour computation? Because at scale, sometimes it really takes that long. So matching the intentions of the query to the system with ones we’ve already done would be a nice thing to do. We also want to be able to predict how long certain things, in general, take. That computation I was checking at the start—I’ll put it up at the very end—I was checking in part because I did not know how long it was gonna take, and I had to see whether it was actually gonna come up in time for the talk. And so we want to be able to compute how long certain things will take prior to their being executed. We also want to connect the visualization back to the data analysis in the first place. Notice we had these two decoupled systems, where we did a whole bunch of data analysis and then, separately, the visualization. So we connected everything, and now we want to be able to drive from the visualization. In effect, the user wants to look at this particular data, and we want to be able to, at that time, come back to our dataset and say, “Well, we don’t have that, and so if you really want to do that, it’s gonna take us about four hours to go back into the raw data.” We also want to do this because I’m a particularly well-known cheapskate, I guess.
If you’re willing to spend a certain amount of money, then we can do it much quicker—we want to fully connect the economics of the cloud with the scientific aspects. Again, perhaps I’m just a cheapskate, but I’m really interested in doing that. And so the summary of the system and the talk is that partnerships between the domain sciences and computer science—I put them up here as both challenging and awesome. Youngryel is doing all the science; we have all the knowledge—or we pretend to have all the knowledge—of what’s happening inside the cloud, and making the connection between the two is particularly … it’s a good time for this—Roger was talking a little bit about this. It’s a particularly good time; things are really cool right now; we can really hook things together. We tried to do this in both the visualization system and the data analysis system. And I guess lastly, oddly, we’ve found that the laws of physics still apply, even at seemingly infinite capacity or scale. I guess we could have predicted that, but we went from very small-scale visualization to very large, and we sort of thought it was gonna work, and the laws of physics were getting in the way. Trying to get our heads around that and presenting it as useful information to the end user might be something that we need to do next. And so I guess lastly, I’ll just come back to the running system to show you that it’s out there now. Again, this is a fairly detailed granularity; we can dive in a little more, but this is the live system right now—this is Internet Explorer, and that’s the URL that it’s sitting at right now. And again, you can sort of visualize—this is 2012—what’s happening with the Earth as a result of the analysis of the MODIS satellites around this time. So thank you for your attention; I’d be happy to answer any questions you might have.

[applause]

>> Wenming Ye: Any questions for Marty?

>> Marty Humphrey: Geoff … not Geoff, no. [laughter]

>> Geoff: So I … what other datasets go into doing the calculation besides just the MODIS images?

>> Marty Humphrey: I’m anticipating that you don’t want me to say none. [laughter]

>> Wenming Ye: Marty, can you repeat the question please?

>> Marty Humphrey: Geoff is asking a very good question that, unfortunately, I don’t know the answer to. The very good question is about the analysis that is shown visually here—I’ve presented it as if the analysis is only using the MODIS satellite data as input, and so Geoff was asking me what other data is used as input. I don’t think any, but unfortunately, I think Geoff thinks that there must be something else. And he’s, of course, probably right.

>> Geoff: Well, but you would …

>> Marty Humphrey: I know what data’s being sucked from what FTP servers, and I know what data’s on the system when the analysis job runs—I think it’s all from the satellites.

>>: [inaudible] distance, yeah.

>> Kristin Tolle: What I’d like to … is this on? Can you … is this …

>> Roger Barga: It might be a broken one.

>> Kristin Tolle: Oh, okay. Well, here …

>> Roger Barga: Here, use mine.

>> Kristin Tolle: So I’m gonna hop in here and help you answer that question, Marty. [laughs]

>> Marty Humphrey: I appreciate that.

>> Kristin Tolle: So …

>> Wenming Ye: Introduce yourself first.

>> Kristin Tolle: Oh, I’m sorry. I’m Kristin Tolle; I’m with Microsoft Research, and actually, I just released FetchClimate last week to the public at http FetchClimate—all one word—dot o-r-g.
So if you want to test this out yourselves, you can actually use your own instance. The instance that Marty’s showing here is one that they’ve deployed locally on their own Azure service. You can also go out to our website and download a copy and deploy one—just as Marty has done—because in the release that we did last week, there’s also a deployment system for you to get from that site. Inside the standard web service, there are dozens of different datasets available to you using the web service version, but if you want to have your own private instance, you have to load up your own data, which is what, essentially, this project is doing—which is why MODIS is the only one in this instance—but if you want to use any of the other systems, like NCAR, you can actually go out and use the FetchClimate dot org one and see what’s available to you there. Thank you. Sorry.

>>: You’ve downloaded imagery from the satellite; is that right?

>> Marty Humphrey: The MODIS raw data is stored on a couple different NASA FTP servers. We pull that into Azure.

>>: Do you have any idea what’s the delay from the data being acquired to the data being available? [laughter]

>> Geoff: There’s a system with the MODIS data that you can get it within about an hour.

>> Marty Humphrey: Is that right?

>> Geoff: Yeah.

>> Marty Humphrey: Interesting.

>>: Though, because of the cloud cover and things like that, you have to … it depends on what you’re looking for.

>> Geoff: Right. But there is a mode—it’s through what’s called the LANCE system—that if you’re really interested in real-time …

>> Marty Humphrey: But within an hour …

>> Geoff: … sort of acquisitions, you can get it within about an hour of acquisition. Now, when you’re getting it from the NASA archive, it’s typically a couple of days. So the answer really depends on whether you’re trying to do something in real time, which you’re not. But if you really want to know the rate at which the Earth is sweating now, [laughter] then you can do that.

>> Marty Humphrey: You don’t have to go down to my level of discourse but, you know …

>> Geoff: Right.

>> Marty Humphrey: … I appreciate that. You don’t have to refer to the Earth sweating if you don’t want.

>> Geoff: It’s fine. It’s a good term. [laughter]

>> Wenming Ye: We have another question from Scott?

>> Scott: Yup. In your workflow, you had a switching between the HPC and the cloud … bursting out to the cloud. Was it a local HPC cluster that you had?

>> Marty Humphrey: It was, it was.

>> Scott: You had the head node in the cloud?

>> Marty Humphrey: Right, so our first creation of it used ad hoc, new Azure mechanisms—this was 2009. About a year after that, we said, “Hey, a lot of what we’re doing manually, we could get the Windows HPC cluster software to do,” and so we re-architected it for that. We ran that locally within our enterprise, and then the fine folk at Microsoft released a new version of that software that was able to run directly in Azure. And so at that point, we just started running the head node within Azure, and then we would sometimes call back into our enterprise resources.

>> Scott: So is it separate, or is the head node actually calling the local HPC cluster?

>> Marty Humphrey: We have different variations of it …

>> Scott: Or is it parallel?
>> Marty Humphrey: We have different variations of it, and in our most recent one, the head node is sitting in Azure and it’s just using compute nodes in Azure, so we really have no need to come back to our rapidly-aging hardware within our enterprise.

>> Scott: Oh yeah.

>> Wenming Ye: We have another question from Tanya?

>> Tanya: Can you put the slide back about query runtime estimation? No, one before that. Yeah. So …

>> Marty Humphrey: Right.

>> Tanya: … how generalizable is that to other domains? Because—from the previous talks—data science is a big part, and scaling up is a big part of science, and so estimating query runtimes, I expect, is gonna be a big issue, and giving that kind of feedback to scientists. So how do you do that? Can you talk a little bit about how you do that now?

>> Marty Humphrey: I agree, generalizing the duration of scientific experiments at scale is insanely difficult right now. The only way we can attempt to do it is: given a detailed understanding of the domain and the data—the raw data that’s being stored—we’ll be able to figure out how many blocks are retrieved from Azure storage, where they are, the resource selection policy within Azure, that sort of stuff—a very detailed, admittedly very finite analysis. I agree, it’d be very nice to be able to generalize this, but …

>> Tanya: Computational complexity on the cloud, you know?

>> Marty Humphrey: Yeah. Again, there’s a very general theme that there’s so much capacity that you can sort of do stuff, and now you actually want to predict how long it’s gonna take, and that’s incredibly difficult right now.

>> Wenming Ye: Okay. Thanks, Marty.

[applause]
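For concreteness, the kind of per-query duration model Marty describes—count the storage blocks a query will touch and divide by the available parallelism—can be sketched in a few lines. Every parameter below is an illustrative assumption, not a measured FetchClimate figure:

```python
def estimate_seconds(n_cells: int, blocks_per_cell: float,
                     secs_per_block: float, n_workers: int) -> float:
    """Back-of-the-envelope query-duration estimate: blocks touched,
    times cost per block, divided by worker parallelism."""
    total_blocks = n_cells * blocks_per_cell
    return total_blocks * secs_per_block / n_workers

# Hypothetical numbers: a 360x180 grid, ~12 storage blocks per cell,
# 50 ms per block, spread over 200 cores -> roughly 194 seconds.
print(estimate_seconds(n_cells=360 * 180, blocks_per_cell=12,
                       secs_per_block=0.05, n_workers=200))
```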