>> Vani Mandava: So our next session is something that I'm sure all of us
have had to deal with, every single one of us in this room. In all the
research that we do, we have to deal at some point or another with how to
extract information from data.
Our first speaker is Shahrokh Mortazavi. Shahrokh is a partner program
manager at Microsoft. He currently works in the developer division of
Microsoft on the [indiscernible] tool chain. Previously he was in the
high-performance computing group. He worked on the Phoenix compiler tool
chain, which is codegen, analysis, and [indiscernible] tools, at Microsoft
Research. And before that, for ten years, he led Sun Microsystems'
code generation and optimization compiler [indiscernible] teams. Welcome,
Shahrokh.
>> Shahrokh Mortazavi: Thank you. All right. My goal today is to keep Ed
Lassiter awake. If I can accomplish that -- no. He's my mentor.
>>: [Indiscernible].
>> Shahrokh Mortazavi: He's my mentor. Everything I know I learned from him.
So originally I was going to give a talk on the whole workflow of E-Science
and data science. Then I said, you've only got 15 minutes. Pick a slice of
it and go from there. So I hope to show you guys that, hey, Windows is a
decent platform for doing data science and E-Science. That's really the
high-order bit.
I looked around on the net. There were a whole bunch of really cool workflows
for data science and E-Science. And I looked at the list of developer tools
that they mention. Here's one of them. You see Excel, Java, Python, R, Weka,
Hadoop of one flavor or another, Flume, and D3.js -- everything.
So I looked at some more. At some point they basically started repeating the
same tools in programming languages, visualization, big data, MapReduce
stuff, storage. So the buckets would basically be gathering data, cleaning,
munging, storage-related stuff, analytical tools, the languages and
scripting, which is mostly what I want to talk about, and, you know,
productivity things like IDEs and so on. And of course for this audience,
publishing and sharing is probably up there.
The very first one was Excel. You know, love it or hate it, we just did a
survey of a whole bunch of data scientists, and Excel was one of the top
tools that they still use. And there's a love/hate relationship. It's
everywhere. It's got pretty decent import and export support. It is
scriptable. You can write UDFs if you want, for running your own code.
There is a web version of it now that's pretty good. If you've been using
Excel spreadsheets, you should really take a look at this. It's pretty
decent, but it's got some warts, too. When it comes to extremely large
inputs, this is not your tool.
Version control has always been an issue. How do you version this stuff? Do
you just change something in the name, or do you use Git and save it as XML
and do something else? And if you want to do anything, any logic behind it,
VBScript has pretty much been your only choice up to now. And the functional
library, when compared to all the stuff that's available in SciPy and other,
you know, R-related libraries, is somewhat limited.
But there's some great add-ins coming up that I think some of our colleagues
will demo later that give it a whole new take on analytics that you might
still find useful as E-Scientists.
One of the things that people really dislike is the fact that you're stuck in
VBScript. These days, there are a number of libraries, such as PyXLL and
Pyvot, that allow you to extend what you can do in Excel as a UI,
tabular-focused tool and use stuff like Python with it.
So for that, I'm going to do a quick demo just to show you.
This is, by the way, PTVS, Python Tools for Visual Studio. It's an
open-source plug-in for Visual Studio that gives you C# level support inside
Visual Studio. You can download it and stick it in VS. It's free and does a
bunch of magic for you.
Specifically here I'm going to do import excel. And let's see: excel dot
view, and I get nice completions and so on. Let's say I do a random hundred.
And here you are: there's a live bridge between Excel and Python. Do what
you want in Python that Python does well. Do what you need to do in Excel
that Excel does well. Use the right tool for the right things.
Here, let's say you want to insert a chart. Why don't we have Excel
recommend a chart? What do you say, Excel? Ah, nice, a scatter plot. There
you go. So you can essentially create a numpy 2D array, a 2D matrix, and
then zap it onto Excel. Do a bunch of filtering and so on, and then live
objects show up back in Excel.
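A minimal sketch of that kind of bridge, assuming the Pyvot library that
shipped alongside PTVS (its module is named xl; exact function names vary by
bridge and version):

    import numpy as np
    import xl                     # Pyvot: the live Excel <-> Python bridge

    data = np.random.rand(100)    # "do a random hundred"
    xl.view(data)                 # push the array into a live Excel sheet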
You could also do things like, let's say, if I just want to show that within
Visual Studio -- there it is. So you can actually have inline graphics.
This is using the IPython back end to actually send the code to a Python
interpreter and get its response back, which could be text, it could be
visuals as PNGs or videos or whatever, and display it right there inside
Visual Studio. Okay.
So let's see. Let's go back to the slides. So Excel: the pain points can be
addressed, is the key message here.
Moving over to scripts and languages that you might use for E-Science, you
know, there's a whole bunch of scripting languages. Python and R are the
dominant ones in this zip code. M, in MATLAB, if you are a MATLAB user -- or
maybe an open-source version of it, Octave, might be your thing. Julia is
coming up as a better version of M and Python combined, definitely faster in
terms of execution speed.
And as far as standard languages, the go-to languages are Java, C++, still
some Fortran. And if you're on Microsoft, you definitely want to check out
C# and F#, which, by the way, run very nicely on Macs and Linuxes now.
There's a new compiler infrastructure called Roslyn. Essentially it's a
rewrite of the C# compiler family, and now it's open source and available
on CodePlex, and the good folks at Xamarin are making the whole thing
available on Linux as well, so you can even write your Android or iPhone app
in C# and work in that environment.
So speaking of Python, which I work on -- so a little bit of bias here.
What's the quality and availability of the tool stack on Windows? I would
say excellent, with a little asterisk. If you're installing Python libraries
that rely on C++ for parts of it and you don't have the compiler set up, it's
a painful surprise. So for that, I suggest you stick to reputable distros
such as Anaconda and Canopy.
These are some screenshots of some Python IDEs that are available out there.
My favorites, surprisingly, are Python Tools for Visual Studio and IPython
from Fernando and Brian and Min [phonetic] at Berkeley and Cal Poly San Luis
Obispo, I believe.
These are cross-platform: PyCharm, Wing, Komodo, PyDev. So if you are in an
environment where you need to work on Mac, Linux, and Windows, forget about
Visual Studio. Go ahead and use one of these cross-platform ones and keep
your life uncomplicated.
As an E-Scientist, the core setup that you need is essentially a CPython
interpreter, 2.7 for maximum compatibility or 3.x for the latest language
features.
SciPy and NumPy are your friends, and pandas now is a library that
essentially gives you the data frame capability that R provides.
And then IPython Notebook, which I'll demonstrate later.
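A minimal sketch of that core stack in action; the column names here are made
up for illustration:

    import numpy as np
    import pandas as pd
    from scipy import stats

    # A pandas DataFrame plays the role of R's data.frame
    df = pd.DataFrame({"site": ["A", "B", "C", "A"],
                       "flow": np.array([1.2, 3.4, 2.2, 0.9])})
    print(df.groupby("site")["flow"].mean())   # R-style split-apply-combine
    print(stats.describe(df["flow"]))          # SciPy summary statistics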
Like I mentioned, SciPy -- just to give you a sense in case you're not
familiar, Python has pretty much become the lingua franca of a lot of
scientific stuff. So whether you need to do wrapping of code and calling
into native code, it plays well with that. Whether you want to call out to
MATLAB code, visualization 2D and 3D, optimization libraries, parallel
computing -- there's mpi4py if you want to do MPI programming. And
particular domain libraries: if you're in astronomy, if you want to do basic
statistics, biology. I mean, let's take one of these groups, say
biopython.org: there are very large, active communities that have very deep
and complete domain libraries. So Python is your friend.
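For instance, a minimal mpi4py sketch, assuming an MPI runtime is installed
(run with: mpiexec -n 4 python hello_mpi.py):

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    # Each MPI process (rank) prints its place in the communicator
    print("Hello from rank %d of %d" % (comm.Get_rank(), comm.Get_size()))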
Next is R. In the survey we just did, the one I mentioned, the two languages
that pop out more than anything are R and Python. So again, there's a
high-quality implementation available on Windows, a single-click install.
Very active community. Lots and lots of libraries that you can use. And if
you're doing a lot of stats, it's still the number one tool, the go-to tool.
Visualization is also probably still its forte over Python.
In terms of IDEs and environments, the IPython Notebook that I mentioned has
been extended to support R directly. You can also use an IPython magic
command, essentially a percent-percent prefix, to call out to some foreign
language, and it will bring the data back in. So again, you can stay in the
browser.
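A sketch of what those notebook cells look like, assuming the rpy2 extension
that provides the %%R magic (df and coefs are made-up names):

    # In one notebook cell, load the extension:
    %load_ext rpy2.ipython

    # In a later cell: push df to R, fit a model, pull the result back:
    %%R -i df -o coefs
    fit <- lm(y ~ x, data=df)
    coefs <- coef(fit)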
And there's RStudio, which again is cross-platform. If that matters to you,
stick with it. It has inline graphics just like PTVS. And Revolution
Analytics is another company; if you need an 800 number to yell at when
things go wrong, they have an enterprise version that's available.
Fortran, yes. Believe it or not, a lot of people are still doing it -- yes,
there's one right there. It's not dead yet, and it won't be.
If you're doing number crunching and you want the absolute best compiler
optimizations for your inner, inner, inner loop, and the best libraries for
FP crunching in general, if you want to do MPI, if you want to do OpenMP,
it's still the go-to language.
What's the availability on Windows? It's pretty good. Not as many options as
there are on Linux, but the two big ones are the Portland Group's PGI
environment and Intel's Fortran. Both of them, if you are in Visual Studio,
plug into the VS environment and provide you the full goodness and sweetness
of debugging inside VS.
Then there are the math libraries that sit underneath all of this stuff. The
commercial ones are MKL from Intel and ACML. There's Rogue Wave, and the
Visual Numerics libraries that became Rogue Wave's .NET libraries.
But essentially, all the important ones that are available on Linux are
available on Windows as well. And of course, R and NumPy and SciPy have
their own set of stuff they can directly tap into for a particular domain.
So essentially, the full mathematical and numerical stack that's available
on Linux and Mac is available on Windows.
Next, as a person that's slinging some code for your models and simulations
and so on -- especially since I see a lot of Macs here and I assume some of
the other machines are running Linux, you know -- if you're coming to
Windows, you want to bring your tool set, right? You don't want to go learn
a whole bunch of new stuff, especially if your stay in Windows is short,
right?
So Vim, Emacs, Sublime, whatever -- all the top, you know, five or six
editors that people use have high-quality implementations available on
Windows. So when you hop over to Windows, you don't have to do a massive
mental shift.
There are a bunch of lightweight IDEs that come with these languages, like
PyScripter and IDLE for Python, that you can use directly with pretty much
any distro. And there are some very high-quality web-based IDEs coming up,
like Visual Studio Online. You can use it from any browser, any OS, do your
editing, do your Git, and push stuff to Azure if you need to; all the core
features are there.
And there's a whole family of full IDEs, such as Eclipse and PyCharm and
Komodo and Wing, that are cross-platform. Again, if that matters to you, go
with them. If you are mostly on Windows, check out PTVS. Very powerful
environment.
On R, RStudio, again, is a beautiful environment that runs everywhere.
There's the Revolution Analytics one that I mentioned.
And for Fortran, Intel and PGI have very nice plug-ins for Visual Studio as
well.
And if you're doing .NET stuff, Xamarin Studio again is your friend. It runs
everywhere and gives you a whole bunch of support for environments you
wouldn't normally expect, such as the iPhone.
The up-and-coming quote/unquote IDE that everybody, if you're in E-Science,
should check out is IPython, which I briefly mentioned. Think of it as a
super REPL that gives you history, that gives you completion, that gives you
syntax highlighting. So over here, this is the terminal version. This is
the Qt console that runs everywhere. And this is the notebook version of it
that runs in any browser. Basically, you have two -- I think I should just
demo this.
So here's one that I quickly set up, or you can set up -- let's say I search
for IPython Azure. Let's see what shows up. Oh, great. So if you go on
Azure, you can very quickly set up an IPython Notebook using a Linux or
Windows VM -- again, because Python is cross-platform, it doesn't matter.
All the software is available.
A few commands, and you get yourself an IPython Notebook. The interesting
thing to note here is that you've got MathJax rendering of your formulas.
You've got code cells. You've got markdown cells, and this stuff gets sent
to a Python interpreter. It could be an R interpreter. It could be Julia.
Doesn't matter. And whatever comes back gets nicely captured and displayed
for you.
So here's a machine learning demo that basically is fed a bunch of faces,
then trained, and then we see if it can recognize some new faces thrown at
it. I click here to create a cell. It's just your regular Python. I just
did a Shift-Enter. Let's say I say x equals linspace of zero to 5; give me
20 samples. So then I can do x squared. Shift-Enter. Maybe even plot it: x
versus, let's say, sine of x squared. This looks ugly. Let's get more
samples. Much better.
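That demo, written out as plain code (each statement run in a cell with
Shift-Enter); the sample counts are the ones from the talk:

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 5, 20)      # 20 samples: the curve looks ugly
    x = np.linspace(0, 5, 200)     # more samples: much better
    plt.plot(x, np.sin(x ** 2))
    plt.show()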
So the beauty of IPython, whatever language you end up using, is -- I was
actually going to bring my Mac and do this on the Mac using Safari talking
to a Linux back end on Azure, just to make sure there's no Windows involved,
but I have a PTVS demo to run. Versus installing a whole stack -- install
Windows, update this and that, .NET update, whatever -- the installation now
is: here's a URL and a password. You're done. I already set up everything,
maybe even provided you with a single-click VM install of everything you
need for your particular scenario.
And on top of that, there's a whole set of outputs that you can capture from
IPython that allow you to get a static HTML version of your notebook, such
that people can just look at it. So is this a Ruby notebook? Is it Julia?
Let's see. Here's one that I can just flip through. So this is an HTML
capture. Again, these are markdown cells; this is executable code that I
can Shift-Enter on, I can move around, I can edit. But basically, the whole
concept of an executable paper that you can exchange as a URL, where I can
take your algorithm and then run it against my data -- this is the closest
that people have come that's available for everyone: open source, free, and
with no weird gotchas. Okay. So that's IPython.
So Python Tools for Visual Studio, I talked about it a little bit already.
The thing to mention is, if you're teaching, you can pick up the Express
edition. It has full Python support in it, for web programming and for
doing computational stuff. And when I say Python support, I mean any
interpreter, whether it's CPython; IronPython, which is Python on .NET; or
Jython, which is Python on the JVM; et cetera. It's got cool features like
remote debugging on Linux.
So we have a lot of customers that actually develop on Windows. They find
development on Windows to be productive, and they actually run their code on
a Linux farm. So this allows you to actually set a breakpoint right there
on your server and then have the goodness of Visual Studio right there.
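A minimal sketch of how that remote attach works on the server side, assuming
the ptvsd package that PTVS uses for remote debugging (the secret string is a
placeholder; API details vary by version):

    import ptvsd

    ptvsd.enable_attach(secret="my_secret")  # listen for Visual Studio to attach
    ptvsd.wait_for_attach()                  # optionally block until the IDE connects

    # ... server code; breakpoints set in VS now hit here ...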
I sort of mentioned the integration of IPython. You can, say, select a bunch
of code, send it to the REPL, and then you get nice graphics. So you can do
very nice interactive development of your algorithm. You can switch to the
editor and set a breakpoint -- let's say, if my conditions are met, good,
then print my process ID and machine -- or put a Boolean condition there and
then run the code. Maybe putting a breakpoint in a loop was not a good idea.
But anyway, you can use the various -- so essentially, look at it as a
MATLAB-like environment, but built around Python and all its libraries,
inside Visual Studio. Right? Everything you do in MATLAB, with a prettier
language. And free. It is a pleasant language. I mean, MATLAB is great. I
love MATLAB, but boy, everything better look like a matrix if you want to be
happy.
Another cool thing: it does mixed-mode Python and C++ debugging. This is
the only environment on the planet that does that, as far as I know. Python
is interpreted. It can get slow. You run the profiler inside Visual Studio,
it shows you where things are falling on their face, and you rewrite that
stuff in C++.
The good thing is Python plays well with other languages, but you know, as
you go multi-language, problems start, and you need a debugger that knows
both environments. In fact, if you look closely here -- do I have a laser
pointer? So this is Python; that is not Python, that is C. That is a
mixed-mode stack. And if I had done this properly, you would have seen that
this is a Python, C++, Python stack right there.
So if you have written Python C extensions, you will probably get very
excited.
Oh, that's the other thing. What [indiscernible]? Look at that. So you can
actually run your code on a Linux server and then just remote attach to it
and debug it and [indiscernible] your code that way.
So the next one in that bucket was visualization. Again, I would say things
are improving on Windows. The workhorses are obviously ggplot2 and R's base
graphics, and then for Python, matplotlib for 2D graphics, and it's got some
support for pseudo-3D graphics. More modern stuff is Bokeh from Continuum
Analytics, which essentially gives you very nice dynamic -- actually, I
think I made this a link. Let's see if that works. Wow, it did. Even in IE.
So, dynamic in-browser charts, which could, by the way, be embedded inside
IPython notebooks. Okay. So you can imagine IPython notebooks where you
have something you want to run -- say, you know, some progression of some
disease versus year -- and it can give your user a slider that they can
actually move with their mouse, and it changes the charts dynamically. And
there are things like Mayavi that provide you with nice 3D support. Okay.
As a developer, one other issue that comes up a fair amount that I wanted to
touch on: you're a UNIX guy, you're a Mac person, and somebody says you need
to go build this for Windows, and then you get some C++ code and say, oh,
man, I don't want to go and learn PowerShell and this and that and Visual
Studio. What's the minimum number of calories I can spend to go into the
lion's den and rush out?
So first I would say, if you end up doing this repeatedly, I highly recommend
you install Visual Studio -- it's not that bad -- and learn PowerShell,
which is a really good command-line shell, somewhat bash-like: it's got
pipes, it's got redirection, and instead of just sending text around, you
can send objects around -- .NET objects. [Indiscernible] a spreadsheet in
your pipe and do an inspection on it.
But if you don't want to do that, there are two nice options -- nice in
quotes. One is Cygwin, whereby you take your source, you bring it over to
Windows, and you install the Cygwin environment -- which, by the way, will
give you bash and a bunch of Linux utilities, so you feel right at home with
a dollar prompt -- and there's a DLL that intercepts POSIX calls and turns
them into Windows API calls. So that's one way.
The other one is MinGW. Again, both of these are based on GCC, but this one
truly gives you Windows binaries. Instead of having its own runtime like
Cygwin, it actually talks to the Visual C++ native runtime, so you get true
Windows binaries.
So those are two ways that you can minimize your learning calories if you
only do this occasionally: create a Linux-like environment temporarily,
build your stuff, and get out.
Cloud. Again, I don't think there's enough time to talk about this stuff,
but basically the new world is: pick an OS, any OS; pick a language, any
language. I don't know if you can read this, but here it says -- even I
can't read it. Ubuntu, CentOS, a bunch of Linuxes that you can choose from,
a bunch of Windows OSes. In the languages, I think .NET, Java, Node.js,
PHP, Python, Ruby, and a bunch of mobile and media stuff. So whatever
flavor of OS or language you want, it's available up there. But the most
important thing is the SDKs: for every language there's a nice SDK that
installs on Mac, Linux, and Windows. And you can use it to call the APIs.
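For example, a sketch of the Python flavor of that, assuming the Azure SDK
for Python of that era (the account name, key, container, and file names are
placeholders):

    from azure.storage import BlobService   # older Azure SDK for Python API

    blobs = BlobService(account_name="myaccount", account_key="...")
    blobs.create_container("results")
    # Upload a local file as a blob in the "results" container
    blobs.put_block_blob_from_path("results", "run1.csv", "run1.csv")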
And for the rest of the stuff, which I don't have time for -- big data, big
compute, analytics -- there is a solution available. HDInsight is our
spelling of Hadoop.
So I went through all the tools. I chose just these two, but I took three or
four of these lists, went through every single one of them, and did a check
on their website to see where they are in terms of availability. Actually,
even this one should have a blue check mark.
Green means it's available and pretty good. Blue means it's not available,
but there's an equivalent available on Windows somewhere.
And all of those were available except AWS. So replace that with Azure.
So if you want to do E-Science on Windows, your core stem-cell stuff -- you
know, your tools and languages -- all of them are available.
My recommendation: stick with OSS tools, especially if price and sharing and
community, all that stuff, matter to you. And use the proprietary stuff,
whether it's Power BI, Excel or whatever, if it gives you a distinct
advantage. Strange coming from a Microsoft guy, but that's my
recommendation.
And I think that's it.
[Applause]
>> Vani Mandava: Any questions? We have time for a couple of questions.
>>: You mentioned [indiscernible] from Excel. Seems that you could do
[indiscernible] and one of the things that [indiscernible] statistical
analysis [indiscernible], the advantage of having [indiscernible] for every
column, for every variable. Just was thinking that Excel would have
[indiscernible] that metadata [indiscernible]. You would be able to use a
lot of the R models and [indiscernible] models [indiscernible].
>> Shahrokh Mortazavi: Yeah. I know that, just like the Python bridges,
there are some R bridges between R and Excel. And Excel has a full COM API
that you can use to take stuff out, munge it around, and stuff it back in.
Even if core Excel doesn't support something you want, if you have these two
working together, I'm sure there's a way to make them live under the same
roof -- maybe sleeping in separate beds, but you know. I would have to
understand your case specifically, but I've seen amazing stuff done.
There's a gentleman doing a Power BI demo, I believe, so [indiscernible]
show you some of the stuff you can install to extend the core features of
Excel. So my guess is it's possible. I would have to understand your case
much more. You had a question?
>>: So one of the [indiscernible] is not so much the beauty of the language
but the availability of --
>> Shahrokh Mortazavi: Toolboxes, yeah.
>>: And that currently is just well beyond what R does and what's available
in Python. And so I guess to what extent does almost everything you say also
apply in MATLAB?
>> Shahrokh Mortazavi: So I would say the slope of activity in MATLAB is
like this, and in Python it's like that. You know, Python is catching up at
a rapid rate, and if you look at scikit-learn, if you look at the stuff like
I showed you in the topical software -- if you're doing Simulink stuff,
signal processing stuff, you know, MATLAB is definitely ahead. But the new
generation of researchers and students and so on, where they're putting
their calories, as I've seen, is not in MATLAB. They're writing new
libraries. They're contributing to these libraries, improving their quality
constantly.
So I would look at where Python was five years ago, ten years ago, and look
at where it is now. And MATLAB is still a beautiful, well-integrated, nice
environment, you know. I wouldn't say that there's an equivalent of that in
Python. I would say probably PTVS is the closest thing that's trying to do
that, but that's Visual Studio for engineers -- that's what MATLAB is.
We're coming from an environment that's Visual Studio for the general
programmer, and we're trying to crowbar it into an engineering tool.
But richness of libraries? Yes, there are some things that are better in
Python, and there are some things that are better in MATLAB. I think the
rate of improvement in the Python world is greater.
Not sure if I convinced you, but that's what the data we've seen shows.
Question?
>>: [Indiscernible] good for scientific computing. I've heard some people --
I'm obviously using my Mac -- [indiscernible] people using the Surface Pro
to develop, especially in distributed systems, because, you know, a lot of
it is just [indiscernible] machine and you don't need that much power where
you are. What are your thoughts on that? Are you guys trying to push that
as something in that sort of developer realm? You plug it in, you go to a --
>> Shahrokh Mortazavi: Very good question. Basically, it was, hey, I've
seen people that use a Surface Pro and SSH to whatever they need and do
their -- essentially use it as a lightweight terminal. Absolutely. We're
trying to go one step beyond that. That's why we put a lot of effort into
Visual Studio Online, which is a lightweight IDE with full Git support and
IntelliSense and syntax highlighting for many languages, and you can have
your projects, you can track your bugs, you can, you know -- open in Visual
Studio when you need to.
So we understand that the new world of pick-an-OS, pick-a-language means I
might be coming very much from a non-Windows environment, right? So I need
to be able to, from a Surface or my MacBook Air, which I do from home, you
know, just go online, and my world is in Azure. So that's a very legitimate
way of development.
It's just, what happens when your code gets really complicated and you need
to do profiling, you need to do debugging, or you have got 800,000 lines of
code? We were actually at PyCon, and one of the banks there told us they
have 3,000 developers and 60 million lines of Python code. Like, whoa.
So there are different people with different needs, but the one that you
described is definitely a growing audience.
>> Vani Mandava: Thank you, Shahrokh.
>> Shahrokh Mortazavi: Thank you.
[Applause]
>> Vani Mandava: Our next speaker is Geoffrey Fox. Geoffrey Fox is a
distinguished professor of informatics and computing and physics at Indiana
University. He's the director of the Community Grids Lab, associate dean
for research, and director of the data science program at the School of
Informatics and Computing. He previously held positions at Caltech,
Syracuse University, and Florida State University. He has supervised the
Ph.D.s of 66 students and published around a thousand papers in physics and
computer science. He is a fellow of APS and ACM. His work currently
involves applying computer science to bioinformatics, sensor clouds,
earthquake and ice sheet science, and particle physics. The title of his
talk is Comparing Big Data and Simulation Applications and Implications for
Software Environments.
>> Geoffrey Fox: Can you hear me? Thank you very much. Let me just check.
[Indiscernible] my ideas are sort of biased by parallel computing and
simulations moving to big data. And I often comment that if you talk to
people who do high-performance computing and simulations, they may argue
about exactly what the right chip is, but the broad principles of what it
takes to build a good supercomputer, I think, are understood and agreed.
My impression is that's not true in the data science or big data area. So
what I wanted to do is start with some work which [indiscernible] was very
instrumental in making happen, which is this big data activity which started
in the fall. And as part of that activity, I was leading the so-called use
case or requirements activity, and so we invented a template with 26 fields,
and we bugged people to fill it in, and we got 51 applications, which at
least the people who filled them in felt were big data. Whether they were
is not quite clear.
And they covered these areas: government operations, commercial, defense,
healthcare, deep learning, ecosystems, astronomy, physics, the various earth
and environmental [indiscernible] sciences, and energy. And I should note
that the conclusions I will draw from this will be somewhat biased, because
if you look at the big data cycles associated with these applications, they
don't sort of average out to the way big data is processed on real
computers, because we only have one thing called search and one thing called
Netflix, and those are representative of a few applications. So it's
slightly -- you have to bear that in mind in what I say.
So this is a little more detail. These actually have all the 51
applications listed, each of which has these 26 features. There's an
excellent website; you can just type this plus big data, and you'll find
that website. And it has online all these use cases, all the analysis of
the use cases, and all the forms they filled in. And as I say, it's
somewhat biased toward science, partly because the Department of Energy and
NASA put a lot of pressure on their labs to respond to my request for use
cases. NSF did not put any pressure on their investigators. All right.
So as an illustration of the type of information that was gathered, here is
just a screenshot of a summary which has these magic Vs -- volume, velocity,
and variety -- plus the software used and the analytics algorithms used, for
six of the use cases. And you can go and browse all the use cases and get
that information.
>>: [Indiscernible].
>> Geoffrey Fox: I will point out that these were not spell-checked and are
full of illiteracies and probably incorrect statements, and except where the
correction was obvious, I did not correct the data. This is raw data. It
has to be correlated somewhat.
As well as this, Bob Marcus, who was the lead of [indiscernible], suggested
ten types of applications which come more from the commercial -- how do we
go back? This is not --
>>: [Indiscernible].
>> Geoffrey Fox: Am I missing the -- it's this thing. I think it's this.
So he suggested these ten generic use cases, which largely come from the
commercial data processing world. But all we have is these ten titles; we
don't have 26 features for each of them. So these just sit there as areas
which possibly are not so well covered. We do have some commercial use
cases, like the Mendeley company, who do citation processing -- they have a
pretty interesting entry. We have a generic discussion of financial
operations and things like that. And then we also have, in more detail,
though they didn't have the same profile, some pretty interesting security-
and privacy-oriented use cases covering things like education, the military,
cyber security, and areas like that where security and privacy were
particularly important. So we had 51 rather open use cases filled in at
this level of detail, plus 20 additional use cases, which we have to bear in
mind.
So I went through those use cases and tried to understand them, to try to
draw the equivalent of the Berkeley dwarfs. People may remember the
Berkeley dwarfs, which were a summary of parallel computing from the past,
and so we're trying to capture the essence of these use cases. You can use
the terms kernels, mini-apps, patterns. And I should point out I am
incompetent in databases, so I will focus on the non-database aspects of it.
And if you want to go online, I actually have a slide there; there's a
recorded video going through all 51 use cases, discussing how this is done.
And so let's just remember what happened in parallel computing. We have the
most famous kernel, called LINPACK, which is notable for not being useful
but being used by everybody.
[Laughter]. And we'll come back to the replacement of LINPACK soon. I think
more relevantly, we have the so-called NAS parallel benchmarks. They were
very influential -- I think very important, very nicely done. They
originally were pencil and paper: they specified the algorithm, but
eventually they became actual code which you could run with an MPI library.
And there are also lots of specialized benchmark sets, which NSF used to use
to judge your solicitations, but the last two NSF solicitations have
abandoned those benchmark sets, partly because they're sort of cloud
related. They're not quite certain what the right benchmark set is.
The Berkeley dwarfs -- I'll give you a little bit on that; they're possibly
the most famous -- and Jack Dongarra and others had something called
templates, which is a book or two which he published.
So here are the NAS parallel benchmarks, so we can see what it is. It's
basically libraries of very core algorithms which were buried deep in
simulation codes, plus pleasingly parallel and conjugate gradient sorts of
classes of applications.
We will find, from what we've done in the past, that when you try to
abstract the essence of these applications, it's not possible to do it at
the same level. You have to go all over the place, because there are
different characteristics, and it's not quite clear how to do it. So I call
that facets. We have to look at different facets of the applications.
Here are the Berkeley dwarfs, and they actually came from Phil Colella, who
is a brilliant -- I think he is an applied mathematician. I think he does a
lot of combustion, if I remember rightly. The first six came from Phil, and
then Berkeley added more -- they even added MapReduce.
But again, if you look at this classification, [indiscernible] MapReduce is
not quite in the same -- they're not looking at the same level.
So I don't think that's bad. I'll use it as an excuse for not doing things
at the same level for big data.
All right, so now let's analyze these 51 use cases. One interesting thing
is what they're parallel over -- if they're big data, they're probably
running in parallel over something. And you can see they're running in
parallel over lots of different things. People -- and these are either
people being analyzed or people actually using the system, or both. Items
such as images, gene sequences, material properties, sensors; events such as
astronomical events; nodes of an RDF graph, nodes of a neural network;
tweets and blogs and documents and web pages; files.
And as one of the interesting sources of big data is simulations, those are
parallel over the standard simulation decompositions. So we're parallel
over lots of different things.
So now we try to look at the 51 applications and say what they used. And
again, this is not uniform.
So 26 of them I decided were pleasingly parallel. 18 of them are classic
MapReduce. Some of them are sort of halfway in between MapReduce and
pleasingly parallel, having some relatively trivial global sums at the end
to gather statistics together.
23 of them are candidates for iterative MapReduce, or what I call global
machine learning, where the machine learning is optimized over the many
nodes of a parallel application.
Nine were graph-related. 11 involved data fusion.
And 41 of those 51 involved streaming in some sense, so that was actually
the most common category. So those are what I called the low-level
features.
Here are some other ways of looking at them. 30 of them were classifying
the items which we were parallel over. 12 of them were doing classic search
or query or indexing. Four involved collaborative filtering.
36 of them involved what I call local machine learning. That means you take
your problem, you divide it into parts, and you apply machine learning to
each part. And I think I have an example here.
So here is one of them -- use case 18, computational bioimaging. This is
like light sources. And this was submitted by Lawrence Berkeley Lab. And I
believe they're running complex machine learning or image analysis on each
image taken from this synchrotron source, but I think they're doing it
pleasingly parallel over images.
And pathology, which we also have as an entry -- typically that's done
parallel over pathology images, with incredibly complex machine learning on
each image. So that's what I mean by local machine learning.
Global machine learning, which in my way of thinking is roughly the same as
iterative MapReduce -- that's a little less than half of them -- includes
things like deep learning, clustering, latent Dirichlet allocation, and
things like large-scale stochastic gradient descent, et cetera.
16 of them involved geographical information systems -- the defense
applications and the earth and environmental [indiscernible] sciences
involve GIS, so that's pretty popular. Five of them involved looking at the
data from large-scale simulations; two of them used agents.
So these are just ways of trying to understand the richness of big data.
I gave you that one. So this one -- that bio one was local machine
learning. There's another very talented assistant professor at Indiana
University, David Crandall, who actually works with Judy to analyze image
data by effectively trying to take the world's images and do a global
[indiscernible] or global maximum likelihood that determines the camera
position and orientation of each image, so they reconstruct the world.
So that's a good example of imaging with global machine learning, as opposed
to the previous one, the bioimaging, which was the local thing. Okay.
So here you are, reconstructing all these things automatically. That
currently uses either iterative MapReduce or MapReduce.
That
And that one does actually start off with local machine learning to do the
initial classification of the images, but after it does that, it does this
global machine learning. All right.
So then we can ask about aspects of what's going on here. We often actually
ask the same questions you ask in parallel computing: [indiscernible] per
byte, communication and interconnect requirements, and whether the
application is constant or dynamic -- is it a regular graph or an irregular
graph?
Most of the problems are bulk synchronous processing, because most data
[indiscernible] MapReduce is BSP. Some of them are asynchronous, where you
get shared memory. We already mentioned [indiscernible]. The data
abstraction -- key-value, pixel, graph -- is important. And then you could
also try to work out what core libraries you need, whether it's matrix
algebra, et cetera. And you can also try to work out from these things
issues about where the data comes from: sets of files, Internet of Things,
streaming, simulations, GIS, SQL, NoSQL, et cetera. And, as I mentioned
earlier, there's a class in here which is experimental observations, where
you get this block streaming, where you gather data for a month. The work I
do with people at the North and South Pole -- they take data for a month or
so, and, unfortunately, there are no trucks at the North Pole coming to the
USA, but they can fly it from the North Pole to here.
Okay. So if you look at all those use cases, here are five execution
strategies. Pleasingly parallel, which is probably the major one -- it's
probably half the applications that are pleasingly parallel. That includes
local machine learning.
We can abuse Hadoop to do this, in the language of the first speaker, or we
can use high-throughput computing or many-task tools.
Then there is what you might call classic MapReduce, which is the search
algorithms, including collaborative filtering, where MapReduce is very
efficient. Then there's what we call map-collective, which is where
iterative MapReduce, Spark, and things like that come in, where you have a
map phase followed by a collective phase, and then you iterate that. There
is the class covered by Giraph, which used to be Pregel, which uses
point-to-point communication more and also tends to be iterative.
And then finally there's a class which, at least in the literature, is
largely done asynchronously, with thread-based algorithms running on large
shared-memory machines.
So those five execution models sort of cover those 51 use cases, at least I
would say they do.
All right. So this is a chart I've shown for many years: the different ways
of doing MapReduce -- given the first talk of the meeting, either uses or
abuses. We have map-only; classic MapReduce -- those are the first two
execution models I mentioned; iterative MapReduce, which is map-collective,
or map followed by a large communication structure; and then point-to-point,
which is where simulations are, and where Giraph and [indiscernible] also
are. And you can cover all of them as map-communication, if you want to
cover Giraph and iterative MapReduce. So, all right.
So you can now just write down all the algorithms and classify them in these
areas -- and since we don't have so much time: we have map-only, that's
local machine learning; MapReduce, which is certainly where search and query
come in, things like large [indiscernible], summarizing statistics,
recommender systems, [indiscernible] classifiers; map-collective, where we
have lots of nice algorithms, which go on to the next page; and we have
map-communication, where we have the graph algorithms, and we have the
asynchronous ones. So this classifies the algorithms into these execution
models.
So now let's compare with what we learned from simulation. Pleasingly
parallel is important in both. I remember in 1988 I gave a talk about
parallel computing and identified pleasingly parallel as an important
parallel computing model, so it's still very important.
Nearly all of these are single program multiple data and essentially bulk
synchronous processing. And obviously iterative MapReduce is used in
several important problems; it's not a common simulation paradigm, though,
except for maybe some of the cases where you do a reduce phase after
pleasingly parallel execution of various tasks.
Big data often has large collective communication -- broadcast, things like
that, and of course reductions -- whereas simulations are normally
associated with small messages. So there's a difference there.
One important difference is in sparseness. Simulation is essentially always
sparse, except for a few examples like electromagnetic simulation. And in
the case of big data, we have some sparse cases, like PageRank and bags of
words, which tend to be sparse, but there are lots of important big data
problems which are not sparse. So there's a difference in the importance of
full versus sparse algorithms.
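To make the full-versus-sparse distinction concrete, here is a small
illustrative sketch in SciPy (the sizes and density are made up):

    import numpy as np
    from scipy import sparse

    A = sparse.random(1000, 1000, density=0.01, format="csr")  # sparse operator
    B = np.random.rand(1000, 1000)                             # full (dense) matrix
    x = np.ones(1000)
    y_sparse = A @ x   # sparse matrix-vector product: touches ~1% of entries
    y_full = B @ x     # dense product: touches every entry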
So this picture here is one I often use to make this [indiscernible] design
point. If you look at the graph on the right, that's a classic
Facebook-like page of friends, where we have people linked together. That's
a typical graph problem.
If you look at the picture on the left, that's a particle force problem.
There are a lot of analogies between particle force problems and Facebook
problems.
>>: [Indiscernible] except that here [indiscernible].
>> Geoffrey Fox: I didn't say they were the same, but there are sort of
relations between them. There are similarities. They both get executed in
the map-communication model, and so that's what I wanted to say. And also
there are lots of long-range force problems to which you can get some
analogies in the big data problem, because there's a lot of [indiscernible]
-- if you calculate distances between points, and you often compute the
distance between all points, as you often do, that looks like a long-range
force problem. So that's -- all right.
>>: So -- doing it that way, [indiscernible] order N squared.
>> Geoffrey Fox: They are order N squared, definitely. There are lots of
order N squared problems. [Indiscernible] a lot of clustering problems are
order N squared, because you compute the distance between all points.
>>: So you're implying [indiscernible] big data [indiscernible].
>> Geoffrey Fox: That's what this says here -- [indiscernible] multiple
ideas, yes, but there are variants in that.
It also -- so this tends to give you full matrix algorithms. And I note
that even in, say, deep learning, the networks tend to be sparse, because
there's no full connectivity -- you don't get connections between all
neurons -- but the connections actually come in blocks, therefore full
matrix algorithms, and that's why they run very well on GPUs.
>>: There are a lot of data analysis algorithms that are not about matrix
[indiscernible] --
>> Geoffrey Fox: That's actually what I tried to capture up there with
Giraph and map-communication. If you look at the deep learning for face
recognition, or learning how to drive a car, which is done by
[indiscernible] group at Stanford, those run very well on GPUs, and that's
because the [indiscernible] of that problem is matrix-matrix multiplication.
And that type of -- so all I'm trying to do here is compare what we've
learned in these various problems, and so --
So how much longer do I have?
>> Vani Mandava: About three more minutes.
>> Geoffrey Fox: Okay. Well, I'll just finish up quickly then.
So we're trying to understand how to implement these ideas -- that's with
[indiscernible] in the audience, [indiscernible] at Rutgers, and some of the
work Dennis is actually helping with on Azure. And basically we use this
concept of a high-performance Apache big data stack, by effectively
integrating HPC with the Apache stack. One key thing is that below the
Spark, Hadoop, et cetera layer, we've introduced a communication layer,
where we try to take the best possible communication algorithms, which are
often those done in MPI. So our goal is to get the functionality of the
Apache stack and the performance of HPC. So you can look at that. There
are lots of layers.
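As a flavor of the kind of collective that layer optimizes, here is an
illustrative allreduce in mpi4py -- a generic sketch, not the project's
actual code:

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    partial = np.random.rand(8)                # each node's partial results
    total = np.empty_like(partial)
    comm.Allreduce(partial, total, op=MPI.SUM) # the optimized collective step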
So let me just -- these figures here are from Azure. This is Twister4Azure,
which is iterative MapReduce running on Azure, and we have some graphs here
comparing the use of collectives to improve the performance of -- in this
case, [indiscernible]. This is already using iterative MapReduce on Azure,
and if you introduce, in this case, the allreduce collective and implement
it in an optimal fashion, you get better performance. And you can compare
that across the broader platforms we have here. So actually, this compares
Twister4Azure -- these three here -- with HDInsight, which is the slowest,
and also with Hadoop using the same type of machines, and you will find that
using these collectives, it runs much faster, because we're doing MapReduce
[indiscernible] iterative problem [indiscernible]. It's also
[indiscernible] -- whenever we measure it, Java is faster than C#. So the
[indiscernible] HDInsight actually runs faster in the actual compute part
than Twister4Azure, because Twister4Azure is running in C#. But the
communication part is slower.
So the idea here is that by introducing collectives which we implement in an
optimal fashion -- so that's these [indiscernible], the things implemented
in an optimal fashion -- we can get substantially better performance. And
this one here is the same thing run on Linux, using an optimal
implementation for Hadoop running on Linux. So that's that.
So I think that's a reasonable place to stop. So what did I try to tell
you? I tried to look at these use cases, tried to identify key features,
compared those features with what we know from parallel simulations,
identified sample classes of applications, and noted they can be run in
various different [indiscernible] execution modes. And at the end here, I
showed that for one class of those, those doing map-collective, we could
actually implement these collectives in an optimal fashion, and then you
can put that in Spark, or put that in Hadoop or HDInsight, and get
substantially better performance. The idea is that you first identify the
need for that type of execution pattern -- the map followed by collective,
or the map followed by communication, a la Giraph -- and then realize that
you can implement that.
So the programmer just knows, I'm doing map-communication, and they invoke
the communication, and then underneath the hood, whether it's
[indiscernible] Linux HPC or Azure or Amazon, the system does the optimal
implementation. That's the basic idea.
I'll stop there. Thank you.
[Applause]
>>: [Indiscernible] HDInsight.
>> Geoffrey Fox: That's the slowest by far.
>>: So actually from Microsoft's point of view where you pay as you go, that
would generate the most revenue.
>> Geoffrey Fox: Definitely. And that's [indiscernible].
Get your performance -- performance [indiscernible].
>>: Good thinking.
>>: Were you implementing on top of MPI --
>> Geoffrey Fox: No, I'm not using MPI. We're taking the collective
algorithms. If you look at the last 20 years, people have been building
optimal algorithms for MPI for every single one of the 230 primitives. So
you look at that work and you try to implement it in the best fashion for
the platform. So this is definitely done separately, say, for Azure and for
an HPC [indiscernible] cluster -- just as, in fact, the MPI work is done
differently for [indiscernible] different network topologies.
>>: Could you just comment on whether these -- the HDInsight
[indiscernible]. These are running on [indiscernible]?
>> Geoffrey Fox: All of them, yes.
>> Geoffrey Fox: These are Hadoop running on an HPC cluster. These things
here are Twister4Azure, with different implementations of the communication,
running on Azure, and it's got optimized communication that we did for
Azure. The yellow -- the orange one is HDInsight running on Azure as
delivered by Microsoft, without any optimization of the communication. So
it's meant to illustrate that you can take the same concept of optimal
collectives, do it on HPC, do it on Azure, get a uniform programming model,
and then get better performance on both of them.
>>: One question. [Indiscernible] problems -- particularly all of my
problems, but I suspect this is more generic -- is that there's a degree of
[indiscernible]. You know, like you're operating over multiple images, but
each image --
>> Geoffrey Fox: You could do that, yes. That's certainly obviously
possible, yes, and some people do that.
>>: I guess the question then, when you're looking at these, is if I have
got 64 [indiscernible], for example, how many machines ought I to divide
that into? Should I use sixteen 4-core machines or one 64-core machine, in
terms of --
>> Geoffrey Fox: So I mean, this rather crude analysis here won't directly
give you that answer. It will tell you that when you want to do the
parallelism inside images, you're going to be pretty sensitive to
communication latencies and things like that, and you ought to try to use
these optimized communication primitives. And [indiscernible]'s project,
where she is putting these in as Hadoop plug-in operations -- she has very
nice results from that. So that's our plan. Our plan is to help you do
that efficiently by identifying these different models -- the pleasingly
parallel model, which you want to use in your hierarchical parallelism --
and then, within that pleasingly parallel model, we obviously know that for
a given size problem, the efficiency is optimal for a certain number of
cores, because of the communication-to-calculation ratio which we
[indiscernible] started. So if we have the right language, you can do those
experiments.
>>: [Indiscernible] if your inner parallelism gets the same speedup with
four cores as it does with eight, then you're better off with the four-core
solution and [indiscernible].
>> Geoffrey Fox: That's right.
>>: Cost-benefit analysis.
>>: And the cost [indiscernible].
>>: That's right. So the 16-core machines that are available on Azure,
that [indiscernible].
>>: We don't have that system.
>> Geoffrey Fox: I'm sure we can build an auto-tuning solution, using ATLAS
or some variant of ATLAS, [indiscernible] collectives on Azure and the costs
of the various instances, and it will tell you -- in fact, automatically
deploy on the right instance. Clearly one can do that. That's
[indiscernible] -- that's sort of engineering.
>> Vani Mandava: So we have one more talk, after which there is a break, so
if there are more questions, maybe we can take them in the break. We have
one more talk right now and then the break. Thank you, Geoff.
[Applause]
>> Vani Mandava: So our next speaker is Jeff Zhang. Jeff Zhang is a program
manager from Microsoft Excel. He is in charge of the design and
implementation of Excel data and analysis features and tools on the Apple
platform. Recently his team shipped Office for iPad, which I'm sure you
have all heard about. Prior to Excel, he worked on Microsoft Exchange
Server, and before joining Microsoft, he was an assistant researcher at the
Chinese Academy of Sciences. Please welcome Jeff Zhang.
>> Jeff Zhang: Good afternoon. It works. So today I'm going to show you a
short demo about Power BI for Excel. We talked about Excel just now.
[Indiscernible] generally Excel is -- let me see. It's not coming up. Is
it up? Okay.
So we talked about Excel being a mix of love and hate. It's actually a
similar [indiscernible] product, but I hope today's short talk, about the
problems and tools that we [indiscernible] in Excel, adds some more love to
this product.
So as I remember -- I think this is a true story -- in my research days back
in China, [indiscernible] as a graduate student, professors always assigned
me tasks to do a lot of data crunching and data processing work. Of course,
those are the kinds of work that are necessary to come up with good insights
based on massive and varied data. However, as a researcher, this is not
really something that's all that delightful in terms of using our brain
power.
So what we really want is to use a small portion of our time to process the
data, and probably also to come up with an easy way to provide various
angles to view our data, so that we can get quicker data insight.
So really, the Excel Power BI tools are for that purpose. We say that the
more data we have, the more power we have, but that is only true when we
have the proper tools. So Excel Power BI is actually a set of features or
add-ins, including Power Query, which is the engine inside Excel on the
desktop that has the capability of processing millions [indiscernible] of
data. And for data processing -- starting from grabbing data from somewhere
that you [indiscernible], to formatting the data into a format that you want
it represented in -- Power Query really is the tool in place to help you
quickly achieve that purpose.
Power View and Power Map are both [indiscernible] tools in Excel that help
you view the data on a map or on a dashboard, which enables you to
[indiscernible] your data from various angles.
So my demo today starts with [indiscernible], which some of you may have
already seen yesterday. This is a simple video that I generated from data
that I got from one of our researchers, which shows the Guadalupe River
stream flow at different locations along the river. The time series is from
June to November, I believe. And through this simple map, we can tell what
the stream flow looks like across [indiscernible] as well as the locations.
So some insights come [indiscernible] from this data. For example, you see
in the data in October there's a very obvious spike, sometime like here.
Without visualizing the data like this, I might be able to find that through
some pie chart or line chart [indiscernible] from the pure raw data, but by
presenting it this way, it's easier for me to grab the insight and ask
whether the reason for this spike of the stream flow over that area is
because of the [indiscernible] happening only around this area, on the
downstream of the Guadalupe River, not upstream.
So this is a very, very simple example of the [indiscernible] idea that I'm
going to show you.
Go back to Excel. So in these several spreadsheets, you see data like this;
these are the raw data that I got from sensors at the locations you just saw
along the Guadalupe River. So it has the name of the sensor, the longitude
and latitude, which is the spatial information, and also the datetime and
the stream flow data. So if you take a closer look at the data, this data
is just raw text in a stream format directly from the device. The device
may be designed to record data in a way that makes sense for the device to
understand or use. However, as a common-purpose representation of data
[indiscernible], this format is not really usable for visualization. So we
need to reformat it -- fix the datetime format, get the data into a table
[indiscernible].
What I can do here is use the tool I just mentioned, Power Query, to process
the data. What I'm going to do first is create a table out of this data.
Once you have Power Query installed on your machine, Excel shows a Power
Query tab where you can get data from various sources: the databases that
you have, or data preprocessed from some terabytes down to gigabytes in
different databases -- not only Microsoft ones, but also the Azure Data
Marketplace, even Hadoop.
But here, just to show the power of the data processing, I'm simply
importing the data from this single table. And there is the Power Query UI:
along the top of the UI, you have a series of functions that you can use to
process the data in the query, like [indiscernible], or crunching the data
by reducing different kinds of rows and columns.
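For reference, that import step corresponds roughly to the following minimal
sketch in Power Query's "M" formula language; the table name "SensorData" is
an assumption for the sketch, not a name from the demo workbook:

    // Hedged sketch: load a worksheet table into Power Query.
    // "SensorData" is an assumed table name.
    let
        Source = Excel.CurrentWorkbook(){[Name="SensorData"]}[Content]
    in
        Source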
On the right-hand side [indiscernible], you see the query name and all kinds
of [indiscernible] steps that you can just replay. Here is also the grid
that we just saw in Excel. What I'm going to do now is extract the date
information from the first column. Without this tool, what I would do is use
a formula to find -- probably find this [indiscernible] character, extract
the [indiscernible] before that character, and then apply that to all the
cells in the first column.
However, although that is not complicated, every time I want to do it, it
still takes me some time to come up with the formula. Using Power Query,
there's a single function called Split Column. So by [indiscernible] the
first column: split, and by the delimiter -- I can use different
[indiscernible], but in this case I want to use the one which is T -- and
split from the first row. Then I see that [indiscernible] asked me what
[indiscernible] into two columns, and the first one has been changed to the
date and the second one to the time. So there's an automatic step taken
here: beyond just purely splitting the data from one column into two, Power
Query also recognized and assigned the data types, date for the first column
and time for the second one.
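As a rough illustration, the same Split Column step could be written out in
M like this; the column names "DateTime", "Date", and "Time" are assumptions
for the sketch:

    // Hedged sketch: split the raw timestamp on the "T" delimiter, then
    // set the column types explicitly, mirroring the automatic step.
    let
        Source = Excel.CurrentWorkbook(){[Name="SensorData"]}[Content],
        Split  = Table.SplitColumn(Source, "DateTime",
                     Splitter.SplitTextByDelimiter("T"), {"Date", "Time"}),
        Typed  = Table.TransformColumnTypes(Split,
                     {{"Date", type date}, {"Time", type time}})
    in
        Typed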
Now, because I want to represent the data across several months --
[indiscernible] the data comes in every 15 minutes each day -- what I really
want to see is, for every day, what's the average stream flow. I could also
do that through some formulas, or even macros, in Excel to aggregate the
data.
However, Power Query also offers me the ability to do such aggregation with
a few very simple clicks. I just need to use [indiscernible] and say
[indiscernible] the information by the first column. What happened? The
first column, which is the date-and-time column, [indiscernible] and I name
a new column as the stream flow average. So don't look at it. Listen to me.
>>: [Indiscernible].
>> Jeff Zhang: [Indiscernible]. Okay. It's back. Good.
So I need to aggregate the data. Actually, you see, the third column is the
stream flow, so I choose the stream flow column, I click okay, and then you
see the data [indiscernible] grouped only by date, along with the average of
the data for each day.
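In M, that Group By operation corresponds roughly to the sketch below,
building on the split-and-typed table from the earlier sketch; the
"StreamFlow" column name is assumed:

    // Hedged sketch: one row per date, averaging the stream flow readings.
    let
        Source  = Excel.CurrentWorkbook(){[Name="SensorData"]}[Content],
        Split   = Table.SplitColumn(Source, "DateTime",
                      Splitter.SplitTextByDelimiter("T"), {"Date", "Time"}),
        Typed   = Table.TransformColumnTypes(Split,
                      {{"Date", type date}, {"Time", type time}}),
        Grouped = Table.Group(Typed, {"Date"},
                      {{"Stream Flow Average",
                        each List.Average([StreamFlow]), type number}})
    in
        Grouped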
And remember, we still have longitude and latitude data in the original
table, so what I want to do next is to lay all that data into one table so
that I can put that data on a map. By doing that, I can simply add a new
column to the table. This is the longitude information that I remember. And
I just want to click okay, and you see it's very easy.
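That Add Column click maps roughly to Table.AddColumn in M; in this sketch
the table name "DailyAverages" and the coordinate values are invented purely
for illustration:

    // Hedged sketch: attach a sensor's coordinates as new columns so the
    // aggregated table can be placed on a map.
    let
        Grouped = Excel.CurrentWorkbook(){[Name="DailyAverages"]}[Content],
        WithLon = Table.AddColumn(Grouped, "Longitude",
                      each -121.99, type number),
        WithLoc = Table.AddColumn(WithLon, "Latitude",
                      each 47.10, type number)
    in
        WithLoc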
These are the data that I processed in Power Query. You see, these are
things that you might be able to do cell by cell, either with formulas or
through functions, or by using macros. But here, with just a few clicks,
within a couple of minutes I can process that data into one new table. This
is just one sensor's data, but you can imagine doing the same thing for the
sets of data in the other sheets.
So the [indiscernible] question is: after I've processed all of those data
sets and I want to represent them in one chart, the next step I need is
really to connect, or append, those data sets into one big table. How can I
do that? Actually, I can also do that with Power Query. This is the data
that I really want: one table holding all seven sensors' data. So what I
need to do is just bring back the data that I already have, using the same
tool, which is Power Query. Bring up Power Query, and there's a very obvious
icon we call Append. By clicking Append, we append the various data sets
that we just processed into one set of data. Then [indiscernible] on that.
[Indiscernible] the names are not the same, and [indiscernible]
automatically append them. So that's how we use Power Query to process the
data.
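The Append step maps to a single Table.Combine call in M; in this sketch the
per-sensor tables are inline stand-ins with invented data, just to keep it
runnable:

    // Hedged sketch: append per-sensor tables into one. Table.Combine
    // matches columns by name, which is why the headers need to line up.
    let
        Sensor1  = #table({"Date", "Stream Flow Average"},
                       {{#date(2013, 10, 1), 12.3}}),
        Sensor2  = #table({"Date", "Stream Flow Average"},
                       {{#date(2013, 10, 1), 8.7}}),
        Combined = Table.Combine({Sensor1, Sensor2})
    in
        Combined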
Next, what I'm going to show you is how to visualize them using Power Map.
Power Map is a separate [indiscernible], so once you have that
[indiscernible] installed, you will see a Map button on the Excel tab.
[Indiscernible]. Click on Map, and it will bring up the UI. Over here on top
is the visualization that I already generated yesterday. Now I'm going to
show you how easy it is to generate a new one based on the same set of data.
Click on new tour. And then, very quickly, Power Map automatically
understands that I have latitude data and longitude data in my data set and
can use them as a reference; click on next. The resolution here is really
too small, so it's hard to see. It's already crashed. The resolution is too
small. Let me try it again.
Create a new tour. When you click on create a new tour, Power Map will start
to digest the data, look through all the data that you have in this table,
see what it can use as a spatial reference, and automatically set that
[indiscernible] spatial reference data: longitude, latitude.
I think we have some problem over here, because the resolution is really too
small.
Time series is right. Category is right. And for stream flow, I just need to
use no aggregation. And I still need [indiscernible] longitude and latitude.
I probably cannot do this [indiscernible], just because the resolution is
too low and that slows the display for Excel. So I'll just keep this one,
but I can show you later on my machine.
The last part I want to show you is another visualization tool that I just
mentioned, which is Power View. Power View is essentially another type of
[indiscernible] that you can have in Excel: a [indiscernible] dashboard that
lets you put different visualizations on one sheet so that you can view data
from multiple angles.
For example, this is essentially a spreadsheet, but this dashboard is like a
special spreadsheet. Earlier I showed you a map visual; this one has a map
visual and also a line chart. The best part is not only that you can see
both the spatial data and the [indiscernible] data in one view, but also
that they are associated, and you can explore the data. For example, if I
only want to see [indiscernible] data, I can click on this [indiscernible]
data, and then no [indiscernible]. So I can drill down into the data for a
specific [indiscernible] alongside [indiscernible] to see the trend of the
stream flow.
Just to show you an example, I just enable a single data [indiscernible].
For example, if I only want to see [indiscernible] marine data, I can use
fields to filter the data by the names of the marines and show only
[indiscernible] three on this map. Very easy. And then I can see the
patterns of the stream flows: they're actually very similar to each other.
So I ask myself the question, why are they similar? Then I look at the
left-hand side, at the map. Just geographically, these three marines are
very close to each other -- Green [indiscernible] and [indiscernible] -- and
they may share the same water [indiscernible]. That may be the reason they
have a similar pattern of stream flow. So I explore the map a little bit
more and see there's a snow mountain over there, Mt. Rainier, very famous
around this area. And then I see that the stream flow peaks during
September. Maybe that's the time of year when Mt. Rainier melts the most,
and that's why we get the most stream flow in this area.
And you see that although the pattern is similar, in [indiscernible] the
stream flow is almost double, even more than the stream flow between Green
and [indiscernible]. So I can make a guess: maybe the streams come down from
Mt. Rainier to [indiscernible], then come back to [indiscernible] and go on
to Puget Sound. So this just shows that this interactive dashboard, by
putting multiple views of the data on one sheet, really helps you get
quicker data insights.
The data itself is very simple, and you may think it's very complicated to
produce this kind of dashboard. Actually, it's very simple.
What I do here is go to the Insert tab, and if you have the Power BI
[indiscernible] installed together with your Excel, then you will see this
Power View button over there. Click Power View, and then you will create a
new one. There are a few tables in [indiscernible] spreadsheet, which are
essentially [indiscernible] the same as the ones I just showed you for the
marine data on the Guadalupe River, so I just need to add the data onto the
sheet by checking all of them. This data is basically the raw
[indiscernible] data for the marine metadata. Then I can change the design
of the data by saying: instead of using a table, just represent it on a map.
Click on the Map button and adjust the setup a little bit -- longitude goes
to longitude, latitude goes to latitude, and [indiscernible]. So now I can
see all the marines on the map. Both the [indiscernible] map and the Power
View map use the Bing Maps service at the back.
And how can I add the data, like the [indiscernible] on the right-hand side?
It's also very simple. I just need to add the data to the [indiscernible]
again by adding all of the fields, and then go back to design. It takes some
time.
This projector really makes things very hot.
So the idea here is: you see that I seldom used my keyboard to type anything
throughout the process that produced these visuals. It's just to show the
power of the visualization and data processing tools in Excel; you can
imagine that [indiscernible] tools can save you tremendous time in getting
to data insights.
So the demo was not as smooth as I'd like, but some of you already saw it
yesterday [indiscernible] without this slow projection resolution. That's
what I wanted to show today.
[Applause]
>> Vani Mandava: Any questions?
>>: So earlier today we saw integration of [indiscernible] R and Python with
Visual Studio. Can you imagine that type of integration with the Excel
environment, to get more direct access to data?
>> Jeff Zhang: Yeah, those are very good questions. So we talked about
macros, which are just one type -- the most important [indiscernible] -- for
Excel and the other Office apps. Another programming model that we are
trying to build with Excel 2013 and going forward, similar to what the
program manager [indiscernible] just mentioned, is called Apps for Office,
which essentially is a [indiscernible] extension that's built inside Excel.
For example, with that feature -- let me see if I can show it here; I'm not
guaranteeing that I can show it successfully. You see these [indiscernible]
buttons over here, Maps or People Graph; those are the new extension model
that we are enabling starting from Office 2013. They are essentially
services based on a web programming language on the service side, and
[indiscernible] that, if you follow some of the metadata methods that Office
requires, you can put them as part of the content in Excel [indiscernible]
or as a [indiscernible]. Let me see if I can just try that out very quickly.
For example, this is a new [indiscernible] app -- or view it as a new
extension for Excel. The content in this app is essentially a web page, a
JavaScript-powered web page, that reads data from Excel, or writes into
Excel, by interacting with the user's behavior in this view. You can imagine
this is a very simple chart, People Graph, but it also enables the map API
and any other service that we have already built for other places. As long
as it can be exposed somehow on a web page, you can embed that portion of
the [indiscernible] part, or the visualization part, into Excel and present
it together with the data inside Excel [indiscernible].
So it's called Apps for Office. Go back and check it out.
>>: You talked briefly about how you shape the data and bring down just what
you want into the graph, but [indiscernible] spreadsheet. What is the
current functional limit on the number of rows in Excel?
>> Jeff Zhang: That's a very good question. So [indiscernible] the number of
rows has a limit of [indiscernible] integer, I think. But behind that --
this is something that's not exposed to the end user -- there is Power
Pivot. Power Pivot, you can imagine, is a separate data compression and data
indexing engine inside Excel, and all the data that's been processed by
Power Pivot is stored in Excel not in the open Excel format but instead in
[indiscernible] format. The power of it is that Microsoft used the same data
compression and indexing engine on both [indiscernible] service in the
[indiscernible] and on a server, and with the same bits of code we put it
into Excel. So the real limit on the amount of data is just the memory on
your machine. There is a demo -- not the demo for today, but there are other
demos you can check out -- where a [indiscernible] task, a calculation or a
query over a table with more than one [indiscernible] rows inside Excel,
runs in less than one second.
And that was tested on a very old laptop running Windows [indiscernible].
>>: I was wondering -- I had a few questions here. The first one is, when
you're doing these visualizations and mapping the different columns, when
you save state, is all of that saved, so that when I come back, that's
exactly what I get, with my visualization in whatever form I had it? That
was my first question. The next one was, when you said you could import from
HDFS, is that using WebHDFS, and what's the authentication mechanism, if you
know? Because HDFS by default likes Kerberos and doesn't like a lot of other
things. Is it using the WebHDFS RESTful API to grab the data? And then, will
it work on my Mac, would be another question.
>> Jeff Zhang: Good question.
>>: And then, sort of playing on the first question about saving state, do
you see building anything into here to allow auditing of what you've been
doing, in the trend towards research reproducibility? If I had some kind of
audit trail, I could pass this over to somebody else, and they could see my
steps and be able to reproduce what I've done, and maybe tweak it, if you're
trying to build Excel into a data analysis environment.
>> Jeff Zhang: Right. So, first question: all the visualizations and data
processing are stored in the Excel file. As long as you open that Excel file
-- even from another machine, maybe as a different user -- you can see them;
all of that information is stored inside the file as long as you saved it.
Second, the Hadoop server: you just need to type in the server name, and
then there's [indiscernible]. I'm not quite sure, but what I [indiscernible]
before was just purely anonymous sign-in for that Hadoop server. I think
[indiscernible] should be supported within the Power Query tool, but I'm not
quite clear about the exact authentication matrix that's supported. I can go
back and find out; yeah, just shoot me an e-mail and I can.
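For reference, Power Query's M language does expose a Hadoop connector,
Hdfs.Files; a heavily hedged sketch is below, where the host name and URL
shape are assumptions, and, per the answer above, the authentication
behavior is exactly the open question:

    // Hedged sketch: list files from a Hadoop cluster via Power Query's
    // HDFS connector. The address is invented; auth options may vary.
    let
        Source = Hdfs.Files("http://namenode:50070/webhdfs/v1/data/sensors")
    in
        Source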
Third question: whether those things are coming to the Mac. We are actively
working on that. We are very busy working on Office [indiscernible], so next
is Office [indiscernible]; hopefully it can come soon.
The fourth one is the [indiscernible] steps that you mentioned.
[Indiscernible] today we didn't show that very clearly in the demo because
of this projector thing, but every step that you conduct in Power Query is
recorded [indiscernible], one step after another, as steps that we can
replay. Beyond that, there's a script at the back that you can also see and
modify: if you know the programming language for Power Query, you can also
do it in the programming way. And when you share the data with another
person, they see all of those steps as well.
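That recorded script is written in Power Query's "M" formula language. As a
hedged sketch of what the script behind this demo's steps might look like --
every click above becomes one named, replayable step, which is what gives
you the audit trail -- with all table and column names assumed rather than
taken from the demo file:

    // Hedged sketch of a full recorded query: import, split the timestamp,
    // set the types, and aggregate to daily averages.
    let
        Source  = Excel.CurrentWorkbook(){[Name="SensorData"]}[Content],
        Split   = Table.SplitColumn(Source, "DateTime",
                      Splitter.SplitTextByDelimiter("T"), {"Date", "Time"}),
        Typed   = Table.TransformColumnTypes(Split,
                      {{"Date", type date}, {"Time", type time}}),
        Grouped = Table.Group(Typed, {"Date"},
                      {{"Stream Flow Average",
                        each List.Average([StreamFlow]), type number}})
    in
        Grouped

Sharing the workbook shares this script, so someone else could replay, audit,
or tweak each step.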
And one more thing that I didn't talk about: in Power Query, if you sign in
as a [indiscernible] user, then you can not only share the data that you
just processed so that [indiscernible] others or your coworkers can see it,
but you can also search for data from others. For example, I have a
[indiscernible]; I'll use his name. Like, just [indiscernible] some data
that he processed. This is the data that he processed before. So I can see
all of that data, and if I double-click, I can bring up Power Query -- you
cannot see it here, but I can see all the steps of how he processed the
data, and [indiscernible] reuse or modify them for my future use.
Okay. Thank you.
[Applause]
>> Vani Mandava: Thank you, Jeff.
>>: [Indiscernible] in the back of the room and we will have our wrap-up session [indiscernible].