>> Vani Mandava: So our next session is something that I'm sure all of us
have had to deal with, every single one of us in this room. In all the
research that we do, we have to deal at some point or another with how to
extract information from data.
Our first speaker is Shahrokh Mortazavi. Shahrokh is a partner program
manager at Microsoft. He currently works in the developer division of
Microsoft on the [indiscernible] tool chain. Previously he was in the
high-performance computing group. He worked on the Phoenix compiler tool
chain, which is codegen, analysis, and [indiscernible] tools, at Microsoft
Research. And before that, for ten years, he led Sun Microsystems'
code generation and optimization compiler [indiscernible] teams. Welcome,
Shahrokh.
>> Shahrokh Mortazavi: Thank you. All right. My goal today is to keep Ed
Lassiter awake. If I can accomplish that -- no. He's my mentor.
>>: [Indiscernible].
>> Shahrokh Mortazavi: He's my mentor. Everything I know I learned from him.
So originally I was going to give a talk on the whole workflow of E-Science
and data science. Then I said, you've only got 15 minutes. Pick a slice of
it and go from there. So I hope to show you guys that, hey, Windows is a
decent platform for doing data science and E-Science. That's really the
high-order bit.
I looked around on the net. There were a whole bunch of really cool workflows
for data science and E-Science. And I looked at the list of developer tools
that they mention. Here's one of them. You see Excel, Java, Python, R, Weka,
Hadoop of one flavor or another, Flume, and D3.js -- everything.
So I looked at some more. At some point they basically started repeating the
same tools in programming languages, visualization, big data, MapReduce
stuff, storage. So the buckets would basically be gathering data, cleaning,
munging, storage-related stuff, analytical tools, the languages and
scripting, which is mostly what I want to talk about, and, you know,
productivity things like IDEs and so on. And of course for this audience,
publishing and sharing is probably up there.
The very first one was Excel. You know, love it or hate it, we just did a
survey of a whole bunch of data scientists, and Excel was one of the top
tools that they still use. And there's a love/hate relationship. It's
everywhere. It's got pretty decent import and export support. It is
scriptable. You can write UDFs if you want, for running your own code.
There is a web version of it now that's pretty good. If you've been using
Excel spreadsheets, you should really take a look at this. It's pretty
decent, but it's got some warts, too. When it comes to extremely large
inputs, this is not your tool.
Version control has always been an issue. How do you version this stuff? Do
you just change something in the name, or do you use Git and save it as XML
and do something else? And if you want to do anything, any logic behind it,
VBScript has pretty much been your only choice up to now. And the functional
library, when compared to all the stuff that's available in SciPy and other,
you know, R-related libraries, is somewhat limited.
But there's some great add-ins coming up that I think some of our colleagues
will demo later that give it a whole new take on analytics that you might
still find useful as E-Scientists.
One of the things that people really dislike is the fact that you're stuck in
VBScript. These days, there are a number of libraries, such as PyXLL and
Pyvot, that allow you to extend what you can do in Excel as a UI,
tabular-focused tool and use stuff like Python with it.
So for that, I'm going to do a quick demo just to show you.
This is, by the way, PTVS, Python Tools for Visual Studio. It's an
open-source plug-in for Visual Studio that gives you C# level support inside
Visual Studio. You can download it and stick it in VS. It's free and does a
bunch of magic for you.
Specifically here I'm going to do import excel. And let's see: excel dot
view, and I get nice completions and so on. Let's say I do a random hundred.
And here you are: there's a live bridge between Excel and Python. Do what
you want in Python that Python does well. Do what you need to do in Excel
that Excel does well. Use the right tool for the right things.
Here, let's say you want to insert a chart. Why don't we have Excel
recommend a chart? What do you say, Excel? Ah, nice, a scatter plot. There
you go. So you can essentially create a numpy 2D array, a 2D matrix, and
then zap it onto Excel. Do a bunch of filtering and so on, and then live
objects show up back in Excel.
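A minimal sketch of that kind of bridge, assuming the Pyvot library that
shipped alongside PTVS (its module is named xl; exact function names vary by
bridge and version):

    import numpy as np
    import xl                     # Pyvot: the live Excel <-> Python bridge

    data = np.random.rand(100)    # "do a random hundred"
    xl.view(data)                 # push the array into a live Excel sheet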
You could also do things like, let's say, if I just want to show that within
Visual Studio -- there it is. So you can actually have inline graphics.
This is using the IPython back end to actually send the code to a Python
interpreter and get its response back, which could be text, it could be
visuals as PNGs or videos or whatever, and display it right there inside
Visual Studio. Okay.
So let's see. Let's go back to the slides. So Excel: the pain points can be
addressed, is the key message here.
Moving over to scripts and languages that you might use for E-Science, you
know, there's a whole bunch of scripting languages. Python and R are the
dominant ones in this zip code. M, in MATLAB, if you are a MATLAB user -- or
maybe an open-source version of it, Octave, might be your thing. Julia is
coming up as a better version of M and Python combined, definitely faster in
terms of execution speed.
And as far as standard languages, the go-to languages are Java, C++, still
some Fortran. And if you're on Microsoft, you definitely want to check out
C# and F#, which, by the way, run very nicely on Macs and Linuxes now.
There's a new compiler infrastructure called Roslyn. Essentially it's a
rewrite of the C# compiler family, and now it's open source and available
on CodePlex, and the good folks at Xamarin are making the whole thing
available on Linux as well, so you can even write your Android or iPhone app
in C# and work in that environment.
So speaking of Python, which I work on -- so a little bit of bias here.
What's the quality and availability of the tool stack on Windows? I would
say excellent, with a little asterisk. If you're installing Python libraries
that rely on C++ for parts of it and you don't have the compiler set up, it's
a painful surprise. So for that, I suggest you stick to reputable distros
such as Anaconda and Canopy.
These are some screenshots of some Python IDEs that are available out there.
My favorites, surprisingly, are Python Tools for Visual Studio and IPython
from Fernando and Brian and Min [phonetic] at Berkeley and Cal Poly San Luis
Obispo, I believe.
These are cross-platform: PyCharm, Wing, Komodo, PyDev. So if you are in an
environment where you need to work on Mac, Linux, and Windows, forget about
Visual Studio. Go ahead and use one of these cross-platform ones and keep
your life uncomplicated.
As an E-Scientist, the core setup that you need is essentially a CPython
interpreter, 2.7 for maximum compatibility or 3.x for the latest language
features.
SciPy and NumPy are your friends, and pandas now is a library that
essentially gives you the data frame capability that R provides.
And then IPython Notebook, which I'll demonstrate later.
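A minimal sketch of that core stack in action; the column names here are made
up for illustration:

    import numpy as np
    import pandas as pd
    from scipy import stats

    # A pandas DataFrame plays the role of R's data.frame
    df = pd.DataFrame({"site": ["A", "B", "C", "A"],
                       "flow": np.array([1.2, 3.4, 2.2, 0.9])})
    print(df.groupby("site")["flow"].mean())   # R-style split-apply-combine
    print(stats.describe(df["flow"]))          # SciPy summary statistics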
Like I mentioned, SciPy -- just to give you a sense in case you're not
familiar, Python has pretty much become the lingua franca of a lot of
scientific stuff. So whether you need to do wrapping of code and calling
into native code, it plays well with that. Whether you want to call out to
MATLAB code, visualization 2D and 3D, optimization libraries, parallel
computing -- there's mpi4py if you want to do MPI programming. And
particular domain libraries: if you're in astronomy, if you want to do basic
statistics, biology. I mean, let's take one of these groups, say
biopython.org: there are very large, active communities that have very deep
and complete domain libraries. So Python is your friend.
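For instance, a minimal mpi4py sketch, assuming an MPI runtime is installed
(run with: mpiexec -n 4 python hello_mpi.py):

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    # Each MPI process (rank) prints its place in the communicator
    print("Hello from rank %d of %d" % (comm.Get_rank(), comm.Get_size()))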
Next is R. In the survey we just did, the one I mentioned, the two languages
that pop out more than anything are R and Python. So again, there's a
high-quality implementation available on Windows, a single-click install.
Very active community. Lots and lots of libraries that you can use. And if
you're doing a lot of stats, it's still the number one tool, the go-to tool.
Visualization is also probably still its forte over Python.
In terms of IDEs and environments, the IPython Notebook that I mentioned has
been extended to support R directly. You can also use an IPython magic
command, essentially a percent-percent prefix, to call out to some foreign
language, and it will bring the data back in. So again, you can stay in the
browser.
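A sketch of what those notebook cells look like, assuming the rpy2 extension
that provides the %%R magic (df and coefs are made-up names):

    # In one notebook cell, load the extension:
    %load_ext rpy2.ipython

    # In a later cell: push df to R, fit a model, pull the result back:
    %%R -i df -o coefs
    fit <- lm(y ~ x, data=df)
    coefs <- coef(fit)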
And there's RStudio, which again is cross-platform. If that matters to you,
stick with it. It has inline graphics just like PTVS. And Revolution
Analytics is another company; if you need an 800 number to yell at when
things go wrong, they have an enterprise version that's available.
Fortran, yes. Believe it or not, a lot of people are still doing it -- yes,
there's one right there. It's not dead yet, and it won't be.
If you're doing number crunching and you want the absolute best compiler
optimizations for your inner, inner, inner loop, and the best libraries for
FP crunching in general, if you want to do MPI, if you want to do OpenMP,
it's still the go-to language.
What's the availability on Windows? It's pretty good. Not as many options as
there are on Linux, but the two big ones are the Portland Group's PGI
environment and Intel's Fortran. Both of them, if you are in Visual Studio,
plug into the VS environment and provide you the full goodness and sweetness
of debugging inside VS.
Then there are the math libraries that sit underneath all of this stuff. The
commercial ones are MKL from Intel and ACML. There's Rogue Wave, and the
Visual Numerics libraries that became Rogue Wave's .NET libraries.
But essentially, all the important ones that are available on Linux are
available on Windows as well. And of course, R and NumPy and SciPy have
their own set of stuff they can directly tap into for a particular domain.
So essentially, the full mathematical and numerical stack that's available
on Linux and Mac is available on Windows.
Next, as a person that's slinging some code for your models and simulations
and so on -- especially since I see a lot of Macs here and I assume some of
the other machines are running Linux, you know -- if you're coming to
Windows, you want to bring your tool set, right? You don't want to go learn
a whole bunch of new stuff, especially if your stay in Windows is short,
right?
So Vim, Emacs, Sublime, whatever -- all the top, you know, five or six
editors that people use have high-quality implementations available on
Windows. So when you hop over to Windows, you don't have to do a massive
mental shift.
There are a bunch of lightweight IDEs that come with these languages, like
PyScripter and IDLE for Python, that you can use directly with pretty much
any distro. And there are some very high-quality web-based IDEs coming up,
like Visual Studio Online. You can use it from any browser, any OS, do your
editing, do your Git, and push stuff to Azure if you need to; all the core
features are there.
And there's a whole family of full IDEs, such as Eclipse and PyCharm and
Komodo and Wing, that are cross-platform. Again, if that matters to you, go
with them. If you are mostly on Windows, check out PTVS. Very powerful
environment.
On R, RStudio, again, is a beautiful environment that runs everywhere.
There's the Revolution Analytics one that I mentioned.
And for Fortran, Intel and PGI have very nice plug-ins for Visual Studio as
well.
And if you're doing .NET stuff, Xamarin Studio again is your friend. It runs
everywhere and gives you a whole bunch of support for environments you
wouldn't normally expect, such as the iPhone.
The up-and-coming quote/unquote IDE that everybody, if you're in E-Science,
should check out is IPython, which I briefly mentioned. Think of it as a
super REPL that gives you history, that gives you completion, that gives you
syntax highlighting. So over here, this is the terminal version. This is
the Qt console that runs everywhere. And this is the notebook version of it
that runs in any browser. Basically, you have two -- I think I should just
demo this.
So here's one that I quickly set up, or you can set up -- let's say I search
for IPython Azure. Let's see what shows up. Oh, great. So if you go on
Azure, you can very quickly set up an IPython Notebook using a Linux or
Windows VM -- again, because Python is cross-platform, it doesn't matter.
All the software is available.
A few commands, and you get yourself an IPython Notebook. The interesting
thing to note here is that you've got MathJax rendering of your formulas.
You've got code cells. You've got markdown cells, and this stuff gets sent
to a Python interpreter. It could be an R interpreter. It could be Julia.
Doesn't matter. And whatever comes back gets nicely captured and displayed
for you.
So here's a machine learning demo that basically is fed a bunch of faces,
then trained, and then we see if it can recognize some new faces thrown at
it. I click here to create a cell. It's just your regular Python. I just
did a Shift-Enter. Let's say I say x equals linspace of zero to 5; give me
20 samples. So then I can do x squared. Shift-Enter. Maybe even plot it: x
versus, let's say, sine of x squared. This looks ugly. Let's get more
samples. Much better.
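That demo, written out as plain code (each statement run in a cell with
Shift-Enter); the sample counts are the ones from the talk:

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 5, 20)      # 20 samples: the curve looks ugly
    x = np.linspace(0, 5, 200)     # more samples: much better
    plt.plot(x, np.sin(x ** 2))
    plt.show()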
So the beauty of IPython, whatever language you end up using, is -- I was
actually going to bring my Mac and do this on the Mac using Safari talking
to a Linux back end on Azure, just to make sure there's no Windows involved,
but I have a PTVS demo to run. Versus installing a whole stack -- install
Windows, update this and that, .NET update, whatever -- the installation now
is: here's a URL and a password. You're done. I already set up everything,
maybe even provided you with a single-click VM install of everything you
need for your particular scenario.
And on top of that, there's a whole set of outputs that you can capture from
IPython that allow you to get a static HTML version of your notebook, such
that people can just look at it. So is this a Ruby notebook? Is it Julia?
Let's see. Here's one that I can just flip through. So this is an HTML
capture. Again, these are markdown cells; this is executable code that I
can Shift-Enter on, I can move around, I can edit. But basically, the whole
concept of an executable paper that you can exchange as a URL, where I can
take your algorithm and then run it against my data -- this is the closest
that people have come that's available for everyone: open source, free, and
with no weird gotchas. Okay. So that's IPython.
So Python Tools for Visual Studio, I talked about it a little bit already.
The thing to mention is, if you're teaching, you can pick up the Express
edition. It has full Python support in it, for web programming and for
doing computational stuff. And when I say Python support, I mean any
interpreter, whether it's CPython; IronPython, which is Python on .NET; or
Jython, which is Python on the JVM; et cetera. It's got cool features like
remote debugging on Linux.
So we have a lot of customers that actually develop on Windows. They find
development on Windows to be productive, and they actually run their code on
a Linux farm. So this allows you to actually set a breakpoint right there
on your server and then have the goodness of Visual Studio right there.
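A minimal sketch of how that remote attach works on the server side, assuming
the ptvsd package that PTVS uses for remote debugging (the secret string is a
placeholder; API details vary by version):

    import ptvsd

    ptvsd.enable_attach(secret="my_secret")  # listen for Visual Studio to attach
    ptvsd.wait_for_attach()                  # optionally block until the IDE connects

    # ... server code; breakpoints set in VS now hit here ...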
I sort of mentioned the integration of IPython. You can, say, select a bunch
of code, send it to the REPL, and then you get nice graphics. So you can do
very nice interactive development of your algorithm. You can switch to the
editor and set a breakpoint -- let's say, if my conditions are met, good,
then print my process ID and machine -- or put a Boolean condition there and
then run the code. Maybe putting a breakpoint in a loop was not a good idea.
But anyway, you can use the various -- so essentially, look at it as a
MATLAB-like environment, but built around Python and all its libraries,
inside Visual Studio. Right? Everything you do in MATLAB, with a prettier
language. And free. It is a pleasant language. I mean, MATLAB is great. I
love MATLAB, but boy, everything better look like a matrix if you want to be
happy.
Another cool thing: it does mixed-mode Python and C++ debugging. This is
the only environment on the planet that does that, as far as I know. Python
is interpreted. It can get slow. You run the profiler inside Visual Studio,
it shows you where things are falling on their face, and you rewrite that
stuff in C++.
The good thing is Python plays well with other languages, but you know, as
you go multi-language, problems start, and you need a debugger that knows
both environments. In fact, if you look closely here -- do I have a laser
pointer? So this is Python; that is not Python, that is C. That is a
mixed-mode stack. And if I had done this properly, you would have seen that
this is a Python, C++, Python stack right there.
So if you have written Python C extensions, you will probably get very
excited.
Oh, that's the other thing. What [indiscernible]? Look at that. So you can
actually run your code on a Linux server and then just remote attach to it
and debug it and [indiscernible] your code that way.
So the next one in that bucket was visualization. Again, I would say things
are improving on Windows. The workhorses are obviously ggplot2 and R's base
graphics, and then for Python, matplotlib for 2D graphics, and it's got some
support for pseudo-3D graphics. More modern stuff is Bokeh from Continuum
Analytics, which essentially gives you very nice dynamic -- actually, I
think I made this a link. Let's see if that works. Wow, it did. Even in IE.
So, dynamic in-browser charts, which could, by the way, be embedded inside
IPython notebooks. Okay. So you can imagine IPython notebooks where you
have something you want to run -- say, you know, some progression of some
disease versus year -- and it can give your user a slider that they can
actually move with their mouse, and it changes the charts dynamically. And
there are things like Mayavi that provide you with nice 3D support. Okay.
As a developer, one other issue that comes up a fair amount that I wanted to
touch on: you're a UNIX guy, you're a Mac person, and somebody says you need
to go build this for Windows, and then you get some C++ code and say, oh,
man, I don't want to go and learn PowerShell and this and that and Visual
Studio. What's the minimum number of calories I can spend to go into the
lion's den and rush out?
So first I would say, if you end up doing this repeatedly, I highly recommend
you install Visual Studio -- it's not that bad -- and learn PowerShell,
which is a really good command-line shell, somewhat bash-like: it's got
pipes, it's got redirection, and instead of just sending text around, you
can send objects around -- .NET objects. [Indiscernible] a spreadsheet in
your pipe and do an inspection on it.
But if you don't want to do that, there are two nice options -- nice in
quotes. One is Cygwin, whereby you take your source, you bring it over to
Windows, and you install the Cygwin environment -- which, by the way, will
give you bash and a bunch of Linux utilities, so you feel right at home with
a dollar prompt -- and there's a DLL that intercepts POSIX calls and turns
them into Windows API calls. So that's one way.
The other one is MinGW. Again, both of these are based on GCC, but this one
truly gives you Windows binaries. Instead of having its own runtime like
Cygwin, it actually talks to the Visual C++ native runtime, so you get true
Windows binaries.
So those are two ways that you can minimize your learning calories if you
only do this occasionally: create a Linux-like environment temporarily,
build your stuff, and get out.
Cloud. Again, I don't think there's enough time to talk about this stuff,
but basically the new world is: pick an OS, any OS; pick a language, any
language. I don't know if you can read this, but here it says -- even I
can't read it. Ubuntu, CentOS, a bunch of Linuxes that you can choose from,
a bunch of Windows OSes. In the languages, I think .NET, Java, Node.js,
PHP, Python, Ruby, and a bunch of mobile and media stuff. So whatever
flavor of OS or language you want, it's available up there. But the most
important thing is the SDKs: for every language there's a nice SDK that
installs on Mac, Linux, and Windows. And you can use it to call the APIs.
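For example, a sketch of the Python flavor of that, assuming the Azure SDK
for Python of that era (the account name, key, container, and file names are
placeholders):

    from azure.storage import BlobService   # older Azure SDK for Python API

    blobs = BlobService(account_name="myaccount", account_key="...")
    blobs.create_container("results")
    # Upload a local file as a blob in the "results" container
    blobs.put_block_blob_from_path("results", "run1.csv", "run1.csv")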
And for the rest of the stuff, which I don't have time for -- big data, big
compute, analytics -- there is a solution available. HDInsight is our
spelling of Hadoop.
So I went through all the tools. I chose just these two, but I took three or
four of these lists, went through every single one of them, and did a check
on their website to see where they are in terms of availability. Actually,
even this one should have a blue check mark.
Green means it's available and pretty good. Blue means it's not available,
but there's an equivalent available on Windows somewhere.
And all of those were available except AWS. So replace that with Azure.
So if you want to do E-Science on Windows, your core stem-cell stuff -- you
know, your tools and languages -- all of them are available.
My recommendation: stick with OSS tools, especially if price and sharing and
community, all that stuff, matter to you. And use the proprietary stuff,
whether it's Power BI, Excel or whatever, if it gives you a distinct
advantage. Strange coming from a Microsoft guy, but that's my
recommendation.
And I think that's it.
[Applause]
>> Vani Mandava: Any questions? We have time for a couple of questions.
>>: You mentioned [indiscernible] from Excel. Seems that you could do
[indiscernible] and one of the things that [indiscernible] statistical
analysis [indiscernible], the advantage of having [indiscernible] for every
column, for every variable. Just was thinking that Excel would have
[indiscernible] that metadata [indiscernible]. You would be able to use a
lot of the R models and [indiscernible] models [indiscernible].
>> Shahrokh Mortazavi: Yeah. I know that, just like the Python bridges,
there are some R bridges between R and Excel. And Excel has a full COM API
that you can use to take stuff out, munge it around, and stuff it back in.
Even if core Excel doesn't support something you want, if you have these two
working together, I'm sure there's a way to make them live under the same
roof -- maybe sleeping in separate beds, but you know. I would have to
understand your case specifically, but I've seen amazing stuff done.
There's a gentleman doing a Power BI demo, I believe, so [indiscernible]
show you some of the stuff you can install to extend the core features of
Excel. So my guess is it's possible. I would have to understand your case
much more. You had a question?
>>: So one of the [indiscernible] is not so much the beauty of the language
but the availability of --
>> Shahrokh Mortazavi: Toolboxes, yeah.
>>: And that currently is just well beyond what R does and what's available
in Python. And so I guess to what extent does almost everything you say also
apply in MATLAB?
>> Shahrokh Mortazavi: So I would say the slope of activity in MATLAB is
like this, and in Python it's like that. You know, Python is catching up at
a rapid rate, and if you look at scikit-learn, if you look at the stuff like
I showed you in the topical software -- if you're doing Simulink stuff,
signal processing stuff, you know, MATLAB is definitely ahead. But the new
generation of researchers and students and so on, where they're putting
their calories, as I've seen, is not in MATLAB. They're writing new
libraries. They're contributing to these libraries, improving their quality
constantly.
So I would look at where Python was five years ago, ten years ago, and look
at where it is now. And MATLAB is still a beautiful, well-integrated, nice
environment, you know. I wouldn't say that there's an equivalent of that in
Python. I would say probably PTVS is the closest thing that's trying to do
that, but that's Visual Studio for engineers -- that's what MATLAB is.
We're coming from an environment that's Visual Studio for the general
programmer, and we're trying to crowbar it into an engineering tool.
But richness of libraries? Yes, there are some things that are better in
Python, and there are some things that are better in MATLAB. I think the
rate of improvement in the Python world is greater.
Not sure if I convinced you, but that's what the data we've seen shows.
Question?
>>: [Indiscernible] good for scientific computing. I've heard some people --
I'm obviously using my Mac -- [indiscernible] people using the Surface Pro
to develop, especially in distributed systems, because, you know, a lot of
it is just [indiscernible] machine and you don't need that much power where
you are. What are your thoughts on that? Are you guys trying to push that
as something in that sort of developer realm? You plug it in, you go to a --
>> Shahrokh Mortazavi: Very good question. Basically, it was, hey, I've
seen people that use a Surface Pro and SSH to whatever they need and do
their -- essentially use it as a lightweight terminal. Absolutely. We're
trying to go one step beyond that. That's why we put a lot of effort into
Visual Studio Online, which is a lightweight IDE with full Git support and
IntelliSense and syntax highlighting for many languages, and you can have
your projects, you can track your bugs, you can, you know -- open in Visual
Studio when you need to.
So we understand that the new world of pick-an-OS, pick-a-language means I
might be coming very much from a non-Windows environment, right? So I need
to be able to, from a Surface or my MacBook Air, which I do from home, you
know, just go online, and my world is in Azure. So that's a very legitimate
way of development.
It's just, what happens when your code gets really complicated and you need
to do profiling, you need to do debugging, or you have got 800,000 lines of
code? We were actually at PyCon, and one of the banks there told us they
have 3,000 developers and 60 million lines of Python code. Like, whoa.
So there are different people with different needs, but the one that you
described is definitely a growing audience.
>> Vani Mandava: Thank you, Shahrokh.
>> Shahrokh Mortazavi: Thank you.
[Applause]
>> Vani Mandava: Our next speaker is Geoffrey Fox. Geoffrey Fox is a
distinguished professor of informatics and computing and physics at Indiana
University. He's the director of the Community Grids Lab, associate dean
for research, and director of the data science program at the School of
Informatics and Computing. He previously held positions at Caltech,
Syracuse University, and Florida State University. He has supervised the
Ph.D.s of 66 students and published around a thousand papers in physics and
computer science. He is a fellow of APS and ACM. His work currently
involves applying computer science to bioinformatics, sensor clouds,
earthquake and ice sheet science, and particle physics. The title of his
talk is Comparing Big Data and Simulation Applications and Implications for
Software Environments.
>> Geoffrey Fox: Can you hear me? Thank you very much. Let me just check.
[Indiscernible] my ideas are sort of biased by parallel computing and
simulations moving to big data. And I often comment that if you talk to
people who do high-performance computing and simulations, they may argue
about exactly what the right chip is, but the broad principles of what it
takes to build a good supercomputer, I think, are understood and agreed.
My impression is that's not true in the data science or big data area. So
what I wanted to do is start with some work which [indiscernible] was very
instrumental in making happen, which is this big data activity which started
in the fall. And as part of that activity, I was leading the so-called use
case or requirements activity, and so we invented a template with 26 fields,
and we bugged people to fill it in, and we got 51 applications, which at
least the people who filled them in felt were big data. Whether they were
is not quite clear.
And they covered these areas: government operations, commercial, defense,
healthcare, deep learning, ecosystems, astronomy, physics, the various earth
and environmental [indiscernible] sciences, and energy. And I should note
that the conclusions I will draw from this will be somewhat biased, because
if you look at the big data cycles associated with these applications, they
don't sort of average out to the way big data is processed on real
computers, because we only have one thing called search and one thing called
Netflix, and those are representative of a few applications. So it's
slightly -- you have to bear that in mind in what I say.
So this is a little more detail. These actually have all the 51
applications listed, each of which has these 26 features. There's an
excellent website; you can just type this plus big data, and you'll find
that website. And it has online all these use cases, all the analysis of
the use cases, and all the forms they filled in. And as I say, it's
somewhat biased toward science, partly because the Department of Energy and
NASA put a lot of pressure on their labs to respond to my request for use
cases. NSF did not put any pressure on their investigators. All right.
So as an illustration of the type of information that was gathered, here is
just a screenshot of a summary which has these magic Vs -- volume, velocity,
and variety -- plus the software used and the analytics algorithms used, for
six of the use cases. And you can go and browse all the use cases and get
that information.
>>: [Indiscernible].
>> Geoffrey Fox: I will point out that these were not spell-checked and are
full of illiteracies and probably incorrect statements, and except where the
correction was obvious, I did not correct the data. This is raw data. It
has to be correlated somewhat.
As well as this, Bob Marcus, who was the lead of [indiscernible], suggested
ten types of applications which come more from the commercial -- how do we
go back? This is not --
>>: [Indiscernible].
>> Geoffrey Fox: Am I missing the -- it's this thing. I think it's this.
So he suggested these ten generic use cases, which largely come from the
commercial data processing world. But all we have is these ten titles; we
don't have 26 features for each of them. So these just sit there as areas
which possibly are not so well covered. We do have some commercial use
cases, like the Mendeley company, who do citation processing -- they have a
pretty interesting entry. We have a generic discussion of financial
operations and things like that. And then we also have, in more detail,
though they didn't have the same profile, some pretty interesting security-
and privacy-oriented use cases covering things like education, the military,
cyber security, and areas like that where security and privacy were
particularly important. So we had 51 rather open use cases filled in at
this level of detail, plus 20 additional use cases, which we have to bear in
mind.
So I went through those use cases and tried to understand them, to try to
draw the equivalent of the Berkeley dwarfs. People may remember the
Berkeley dwarfs, which were a summary of parallel computing from the past,
and so we're trying to capture the essence of these use cases. You can use
the terms kernels, mini-apps, patterns. And I should point out I am
incompetent in databases, so I will focus on the non-database aspects of it.
And if you want to go online, I actually have a slide there; there's a
recorded video going through all 51 use cases, discussing how this is done.
And so let's just remember what happened in parallel computing. We have the
most famous kernel, called LINPACK, which is notable for not being useful
but being used by everybody.
[Laughter]. And we'll come back to the replacement of LINPACK soon. I think
more relevantly, we have the so-called NAS parallel benchmarks. They were
very influential -- I think very important, very nicely done. They
originally were pencil and paper: they specified the algorithm, but
eventually they became actual code which you could run with an MPI library.
And there are also lots of specialized benchmark sets, which NSF used to use
to judge your solicitations, but the last two NSF solicitations have
abandoned those benchmark sets, partly because they're sort of cloud
related. They're not quite certain what the right benchmark set is.
The Berkeley dwarfs -- I'll give you a little bit on that; they're possibly
the most famous -- and Jack Dongarra and others had something called
templates, which is a book or two which he published.
So here are the NAS parallel benchmarks, so we can see what it is. It's
basically libraries of very core algorithms which were buried deep in
simulation codes, plus pleasingly parallel and conjugate gradient sorts of
classes of applications.
We will find, from what we've done in the past, that when you try to
abstract the essence of these applications, it's not possible to do it at
the same level. You have to go all over the place, because there are
different characteristics, and it's not quite clear how to do it. So I call
that facets. We have to look at different facets of the applications.
Here are the Berkeley dwarfs, and they actually came from Phil Colella, who
is a brilliant -- I think he is an applied mathematician. I think he does a
lot of combustion, if I remember rightly. The first six came from Phil, and
then Berkeley added more -- they even added MapReduce.
But again, if you look at this classification, [indiscernible] MapReduce is
not quite in the same -- they're not looking at the same level.
So I don't think that's bad. I'll use it as an excuse for not doing things
at the same level for big data.
All right, so now let's analyze these 51 use cases. One interesting thing
is what they're parallel over -- if they're big data, they're probably
running in parallel over something. And you can see they're running in
parallel over lots of different things. People -- and these are either
people being analyzed or people actually using the system, or both. Items
such as images, gene sequences, material properties, sensors; events such as
astronomical events; nodes of an RDF graph, nodes of a neural network;
tweets and blogs and documents and web pages; files.
And as one of the interesting sources of big data is simulations, those are
parallel over the standard simulation decompositions. So we're parallel
over lots of different things.
So now we try to look at the 51 applications and say what they used. And
again, this is not uniform.
So 26 of them I decided were pleasingly parallel. 18 of them are classic
MapReduce. Some of them are sort of halfway in between MapReduce and
pleasingly parallel, having some relatively trivial global sums at the end
to gather statistics together.
23 of them are candidates for iterative MapReduce, or what I call global
machine learning, where the machine learning is optimized over the many
nodes of a parallel application.
Nine were graph-related. 11 involved data fusion.
And 41 of those 51 involved streaming in some sense, so that was actually
the most common category. So those are what I called the low-level
features.
Here are some other ways of looking at them. 30 of them were classifying
the items which we were parallel over. 12 of them were doing classic search
or query or indexing. Four involved collaborative filtering.
36 of them involved what I call local machine learning. That means you take
your problem, you divide it into parts, and you apply machine learning to
each part. And I think I have an example here.
So here is one of them -- use case 18, computational bioimaging. This is
like light sources. And this was submitted by Lawrence Berkeley Lab. And I
believe they're running complex machine learning or image analysis on each
image taken from this synchrotron source, but I think they're doing it
pleasingly parallel over images.
And pathology, which we also have as an entry -- typically that's done
parallel over pathology images, with incredibly complex machine learning on
each image. So that's what I mean by local machine learning.
Global machine learning, which in my way of thinking is roughly the same as
iterative MapReduce -- that's a little less than half of them -- includes
things like deep learning, clustering, latent Dirichlet allocation, and
things like large-scale stochastic gradient descent, et cetera.
16 of them involved geographical information systems -- the defense
applications and the earth and environmental [indiscernible] sciences
involve GIS, so that's pretty popular. Five of them involved looking at the
data from large-scale simulations; two of them used agents.
So these are just ways of trying to understand the richness of big data.
I gave you that one. So this one -- that bio one was local machine
learning. There's another very talented assistant professor at Indiana
University, David Crandall, who actually works with Judy to analyze image
data by effectively trying to take the world's images and do a global
[indiscernible] or global maximum likelihood that determines the camera
position and orientation of each image, so they reconstruct the world.
So that's a good example of imaging with global machine learning, as opposed
to the previous one, the bioimaging, which was the local thing. Okay.
So here you are, reconstructing all these things automatically. That
currently uses either iterative MapReduce or MapReduce.
That
And that one does actually start off with local machine learning to do the
initial classification of the images, but after it does that, it does this
global machine learning. All right.
So then we can ask about aspects of what's going on here. We often actually
ask the same questions you ask in parallel computing: [indiscernible] per
byte, communication and interconnect requirements, and whether the
application is constant or dynamic -- is it a regular graph or an irregular
graph?
Most of the problems are bulk synchronous processing, because most data
[indiscernible] MapReduce is BSP. Some of them are asynchronous, where you
get shared memory. We already mentioned [indiscernible]. The data
abstraction -- key-value, pixel, graph -- is important. And then you could
also try to work out what core libraries you need, whether it's matrix
algebra, et cetera. And you can also try to work out from these things
issues about where the data comes from: sets of files, Internet of Things,
streaming, simulations, GIS, SQL, NoSQL, et cetera. And, as I mentioned
earlier, there's a class in here which is experimental observations, where
you get this block streaming, where you gather data for a month. The work I
do with people at the North and South Pole -- they take data for a month or
so, and, unfortunately, there are no trucks at the North Pole coming to the
USA, but they can fly it from the North Pole to here.
Okay. So if you look at all those use cases, here are five execution
strategies. Pleasingly parallel, which is probably the major one -- it's
probably half the applications that are pleasingly parallel. That includes
local machine learning.
We can abuse Hadoop to do this, in the language of the first speaker, or we
can use high-throughput computing or many-task tools.
Then there is what you might call classic MapReduce, which is the search
algorithms, including collaborative filtering, where MapReduce is very
efficient. Then there's what we call map-collective, which is where
iterative MapReduce, Spark, and things like that come in, where you have a
map phase followed by a collective phase, and then you iterate that. There
is the class covered by Giraph, which used to be Pregel, which uses
point-to-point communication more and also tends to be iterative.
And then finally there's a class which, at least in the literature, is
largely done asynchronously, with thread-based algorithms running on large
shared-memory machines.
So those five execution models sort of cover those 51 use cases, at least I
would say they do.
All right. So this is a chart I've shown for many years: the different ways
of doing MapReduce -- given the first talk of the meeting, either uses or
abuses. We have map-only; classic MapReduce -- those are the first two
execution models I mentioned; iterative MapReduce, which is map-collective,
or map followed by a large communication structure; and then point-to-point,
which is where simulations are, and where Giraph and [indiscernible] also
are. And you can cover all of them as map-communication, if you want to
cover Giraph and iterative MapReduce. So, all right.
So you can now just write down all the algorithms and classify them in these
areas -- and since we don't have so much time: we have map-only, that's
local machine learning; MapReduce, which is certainly where search and query
come in, things like large [indiscernible], summarizing statistics,
recommender systems, [indiscernible] classifiers; map-collective, where we
have lots of nice algorithms, which go on to the next page; and we have
map-communication, where we have the graph algorithms, and we have the
asynchronous ones. So this classifies the algorithms into these execution
models.
So now let's compare with what we learned from simulation. Pleasingly
parallel is important in both. I remember in 1988 I gave a talk about
parallel computing and identified pleasingly parallel as an important
parallel computing model, so it's still very important.
Nearly all of these are single program multiple data and essentially bulk
synchronous processing. And obviously iterative MapReduce is used in
several important problems; it's not a common simulation paradigm, though,
except for maybe some of the cases where you do a reduce phase after
pleasingly parallel execution of various tasks.
Big data often has large collective communication -- broadcast, things like
that, and of course reductions -- whereas simulations are normally
associated with small messages. So there's a difference there.
One important difference is in sparseness. Simulation is essentially always
sparse, except for a few examples like electromagnetic simulation. And in
the case of big data, we have some sparse cases, like PageRank and bags of
words, which tend to be sparse, but there are lots of important big data
problems which are not sparse. So there's a difference in the importance of
full versus sparse algorithms.
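To make the full-versus-sparse distinction concrete, here is a small
illustrative sketch in SciPy (the sizes and density are made up):

    import numpy as np
    from scipy import sparse

    A = sparse.random(1000, 1000, density=0.01, format="csr")  # sparse operator
    B = np.random.rand(1000, 1000)                             # full (dense) matrix
    x = np.ones(1000)
    y_sparse = A @ x   # sparse matrix-vector product: touches ~1% of entries
    y_full = B @ x     # dense product: touches every entry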
So this picture here is one I often use to make this [indiscernible] design
point. If you look at the graph on the right, that's a classic
Facebook-like page of friends, where we have people linked together. That's
a typical graph problem.
If you look at the picture on the left, that's a particle force problem.
There are a lot of analogies between particle force problems and Facebook
problems.
>>: [Indiscernible] except that here [indiscernible].
>> Geoffrey Fox: I didn't say they were the same, but there are sort of
relations between them. There are similarities. They both get executed in
the map-communication model, and so that's what I wanted to say. And also
there are lots of long-range force problems to which you can get some
analogies in the big data problem, because there's a lot of [indiscernible]
-- if you calculate distances between points, and you often compute the
distance between all points, as you often do, that looks like a long-range
force problem. So that's -- all right.
>>: So -- doing it that way, [indiscernible] order N squared.
>> Geoffrey Fox: They are order N squared, definitely. There are lots of
order N squared problems. [Indiscernible] a lot of clustering problems are
order N squared, because you compute the distance between all points.
>>: So you're implying [indiscernible] big data [indiscernible].
>> Geoffrey Fox: That's what this says here -- [indiscernible] multiple
ideas, yes, but there are variants in that.
It also -- so this tends to give you full matrix algorithms. And I note
that even in, say, deep learning, the networks tend to be sparse, because
there's no full connectivity -- you don't get connections between all
neurons -- but the connections actually come in blocks, therefore full
matrix algorithms, and that's why they run very well on GPUs.
>>: There are a lot of data analysis algorithms that are not about matrix
[indiscernible] --
>> Geoffrey Fox: That's actually what I tried to capture up there with
Giraph and map-communication. If you look at the deep learning for face
recognition, or learning how to drive a car, which is done by
[indiscernible] group at Stanford, those run very well on GPUs, and that's
because the [indiscernible] of that problem is matrix-matrix multiplication.
And that type of -- so all I'm trying to do here is compare what we've
learned in these various problems, and so --
So how much longer do I have?
>> Vani Mandava: About three more minutes.
>> Geoffrey Fox: Okay. Well, I'll just finish up quickly then.
So we're trying to understand how to implement these ideas -- that's with
[indiscernible] in the audience, [indiscernible] at Rutgers, and some of the
work Dennis is actually helping with on Azure. And basically we use this
concept of a high-performance Apache big data stack, by effectively
integrating HPC with the Apache stack. One key thing is that below the
Spark, Hadoop, et cetera layer, we've introduced a communication layer,
where we try to take the best possible communication algorithms, which are
often those done in MPI. So our goal is to get the functionality of the
Apache stack and the performance of HPC. So you can look at that. There
are lots of layers.
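As a flavor of the kind of collective that layer optimizes, here is an
illustrative allreduce in mpi4py -- a generic sketch, not the project's
actual code:

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    partial = np.random.rand(8)                # each node's partial results
    total = np.empty_like(partial)
    comm.Allreduce(partial, total, op=MPI.SUM) # the optimized collective step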
So let me just -- these figures here are from Azure. This is Twister4Azure,
which is iterative MapReduce running on Azure, and we have some graphs here
comparing the use of collectives to improve the performance of -- in this
case, [indiscernible]. This is already using iterative MapReduce on Azure,
and if you introduce, in this case, the allreduce collective and implement
it in an optimal fashion, you get better performance. And you can compare
that across the broader platforms we have here. So actually, this compares
Twister4Azure -- these three here -- with HDInsight, which is the slowest,
and also with Hadoop using the same type of machines, and you will find that
using these collectives, it runs much faster, because we're doing MapReduce
[indiscernible] iterative problem [indiscernible]. It's also
[indiscernible] -- whenever we measure it, Java is faster than C#. So the
[indiscernible] HDInsight actually runs faster in the actual compute part
than Twister4Azure, because Twister4Azure is running in C#. But the
communication part is slower.
So the idea here is that by introducing collectives which we implement in an
optimal fashion -- so that's these [indiscernible], the things implemented
in an optimal fashion -- we can get substantially better performance. And
this one here is the same thing run on Linux, using an optimal
implementation for Hadoop running on Linux. So that's that.
So I think that's a reasonable place to stop. So what did I try to tell
you? I tried to look at these use cases, tried to identify key features,
compared those features with what we know from parallel simulations,
identified sample classes of applications, and noted they can be run in
various different [indiscernible] execution modes. And at the end here, I
showed that for one class of those, those doing map-collective, we could
actually implement these collectives in an optimal fashion, and then you
can put that in Spark, or put that in Hadoop or HDInsight, and get
substantially better performance. The idea is that you first identify the
need for that type of execution pattern -- the map followed by collective,
or the map followed by communication, a la Giraph -- and then realize that
you can implement that.
So the programmer just knows, I'm doing map-communication, and they invoke
the communication, and then underneath the hood, whether it's
[indiscernible] Linux HPC or Azure or Amazon, the system does the optimal
implementation. That's the basic idea.
I'll stop there. Thank you.
[Applause]
>>: [Indiscernible] HDInsight.
>> Geoffrey Fox: That's the slowest by far.
>>: So actually from Microsoft's point of view where you pay as you go, that
would generate the most revenue.
>> Geoffrey Fox: Definitely. And that's [indiscernible].
Get your performance -- performance [indiscernible].
>>: Good thinking.
>>: Were you implementing on top of MPI --
>> Geoffrey Fox: No, I'm not using MPI. We're taking the collective
algorithms. If you look at the last 20 years, people have been building
optimal algorithms for MPI for every single one of the 230 primitives. So
you look at that work and you try to implement it in the best fashion for
the platform. So this is definitely done separately, say, for Azure and for
an HPC [indiscernible] cluster -- just as, in fact, the MPI work is done
differently for [indiscernible] different network topologies.
>>: Could you just comment on whether these -- the HDInsight
[indiscernible]. These are running on [indiscernible]?
>> Geoffrey Fox: All of them, yes.
>> Geoffrey Fox: These are Hadoop running on an HPC cluster. These things
here are Twister4Azure, with different implementations of the communication,
running on Azure, and it's got optimized communication that we did for
Azure. The yellow -- the orange one is HDInsight running on Azure as
delivered by Microsoft, without any optimization of the communication. So
it's meant to illustrate that you can take the same concept of optimal
collectives, do it on HPC, do it on Azure, get a uniform programming model,
and then get better performance on both of them.
>>: One question. [Indiscernible] problems -- particularly all of my
problems, but I suspect this is more generic -- is that there's a degree of
[indiscernible]. You know, like you're operating over multiple images, but
each image --
>> Geoffrey Fox: You could do that, yes. That's certainly obviously
possible, yes, and some people do that.
>>: I guess the question then, when you're looking at these, is if I have
got 64 [indiscernible], for example, how many machines ought I to divide
that into? Should I use sixteen 4-core machines or one 64-core machine, in
terms of --
>> Geoffrey Fox: So I mean, this rather crude analysis here won't directly
give you that answer. It will tell you that when you want to do the
parallelism inside images, you're going to be pretty sensitive to
communication latencies and things like that, and you ought to try to use
these optimized communication primitives. And [indiscernible]'s project,
where she is putting these in as Hadoop plug-in operations -- she has very
nice results from that. So that's our plan. Our plan is to help you do
that efficiently by identifying these different models -- the pleasingly
parallel model, which you want to use in your hierarchical parallelism --
and then, within that pleasingly parallel model, we obviously know that for
a given size problem, the efficiency is optimal for a certain number of
cores, because of the communication-to-calculation ratio which we
[indiscernible] started. So if we have the right language, you can do those
experiments.
>>: [Indiscernible] if your inner parallelism gets the same speedup with
four cores as it does with eight, then you're better off with the four-core
solution and [indiscernible].
>> Geoffrey Fox: That's right.
>>: Cost-benefit analysis.
>>: And the cost [indiscernible].
>>: That's right. So the 16-core machines that are available on Azure,
that [indiscernible].
>>: We don't have that system.
>> Geoffrey Fox: I'm sure we can build an auto-tuning solution, using ATLAS
or some variant of ATLAS, [indiscernible] collectives on Azure and the costs
of the various instances, and it will tell you -- in fact, automatically
deploy on the right instance. Clearly one can do that. That's
[indiscernible] -- that's sort of engineering.
>> Vani Mandava: So we have one more talk, after which there is a break, so
if there are more questions, maybe we can take them in the break. We have
one more talk right now and then the break. Thank you, Geoff.
[Applause]
>> Vani Mandava: So our next speaker is Jeff Zhang. Jeff Zhang is a program
manager from Microsoft Excel. He is in charge of the design and
implementation of Excel data and analysis features and tools on the Apple
platform. Recently his team shipped Office for iPad, which I'm sure you
have all heard about. Prior to Excel, he worked on Microsoft Exchange
Server, and before joining Microsoft, he was an assistant researcher at the
Chinese Academy of Sciences. Please welcome Jeff Zhang.
>> Jeff Zhang: Good afternoon. It works. So today I'm going to show you a
short demo about Power BI for Excel. We talked about Excel just now.
[Indiscernible] generally Excel is -- let me see. It's not coming up. Is
it up? Okay.
So we talked about Excel being a mix of love and hate. It's actually a
similar [indiscernible] product, but I hope today's short talk, about the
problems and tools that we [indiscernible] in Excel, adds some more love to
this product.
So as I remember -- I think this is a true story -- in my research days back
in China, [indiscernible] as a graduate student, professors always assigned
me tasks to do a lot of data crunching and data processing work. Of course,
those are the kinds of work that are necessary to come up with good insights
based on massive and varied data. However, as a researcher, this is not
really something that's all that delightful in terms of using our brain
power.
So what we really want is to use a small portion of our time to process the
data, and probably also to come up with an easy way to provide various
angles to view our data, so that we can get quicker data insight.
So really, the Excel Power BI tools are for that purpose. We say that the
more data we have, the more power we have, but that is only true when we
have the proper tools. So Excel Power BI is actually a set of features or
add-ins, including Power Query, which is the engine inside Excel on the
desktop that has the capability of processing millions [indiscernible] of
data. And for data processing -- starting from grabbing data from somewhere
that you [indiscernible], to formatting the data into a format that you want
it represented in -- Power Query really is the tool in place to help you
quickly achieve that purpose.
Power View and Power Map are both [indiscernible] tools in Excel that help
you view the data on a map or on a dashboard, which enables you to
[indiscernible] your data from various angles.
So my demo today starts with [indiscernible], which some of you may have
already seen yesterday. This is a simple video that I generated from data
that I got from one of our researchers, which shows the Guadalupe River
stream flow at different locations along the river. The time series is from
June to November, I believe. And through this simple map, we can tell what
the stream flow looks like across [indiscernible] as well as the locations.
So some insights come [indiscernible] from this data. For example, you see
in the data in October there's a very obvious spike, sometime like here.
Without visualizing the data like this, I might be able to find that through
some pie chart or line chart [indiscernible] from the pure raw data, but by
presenting it this way, it's easier for me to grab the insight and ask
whether the reason for this spike of the stream flow over that area is
because of the [indiscernible] happening only around this area, on the
downstream of the Guadalupe River, not upstream.
So this is a very, very simple example of the [indiscernible] idea that I'm
going to show you.
Go back to Excel. So in these several spreadsheets, you see data like this;
these are the raw data that I got from sensors at the locations you just saw
along the Guadalupe River. So it has the name of the sensor, the longitude
and latitude, which is the spatial information, and also the datetime and
the stream flow data. So if you take a closer look at the data, this data
is just raw text in a stream format directly from the device. The device
may be designed to record data in a way that makes sense for the device to
understand or use. However, as a common-purpose representation of data
[indiscernible], this format is not really usable for visualization. So we
need to reformat it -- fix the datetime format, get the data into a table
[indiscernible].
What I can do here is use the tool I just mentioned, Power Query, to process
the data. What I'm going to do first is create a table out of this data.
Once you have Power Query installed on your machine, Excel shows a Power
Query tab where you can get data from various sources: the databases that
you have, or data preprocessed from some terabytes down to gigabytes in
different databases -- not only Microsoft ones, but also the Azure Data
Marketplace, even Hadoop.
But here, just to show the power of the data processing, I'm simply
importing the data from this single table. And there is the Power Query UI:
along the top of the UI, you have a series of functions that you can use to
process the data in the query, like [indiscernible], or crunching the data
by reducing different kinds of rows and columns.
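For reference, that import step corresponds roughly to the following minimal
sketch in Power Query's "M" formula language; the table name "SensorData" is
an assumption for the sketch, not a name from the demo workbook:

    // Hedged sketch: load a worksheet table into Power Query.
    // "SensorData" is an assumed table name.
    let
        Source = Excel.CurrentWorkbook(){[Name="SensorData"]}[Content]
    in
        Source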
On the right-hand side [indiscernible], you see the query name and all kinds
of [indiscernible] steps that you can just replay. Here is also the grid
that we just saw in Excel. What I'm going to do now is extract the date
information from the first column. Without this tool, what I would do is use
a formula to find -- probably find this [indiscernible] character, extract
the [indiscernible] before that character, and then apply that to all the
cells in the first column.
However, although that is not complicated, every time I want to do it, it
still takes me some time to come up with the formula. Using Power Query,
there's a single function called Split Column. So by [indiscernible] the
first column: split, and by the delimiter -- I can use different
[indiscernible], but in this case I want to use the one which is T -- and
split from the first row. Then I see that [indiscernible] asked me what
[indiscernible] into two columns, and the first one has been changed to the
date and the second one to the time. So there's an automatic step taken
here: beyond just purely splitting the data from one column into two, Power
Query also recognized and assigned the data types, date for the first column
and time for the second one.
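As a rough illustration, the same Split Column step could be written out in
M like this; the column names "DateTime", "Date", and "Time" are assumptions
for the sketch:

    // Hedged sketch: split the raw timestamp on the "T" delimiter, then
    // set the column types explicitly, mirroring the automatic step.
    let
        Source = Excel.CurrentWorkbook(){[Name="SensorData"]}[Content],
        Split  = Table.SplitColumn(Source, "DateTime",
                     Splitter.SplitTextByDelimiter("T"), {"Date", "Time"}),
        Typed  = Table.TransformColumnTypes(Split,
                     {{"Date", type date}, {"Time", type time}})
    in
        Typed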
Now, because I want to represent the data across several months --
[indiscernible] the data comes in every 15 minutes each day -- what I really
want to see is, for every day, what's the average stream flow. I could also
do that through some formulas, or even macros, in Excel to aggregate the
data.
However, Power Query also offers me the ability to do such aggregation with
a few very simple clicks. I just need to use [indiscernible] and say
[indiscernible] the information by the first column. What happened? The
first column, which is the date-and-time column, [indiscernible] and I name
a new column as the stream flow average. So don't look at it. Listen to me.
>>: [Indiscernible].
>> Jeff Zhang: [Indiscernible]. Okay. It's back. Good.
So I need to aggregate the data. Actually, you see, the third column is the
stream flow, so I choose the stream flow column, I click okay, and then you
see the data [indiscernible] grouped only by date, along with the average of
the data for each day.
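In M, that Group By operation corresponds roughly to the sketch below,
building on the split-and-typed table from the earlier sketch; the
"StreamFlow" column name is assumed:

    // Hedged sketch: one row per date, averaging the stream flow readings.
    let
        Source  = Excel.CurrentWorkbook(){[Name="SensorData"]}[Content],
        Split   = Table.SplitColumn(Source, "DateTime",
                      Splitter.SplitTextByDelimiter("T"), {"Date", "Time"}),
        Typed   = Table.TransformColumnTypes(Split,
                      {{"Date", type date}, {"Time", type time}}),
        Grouped = Table.Group(Typed, {"Date"},
                      {{"Stream Flow Average",
                        each List.Average([StreamFlow]), type number}})
    in
        Grouped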
And remember, we still have longitude and latitude data in the original
table, so what I want to do next is to lay all that data into one table so
that I can put that data on a map. By doing that, I can simply add a new
column to the table. This is the longitude information that I remember. And
I just want to click okay, and you see it's very easy.
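That Add Column click maps roughly to Table.AddColumn in M; in this sketch
the table name "DailyAverages" and the coordinate values are invented purely
for illustration:

    // Hedged sketch: attach a sensor's coordinates as new columns so the
    // aggregated table can be placed on a map.
    let
        Grouped = Excel.CurrentWorkbook(){[Name="DailyAverages"]}[Content],
        WithLon = Table.AddColumn(Grouped, "Longitude",
                      each -121.99, type number),
        WithLoc = Table.AddColumn(WithLon, "Latitude",
                      each 47.10, type number)
    in
        WithLoc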
These are the data that I processed in Power Query. You see, these are
things that you might be able to do cell by cell, either with formulas or
through functions, or by using macros. But here, with just a few clicks,
within a couple of minutes I can process that data into one new table. This
is just one sensor's data, but you can imagine doing the same thing for the
sets of data in the other sheets.
So the [indiscernible] question is: after I've processed all of those data
sets and I want to represent them in one chart, the next step I need is
really to connect, or append, those data sets into one big table. How can I
do that? Actually, I can also do that with Power Query. This is the data
that I really want: one table holding all seven sensors' data. So what I
need to do is just bring back the data that I already have, using the same
tool, which is Power Query. Bring up Power Query, and there's a very obvious
icon we call Append. By clicking Append, we append the various data sets
that we just processed into one set of data. Then [indiscernible] on that.
[Indiscernible] the names are not the same, and [indiscernible]
automatically append them. So that's how we use Power Query to process the
data.
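The Append step maps to a single Table.Combine call in M; in this sketch the
per-sensor tables are inline stand-ins with invented data, just to keep it
runnable:

    // Hedged sketch: append per-sensor tables into one. Table.Combine
    // matches columns by name, which is why the headers need to line up.
    let
        Sensor1  = #table({"Date", "Stream Flow Average"},
                       {{#date(2013, 10, 1), 12.3}}),
        Sensor2  = #table({"Date", "Stream Flow Average"},
                       {{#date(2013, 10, 1), 8.7}}),
        Combined = Table.Combine({Sensor1, Sensor2})
    in
        Combined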
Next, what I'm going to show you is how to visualize them using Power Map.
Power Map is a separate [indiscernible], so once you have that
[indiscernible] installed, you will see a Map button on the Excel tab.
[Indiscernible]. Click on Map, and it will bring up the UI. Over here on top
is the visualization that I already generated yesterday. Now I'm going to
show you how easy it is to generate a new one based on the same set of data.
Click on new tour. And then, very quickly, Power Map automatically
understands that I have latitude data and longitude data in my data set and
can use them as a reference; click on next. The resolution here is really
too small, so it's hard to see. It's already crashed. The resolution is too
small. Let me try it again.
Create a new tour. When you click on create a new tour, Power Map will start
to digest the data, look through all the data that you have in this table,
see what it can use as a spatial reference, and automatically set that
[indiscernible] spatial reference data: longitude, latitude.
I think we have some problem over here, because the resolution is really too
small.
Time series is right. Category is right. And for stream flow, I just need to
use no aggregation. And I still need [indiscernible] longitude and latitude.
I probably cannot do this [indiscernible], just because the resolution is
too low and that slows the display for Excel. So I'll just keep this one,
but I can show you later on my machine.
The last part I want to show you is another visualization tool that I just
mentioned, which is Power View. Power View is essentially another type of
[indiscernible] that you can have in Excel: a [indiscernible] dashboard that
lets you put different visualizations on one sheet so that you can view data
from multiple angles.
For example, this is essentially a spreadsheet, but this dashboard is like a
special spreadsheet. Earlier I showed you a map visual; this one has a map
visual and also a line chart. The best part is not only that you can see
both the spatial data and the [indiscernible] data in one view, but also
that they are associated, and you can explore the data. For example, if I
only want to see [indiscernible] data, I can click on this [indiscernible]
data, and then no [indiscernible]. So I can drill down into the data for a
specific [indiscernible] alongside [indiscernible] to see the trend of the
stream flow.
Just to show you an example, I just enable a single data [indiscernible].
For example, if I only want to see [indiscernible] marine data, I can use
fields to filter the data by the names of the marines and show only
[indiscernible] three on this map. Very easy. And then I can see the
patterns of the stream flows: they're actually very similar to each other.
So I ask myself the question, why are they similar? Then I look at the
left-hand side, at the map. Just geographically, these three marines are
very close to each other -- Green [indiscernible] and [indiscernible] -- and
they may share the same water [indiscernible]. That may be the reason they
have a similar pattern of stream flow. So I explore the map a little bit
more and see there's a snow mountain over there, Mt. Rainier, very famous
around this area. And then I see that the stream flow peaks during
September. Maybe that's the time of year when Mt. Rainier melts the most,
and that's why we get the most stream flow in this area.
And you see that although the pattern is similar, in [indiscernible] the
stream flow is almost double, even more than the stream flow between Green
and [indiscernible]. So I can make a guess: maybe the streams come down from
Mt. Rainier to [indiscernible], then come back to [indiscernible] and go on
to Puget Sound. So this just shows that this interactive dashboard, by
putting multiple views of the data on one sheet, really helps you get
quicker data insights.
The data itself is very simple, and you may think it's very complicated to
produce this kind of dashboard. Actually, it's very simple.
What I do here is go to the Insert tab, and if you have the Power BI
[indiscernible] installed together with your Excel, then you will see this
Power View button over there. Click Power View, and then you will create a
new one. There are a few tables in [indiscernible] spreadsheet, which are
essentially [indiscernible] the same as the ones I just showed you for the
marine data on the Guadalupe River, so I just need to add the data onto the
sheet by checking all of them. This data is basically the raw
[indiscernible] data for the marine metadata. Then I can change the design
of the data by saying: instead of using a table, just represent it on a map.
Click on the Map button and adjust the setup a little bit -- longitude goes
to longitude, latitude goes to latitude, and [indiscernible]. So now I can
see all the marines on the map. Both the [indiscernible] map and the Power
View map use the Bing Maps service at the back.
And how can I add the data, like the [indiscernible] on the right-hand side?
It's also very simple. I just need to add the data to the [indiscernible]
again by adding all of the fields, and then go back to design. It takes some
time.
This projector really makes things very hot.
So the idea here is: you see that I seldom used my keyboard to type anything
throughout the process that produced these visuals. It's just to show the
power of the visualization and data processing tools in Excel; you can
imagine that [indiscernible] tools can save you tremendous time in getting
to data insights.
So the demo was not as smooth as I'd like, but some of you already saw it
yesterday [indiscernible] without this slow projection resolution. That's
what I wanted to show today.
[Applause]
>> Vani Mandava: Any questions?
>>: So earlier today we saw integration of [indiscernible] R and Python with
Visual Studio. Can you imagine that type of integration with the Excel
environment, to get more direct access to data?
>> Jeff Zhang: Yeah, those are very good questions. So we talked about
macros, which are just one type -- the most important [indiscernible] -- for
Excel and the other Office apps. Another programming model that we are
trying to build with Excel 2013 and going forward, similar to what the
program manager [indiscernible] just mentioned, is called Apps for Office,
which essentially is a [indiscernible] extension that's built inside Excel.
For example, with that feature -- let me see if I can show it here; I'm not
guaranteeing that I can show it successfully. You see these [indiscernible]
buttons over here, Maps or People Graph; those are the new extension model
that we are enabling starting from Office 2013. They are essentially
services based on a web programming language on the service side, and
[indiscernible] that, if you follow some of the metadata methods that Office
requires, you can put them as part of the content in Excel [indiscernible]
or as a [indiscernible]. Let me see if I can just try that out very quickly.
For example, this is a new [indiscernible] app -- or view it as a new
extension for Excel. The content in this app is essentially a web page, a
JavaScript-powered web page, that reads data from Excel, or writes into
Excel, by interacting with the user's behavior in this view. You can imagine
this is a very simple chart, People Graph, but it also enables the map API
and any other service that we have already built for other places. As long
as it can be exposed somehow on a web page, you can embed that portion of
the [indiscernible] part, or the visualization part, into Excel and present
it together with the data inside Excel [indiscernible].
So it's called Apps for Office. Go back and check it out.
>>: You talked briefly about how you shape the data and bring down just what
you want into the graph, but [indiscernible] spreadsheet. What is the
current functional limit on the number of rows in Excel?
>> Jeff Zhang: That's a very good question. So [indiscernible] the number of
rows has a limit of [indiscernible] integer, I think. But behind that --
this is something that's not exposed to the end user -- there is Power
Pivot. Power Pivot, you can imagine, is a separate data compression and data
indexing engine inside Excel, and all the data that's been processed by
Power Pivot is stored in Excel not in the open Excel format but instead in
[indiscernible] format. The power of it is that Microsoft used the same data
compression and indexing engine on both [indiscernible] service in the
[indiscernible] and on a server, and with the same bits of code we put it
into Excel. So the real limit on the amount of data is just the memory on
your machine. There is a demo -- not the demo for today, but there are other
demos you can check out -- where a [indiscernible] task, a calculation or a
query over a table with more than one [indiscernible] rows inside Excel,
runs in less than one second.
And that was tested on a very old laptop running Windows [indiscernible].
>>: I was wondering -- I had a few questions here. The first one is, when
you're doing these visualizations and mapping the different columns, when
you save state, is all of that saved, so that when I come back, that's
exactly what I get, with my visualization in whatever form I had it? That
was my first question. The next one was, when you said you could import from
HDFS, is that using WebHDFS, and what's the authentication mechanism, if you
know? Because HDFS by default likes Kerberos and doesn't like a lot of other
things. Is it using the WebHDFS RESTful API to grab the data? And then, will
it work on my Mac, would be another question.
>> Jeff Zhang: Good question.
>>: And then, sort of playing on the first question about saving state, do
you see building anything into here to allow auditing of what you've been
doing, in the trend towards research reproducibility? If I had some kind of
audit trail, I could pass this over to somebody else, and they could see my
steps and be able to reproduce what I've done, and maybe tweak it, if you're
trying to build Excel into a data analysis environment.
>> Jeff Zhang: Right. So, first question: all the visualizations and data
processing are stored in the Excel file. As long as you open that Excel file
-- even from another machine, maybe as a different user -- you can see them;
all of that information is stored inside the file as long as you saved it.
Second, the Hadoop server: you just need to type in the server name, and
then there's [indiscernible]. I'm not quite sure, but what I [indiscernible]
before was just purely anonymous sign-in for that Hadoop server. I think
[indiscernible] should be supported within the Power Query tool, but I'm not
quite clear about the exact authentication matrix that's supported. I can go
back and find out; yeah, just shoot me an e-mail and I can.
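For reference, Power Query's M language does expose a Hadoop connector,
Hdfs.Files; a heavily hedged sketch is below, where the host name and URL
shape are assumptions, and, per the answer above, the authentication
behavior is exactly the open question:

    // Hedged sketch: list files from a Hadoop cluster via Power Query's
    // HDFS connector. The address is invented; auth options may vary.
    let
        Source = Hdfs.Files("http://namenode:50070/webhdfs/v1/data/sensors")
    in
        Source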
Third question: whether those things are coming to the Mac. We are actively
working on that. We are very busy working on Office [indiscernible], so next
is Office [indiscernible]; hopefully it can come soon.
The fourth one is the [indiscernible] steps that you mentioned.
[Indiscernible] today we didn't show that very clearly in the demo because
of this projector thing, but every step that you conduct in Power Query is
recorded [indiscernible], one step after another, as steps that we can
replay. Beyond that, there's a script at the back that you can also see and
modify: if you know the programming language for Power Query, you can also
do it in the programming way. And when you share the data with another
person, they see all of those steps as well.
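That recorded script is written in Power Query's "M" formula language. As a
hedged sketch of what the script behind this demo's steps might look like --
every click above becomes one named, replayable step, which is what gives
you the audit trail -- with all table and column names assumed rather than
taken from the demo file:

    // Hedged sketch of a full recorded query: import, split the timestamp,
    // set the types, and aggregate to daily averages.
    let
        Source  = Excel.CurrentWorkbook(){[Name="SensorData"]}[Content],
        Split   = Table.SplitColumn(Source, "DateTime",
                      Splitter.SplitTextByDelimiter("T"), {"Date", "Time"}),
        Typed   = Table.TransformColumnTypes(Split,
                      {{"Date", type date}, {"Time", type time}}),
        Grouped = Table.Group(Typed, {"Date"},
                      {{"Stream Flow Average",
                        each List.Average([StreamFlow]), type number}})
    in
        Grouped

Sharing the workbook shares this script, so someone else could replay, audit,
or tweak each step.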
And one more thing that I didn't talk about: in Power Query, if you sign in
as a [indiscernible] user, then you can not only share the data that you
just processed so that [indiscernible] others or your coworkers can see it,
but you can also search for data from others. For example, I have a
[indiscernible]; I'll use his name. Like, just [indiscernible] some data
that he processed. This is the data that he processed before. So I can see
all of that data, and if I double-click, I can bring up Power Query -- you
cannot see it here, but I can see all the steps of how he processed the
data, and [indiscernible] reuse or modify them for my future use.
Okay. Thank you.
[Applause]
>> Vani Mandava: Thank you, Jeff.
>>: [Indiscernible] in the back of the room and we will have our wrap-up session [indiscernible].