>> Wenming Ye: Hello. My name is Wenming Ye. I'm a program manager here at Microsoft
Research. Today, we've invited Travis Oliphant, and he is a Founder and Chairman of
NumFOCUS, which is an umbrella nonprofit foundation for a growing set of open-source
Python software for numerics and analytics, and he is also
the author of NumPy and SciPy. He has 15 years of experience working with scientific
computing and Python. He has also just founded a company called Continuum Analytics,
where he is the CEO, and he focuses on data analytics tools and services with
Python. And welcome, Travis.
>> Travis Oliphant: Thank you, Wenming. Appreciate it. It's a real treat to be here. Glad
you're here, those in the room and those who are joining us remotely and time lagged, as well, on
the DVR, I guess. I may be speaking to you in the future. Happy to be here to talk basically
about the role that Python has to play in "big data" analytics, and I put it in quotes intentionally,
because it is a term that's used, and so we have to use it, but like most of you here, I know there's
a lot of hype around it and some truth. So we're not going to explore that entirely. We'll cover it
a little bit, but mostly I'm here to talk about where I think Python is going to play a major role
and kind of some of the tools we're building to try to make that happen. First, just a little bit
more about me to connect -- maybe some of you have similar backgrounds. I'm really a scientist
at heart. I started a master's degree in electrical and computer engineering. I studied satellite
remote sensing. Here is just an image of a satellite with big beams, returning scatterometry data
over the Earth's oceans, and from that you can find the wind speed. That was surprising to me at
the time. It probably isn't surprising to many of you that the waves, they adjust based on the
wind speed, and then the scatter that comes back to the satellite changes. But it's sort of the first
example and my first taste of big data problems, and back in those days, it was a tape, and we
had a big tape coming from JPL, and we just plugged it into this reader, and I had some Perl
scripts that drove MATLAB codes. It was the first and only time I wrote Perl. I have an
anecdote about Perl and why I love Python so much, because the Perl scripts I wrote I didn't
understand after just six months, going back to the same codes. I had no idea what I'd done, sort
of write once, read never. My background really is science. This is the kind of image you can
produce based on those satellites' data. You have just global Earth wind speed images. This
same satellite could also measure ice over the Antarctic. It's just a scatterometer, backscatter,
and you just adjust what you're inverting for and you can get different images. So really enjoyed
that. I stayed for graduate work, did the PhD at the Mayo Clinic, and there I studied wave
problems again, but now these are waves of a different sort, vibrating inside the human body,
and with both MRI and ultrasound, you could take a picture, a snapshot, and get the full three-dimensional -- actually, five-dimensional, if you think of time as one dimension and then the
polarization as another. I can get five-dimensional data inside the body of a wave propagating,
maybe at 1,000 hertz or 400 to 1,000 hertz. So just attach a speaker to somebody, push and
wiggle them, and you can see this wave propagating. It was fantastic. And my job was to
actually do the inversion. I was trying to estimate -- I put that up in front of people at Mayo
Clinic and they get scared of all the indexes. I'm sure here people aren't scared of some of those
indexes. That image is here credit to the folks there at the Mayo Clinic, but this is the kind of
data I dealt with, and then I would have to essentially get this five-dimensional data set, and I
started to want to find derivatives of those data sets, so how to do that. And that's really what led
me to Python. I was using MATLAB at the time, and I really liked the way MATLAB let me
think at a high level. I didn't have to become a programmer. At the time, I had already done C, I
had done Pascal. I did know how to program, but when I'm thinking about science, I didn't want
to have to be thinking about pointers and abstract objects. I wanted to think about science and
not get my brain full of other things.
Most things about software engineering are really about neuroscience. They're really about us as
humans and how little we can keep in our head or how much we can see on the screen. One day,
I think there will be a department of neuroscience, computer science, so there will be actually
programs where people study software engineering as a social or as a neuroscientific enterprise,
understanding what it is about languages and why are some popular versus others not. It has to
do with short-term memory and how much you have to keep in your head, versus how much you
can understand.
At the time, I wanted to think about the science I was doing, trying to invert that wave equation.
I didn't want to think about all of the memory pointer chasing. And so I loved MATLAB, but it
was not -- it wasn't hanging together with the big data I had. I had five-dimensional data, it was
filling up RAM, filling up disk. I found Python, and so I started to use Python at that point back
in '97, '98, and I haven't really turned back, so that was 15 years ago. At the time, Numeric
existed, and it allowed me to do those same kind of high-level operations in a scripting language
or a high-level language. So over the years, though, all the work -- I ended up doing a lot of
work on just improving the ecosystem of tools around Python, to make it more accessible to even
more people who weren't as familiar with programming as perhaps I had been. So I ended up
developing, starting the SciPy project, spent a lot of time developing that project, in the process
realized that the Numeric array package had to be merged with Numarray, there needed to be some changes
made, so I wrote NumPy in 2006. All that work essentially turned me into a software developer.
I still feel like I'm a scientist at heart. I was just at Los Alamos last week, and I always feel at
home. I go back there and there are FORTRAN compilers and they're doing big MPI parallel
runs on their supercomputers, and I just feel right at home. I love the conversations. But I can
sometimes speak to the software developers among us, as well, although I've been playing catch
up a little bit most of my career as a software developer, trying to understand all the things you
all learned in computer science, while I was busy studying electromagnetics and MRI.
Continuum started as a company -- Peter Wang and I founded the company in January of 2012
basically after watching NumPy and SciPy at a lot of large organizations. Many large
organizations in oil and gas, Wall Street, engineers, at places like Procter & Gamble, Johnson &
Johnson, and seeing how they were using it and realizing a lot of the same reasons were the
reasons I used it, but they were running into trouble, too. They were running into trouble as their
data sets grew, their volumes grew, and there were these other initiatives taking place outside in
the big data world. And we said, you know, there's a classic Strata conference -- Peter Wang
went to Strata, and at the Strata conference, Hadoop was everywhere, and everyone was talking
about Hadoop. And there was almost no mention of Python at all, in the whole conference, at
least visibly, publicly. However, you go to the actual sessions where people are talking about
what they do with those data, and every one of them was using Python as kind of the end result,
as the back end, in-memory data analytics they were doing. And we thought, at the time -- I
realized I'd watched people use NumPy, realized where it was falling short, changes that needed
to be made, and thought we can make some fundamental changes to the way people are using
array-oriented computing and actually map that to big data problems. We can actually do the
same kind of things that people are doing, MapReduce or large-scale operations for, so we had
this vision, had this idea, and started the company to really basically bring NumPy and SciPy to
big data, is one way to think about it.
Now, it's evolved since then. I'll just give a brief shot of the team as it's grown since then. We're
now at about 34 people. Developers and scientists, that's one of the things we love to do, is get
people who not just only have a developer background but have a scientific background or an
engineering background or a domain expert background. You'll hear me use the term domain
expert. You use that here, I know. SME, subject matter expert, these are the terms that
computer scientists applied to people like me 10 years ago, when I was more doing science. Our
big picture is to really empower people like I used to be, or am -- would still love to think I am
sometimes, a scientist, somebody who has a real problem to solve. They don't want to spend
their time thinking about software and development and pointers and development environments.
They want to spend their time thinking about math and science and big problems they're trying to
solve, but they have to use computers. It's a big part of their problem, and so how do you build a
platform or an experience for them that lets you take their expertise and move it to the big data
that's available and then the hardware that's available? The other fascinating thing that's emerged
over the past 15 years is hardware has gotten more parallel, more dense, and we're not taking
advantage of it very well. Even at a time when that kind of hardware could absolutely help our
scientists and our experts, we're still not taking advantage of it very well. I know I don't have to
tell this crowd. You guys could probably teach me a thing or two about what you're doing to
make that possible, but our goal is to build a platform that makes this happen. So part of that is
we're big backers of NumFOCUS. At the time we started Continuum, I also, with a bunch of
other open-source participants, people like Fernando Perez of IPython, the late John Hunter of
Matplotlib, Perry Greenfield and Jarrod Millman -- we organized the NumFOCUS Foundation,
which is essentially a lot of this activity in the open-source world around NumPy, SciPy, had
been developed basically by grad students in their spare time and maybe an assistant professor or
two who were willing to give up their academic career in order to promote open-source
software and tools. But this has been happening kind of under those individuals and as a very
community-driven, organic thing. We decided it was really valuable to have an organization, and we
applied and got 501(c)(3) status for the organization, so we're a public charity whose whole
purpose is to promote reproducible computing and accessible science, and we do other things
like we have a technical fellowship and we have a -- really to kind of promote the grad students
that are going to make the next generation of great tools and also to promote women in science
and technology, make it more diverse. So NumFOCUS supports these tools, and we're big backers
of NumFOCUS. So definitely check it out. I'm giving a talk later tonight, a PyData talk.
NumFOCUS, one of the things we do is promote the PyData Conference Series, and all of the
proceeds for PyData go to NumFOCUS, so we try to get people to sponsor PyData, to sponsor
NumFOCUS, and all the proceeds go to building the tools and making them better. Now, as a
company, Continuum, you'll see me talk today about a lot of stuff, and it's all open source, what
I'm going to talk about today. Well, there's a tiny slide that talks about some things we sell. We
do a lot of open source, but we have to pay the bills. What we do as a company is products, and
we'll see a couple of examples of that, but not much. I won't go into much detail about those.
We do training and then support and consulting around the Python for Science and Analytics and
Technical Computing. A little bit about me, a little bit to how I got to where I am here. I want to
talk about big data now and a little bit about the hype associated with big data. This is kind of a
curve. It's from Gartner, actually, in July of 2012. Many may have seen it. It shows the hype
cycle and kind of where some of these big data technologies are -- artist's rendition of what a
hype cycle looks like. You've got the technology trigger, the peak of inflated expectations, the
trough of disillusionment, the slope of enlightenment and the plateau of productivity. And kind
of depending on what you're talking about, you're sort of all over the place there. So there's
certainly a lot of that around big data. In fact, there's a great blog post -- I don't think I included
it here. I did, later. I'll come to it later, people talking about sometimes people hear all the hype
and just think they need to use big data tools when they don't. They really can get away with the
traditional tools they're used to, but everyone is trying to do it right or do it the way the other guy
is doing it.
As you all know, there's a lot of misinformation, and a lot of what you have to do as a software
developer is try to educate and help people understand what they can do and what's available.
But one thing that is not a hype and what is happening is there is this collision course between
sort of the traditional HPC high-performance supercomputing world and then the big data
business analytics world, where people from those communities are wanting access to advanced
analytics. That's the term they use. They want advanced analytics, and they know that's
essentially linear algebra at scale, linear algebra across multiple machines. So there's this
collision happening, and it's interesting to watch sort of as that happens and technologies get
absorbed and not and confusion reigns most of the time, what's happening? Because a lot of the
problems they're solving have been solved at the big data centers before, but under different
circumstances. Maybe HPC, high-performance computing, whereas a lot of the big data
discussion is around high-scalability computing. It's about fault tolerance. It's about can one
machine disappear and the other one show up, whereas HPC supercomputer centers have always
been about, no, if it goes down, it goes down. We're going to make this thing work. It's going to
be stable and therefore millions of dollars. But there's an emergence happening of tools, and our
belief, our strong belief, is that Python actually plays a strong role in the bridge between these
two worlds. We're part of an XDATA program that DARPA is sponsoring, and actually the role
we're playing in that organization, or in that program, which is a collection of 24 different groups
around the country, all doing big data, some MPI based and some Hadoop based or Spark, really,
or Shark or MapReduce based. We're kind of bridging the gap between those and being the
Python story in that space. So I know what I'm talking about a bit here, but I'm also trying to sell
a certain story, and obviously the space of big data is big enough that a lot of stories can be told.
The story I'm telling is that Python has a big role to play as unifying a lot of these technologies.
So why Python? So a couple of slides here, talking to -- maybe some are familiar with Python,
maybe some aren't. Python, the biggest reason is really the reason I -- think of my story. I was a
scientist, a graduate student. I wanted to solve problems. I didn't want to pull out my C
programming experience and be a developer. I went to Python because it was easy to learn, the
same reason I went to MATLAB originally. Easy to learn, had a lot of libraries associated with
it. So domain experts can learn it, yet at the same time -- and Python has this where other
domain-specific languages don't -- it's powerful enough for software developers to actually
build systems. So Python is this very interesting place where domain experts and software
developers kind of merge and come together in a very productive and useful, collaborative way.
I've seen firsthand examples of that, both me interacting with the software devs of the Python
story. I'm actually still a Python committer. And kind of changes they would make to satisfy the
needs of scientists and watching that kind of emerge over the past 15 years has been really
inspiring. The other aspect, though, of Python, you could say that of several other languages.
Other languages perhaps could also fill that role. Python sort of gained critical mass. It's got a
mature library, an ecosystem that's very, very large. There's over 30,000 packages on the Python
Package Index. I'm sure it's a power law curve as far as how many of those packages are
actually something you'd care about, but it's growing hundreds and even thousands of packages
that people actually use every day. Very large community of users in any domain that you're
looking for, and I underlined syntax matters, because I say it over and over again, and it's
coupled to the first bullet point, which is it's easy for people to learn. And syntax does matter.
There's a lot of really great languages out there, and I agree, as a software developer, I think you
should learn more than one. I'm not saying everybody who's a software developer should only
learn one language. Haskell is a great language. Clojure, Lisp, these are great languages, but the
syntax is a little less accessible to the domain expert, and for that reason they kind of -- okay. I'll
hand that over to my programming buddy to make him figure that out. But Python is one that
we'll actually experiment and say, I can get my hand around this. I can do this. And it's because
it leverages their English language centers. And there are some constraints there, of course, but
it is a significant thing. So Python is being used a lot of places, not just with scientists. I thank
Charok [ph] for showing me this slide. I have other slides that show other users, and there's
users that aren't shown here that I'm very, very aware of, but you can see big names, like
YouTube -- big systems. In fact, we teach a class with Raymond Hettinger. We teach a lot of
Python classes, and one of the things you do is, you show up, you go there, and people kind of
question, well, can Python scale? It's this little scripting language. And you just have to show
them YouTube, YouTube, Dropbox. These are written in Python. These have scaled. You can
scale. As many of you know, scaling has less to do with the syntax of the language and more to
do with how you connect it, how you connect the pieces, how you actually set up the architecture
in the system. Very large organizations, certainly NASA, Google -- so none of Wall Street is on
this list, and I can walk down New York City and actually walk into most places and be
recognized, actually. It's a little bit unnerving, honestly. You walk in places and they go, oh,
yes, I know you. We're using NumPy and SciPy all over the place. Oh, really? Okay, sorry,
sometimes. But it's been really exciting to see big investment banks adopt it, 5,000 developers,
and actually those investment banks are also some of your customers. They use Windows
tremendously, hugely, huge Windows users. So two of the biggest investment banks are using
Python. JPMorgan and Bank of America have huge programs with Python. Most hedge funds,
and they don't want me to ever tell you who they are, but almost I would say 80% of the hedge
funds are using Python internally. A few interesting Python stats, also from Charok [ph], except
the second one. CPython.exe, and this sort of illustrates the power that Python has for Windows
users. It's a very, very large group of people who are using Windows and also Python. So
it's a great way to build community as a Windows platform, because a lot of Python users, it's
much different than, say, other communities that are sort of very, very Linux centric. There's a
lot of Windows Python users. 21 million downloads of just CPython.exe from Python.org. That
doesn't include all the distributors that also have Python. Enthought has a distribution, Active
State has a distribution. We have a distribution that's newer than those, but it's been around for
about a year now, called Anaconda, and it has 180,000 downloads a year at this point, even
though it's only a year old. 65% of those are Windows, so a lot of Windows folks
downloading and using Python.
So Python in science, and so my particular emphasis is on how Python tells a story about
science, and by science, I'm pretty inclusive with that term. I think about data science. I think
about anybody who's building models and trying to make predictions and then trying to get data,
build models, make predictions and follow up with changes to those models. That's a lot of
people, actually. I say Python is the language of science, and a lot of people back me up there,
nowadays. Lots of R users might disagree. There are a lot of folks using R -- what I like to
say, though, is that R sort of has the ear of the statistics department, as well as the scientists
whose analytics is done by somebody they grab from the statistics department. So some of the
biology scientists and so forth -- a lot of attention to R in that group.
Python has the attention of all the other departments in the university, so physics, engineering,
computer science, all of those folks are using Python. The IPython Notebook, which you all
know here and are using it productively, it has really taken off in the past year, year and a half, as
just a tool for reporting, showing, talking about your scientific work, no matter what your
language is, actually, and there's R hooks for IPython Notebook. Then the other new
development over the past two or three years is Pandas, and Pandas is a library built on top of
NumPy that makes data processing more accessible. I'll talk a little bit about it later, and it's
even started to convert some R users to Python, so I actually had a lot of conversations with
the R developers over the past 10 years, and it's interesting. Some of them have come and said,
we need to get people off of R, because R is just not a language that can scale very well. It was a
really nice research tool, but then people are using it way past that cycle, and some of them quite
adamant, how about we just get people using Python instead. Of course, that's easier said than
done, obviously. People invest a lot of effort in their scripts, but it is interesting to see, there are
a lot of people who recognize the benefits of having a language that's a general purpose language
that can actually grow, in an open-source community, beyond sort of only a DSL. For those who
haven't seen Python or NumPy, we can actually take questions. Does that work for the video
recording, if we take questions? Happy to answer them, actually.
>>: Just wanted to ask how you saw MATLAB fitting in?
>> Travis Oliphant: Sure, sure.
>> Wenming Ye: Travis, can you repeat the questions.
>> Travis Oliphant: Yes, thank you. Appreciate the reminder. The question was, how do I see
MATLAB fitting into Python and science? So MATLAB has a strong story here to play, as well.
A lot of user base, a lot of folks using MATLAB. There is a strong migration from MATLAB
happening right now, especially among sort of non-Simulink users. MATLAB still has a very
strong, and sort of the only, story when it comes to embedded digital signal processing and
embedded systems. They have a very, very nice product called Simulink. And then, there's a lot
of users of MATLAB, so my perspective, of course, is biased, but everyone I talk to is just
migrating from MATLAB. A lot of reasons for that. MATLAB is still a great set of libraries. In
fact, I've been talking to the MathWorks about just supporting Python and selling a library
product into the Python ecosystem. I think they'd do very well, still, and there's some movement
in that direction, actually. So I think MATLAB is going to be around for a while, just like SAS
is going to be around for a while, just like SPSS. There's a lot of tools in this space, but in terms
of default, this is actually the way science is published, five years from now, I see a lot of Python
and a lot less MATLAB, but great question. It's hard to predict the future, of course, and so it
really comes down to what people can use and what's accessible. It might take longer than five
years, because actually, it really takes as long as the professors and what they learned and what
they're used to. It really is. People don't change much, fast paced. Once they've learned a
language, once they've learned a way of doing things, it doesn't change much. So young people
are all using MATLAB. Python, some of their professors are still using MATLAB. Examples.
If you haven't seen Python, sort of how it works, I borrowed this. There's two links there,
actually, and I was going to go to them, but I think I won't for lack of time. If you see these later,
you can go to those links. It just illustrates some syntax of how you do certain operations with
Pandas and NumPy. Here's Pandas. It's basically based on a -- babynames is a data frame
collected from a whole directory of all the baby names that were listed in every year from 1880
to 1990, and it has the name, the gender and the number of names, number of people named that,
of that gender that year. Sort of collect all those into something called babynames, and then you
want to basically add a column to the data frame that is a probability, a frequency-generated
probability of how many were named this, what's the percentage of people that were named this
in this year? And so this is how you group by: babynames grouped by the year and the sex,
apply this function to that group-by result, and the function is pretty straightforward.
It just is a simple array-oriented divide, so dcount is a whole column of data, and if you divide
that whole column by a single float, which is the sum of the result, what you're building is
another vector of data that is that list of probabilities. Here's an example of NumPy usage. This
is a very simple example of just getting linearly spaced data, 20 data points, then calculating the
sine, and then maybe another 500 data points, because I only have 20 samples, and then I want to
interpolate to 500 samples, so you pull interp1d out of scipy.interpolate, do a cubic
interpolation, and that returns you a function. You then call it on the new data set, X, and
you get back samples on that interpolated grid. And then here's a plot command, and the thing
I'll point out here is this is how you select out just the positive numbers. So this particular plot
will show a sine wave and dots where the sine wave is positive. This other code here is actually
the Game of Life, implemented in NumPy. I talk often about array-oriented computing, to try to
illustrate to people how when you think about things as an array, it often simplifies the code, and
the second corollary, which we're still working on making true, is you can speed up the
code. Well, you certainly can speed it up with NumPy, but you can even, once you get a
compiler on top of that, create more optimized code, because you have a lot more information
when somebody hasn't created the for loops. They've just given the expression they want passed,
and many of you here are aware of those abilities and techniques, generally. But it illustrates
that with NumPy and Pandas, you can write high-level code, do high-level things with very little
code and quickly, and it operates -- NumPy is really a library of pre-allocated, pre-compiled
loops that do it all in vectorized form, very similar to MATLAB, actually. It gives you much the
same result.
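[Editor's note: the three snippets described above can be sketched roughly as follows. The tiny babynames sample, the column names (births, prob), and the wrap-around boundaries in the Game of Life step are illustrative assumptions, not the actual code on the slides.]

```python
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

# Pandas: add a per-year, per-sex frequency column with groupby/apply.
# This tiny DataFrame stands in for the full babynames data set.
babynames = pd.DataFrame({
    "year":   [1880, 1880, 1880, 1881],
    "sex":    ["F", "F", "M", "F"],
    "name":   ["Mary", "Anna", "John", "Mary"],
    "births": [7065, 2604, 9655, 6919],
})

def add_prob(group):
    # Array-oriented divide: a whole column divided by a single float
    group["prob"] = group["births"] / float(group["births"].sum())
    return group

babynames = babynames.groupby(["year", "sex"], group_keys=False).apply(add_prob)

# NumPy/SciPy: 20 coarse samples of a sine, interpolated to 500 points
x = np.linspace(0, 2 * np.pi, 20)
y = np.sin(x)
f = interp1d(x, y, kind="cubic")        # returns a callable
xnew = np.linspace(0, 2 * np.pi, 500)
ynew = f(xnew)
positive = ynew > 0                     # boolean mask selecting the positive values
# e.g. plt.plot(xnew, ynew); plt.plot(xnew[positive], ynew[positive], ".")

# Game of Life, one generation, with no explicit Python loops over cells
# (wrap-around boundaries via np.roll, chosen here for brevity)
def life_step(grid):
    # Count the 8 neighbors by summing shifted copies of the boolean grid
    neighbors = sum(np.roll(np.roll(grid, i, axis=0), j, axis=1)
                    for i in (-1, 0, 1) for j in (-1, 0, 1)
                    if (i, j) != (0, 0))
    # A cell is alive next step if it has 3 neighbors, or is alive with 2
    return (neighbors == 3) | (grid & (neighbors == 2))
```

The point of all three is the same: the loops live inside the library, and the user writes one expression per whole array or column.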
Now, in the big data space, I call it the problem of Hadoop. I'll show my biases here just a little
bit. Hadoop definitely has some -- it definitely has some positive things, but in the current hype
cycle and the amount of press it gets, it's sort of far oversold based on what it can do versus what
people think it should be doing. I hear a lot that Hadoop wants to be the OS for big data. I'm not
even sure what that means, actually, unless all of our OSs are going to be JVM based. But the
part I know quite a bit about is that advanced analytics and Hadoop don't blend very well. A lot
of people just count stuff with Hadoop, and they're trying to add advanced analytics, and it's a lot
of work. I think there's better solutions. I think there's a better approach. And then what's
happening right now is a lot of people are using Hadoop and they don't need it, because they're
led by, well, that's what everyone says I need, and so I use Hadoop to do data, and they have 600
megabytes of data. There's a blog post here that was just recently published, and they got a lot of
hits, and I've seen this also in practice -- a lot of people don't know, and someone tells them, and
so they go, I've got big data. I've got a gigabyte of data. What am I going to do? It doesn't fit in
Excel. There's a whole space of doesn't fit in Excel but I don't need Hadoop, and that's not being
communicated very well, generally, and so there's a lot of people going down incorrect roads. I
think there will be a backlash to that, and probably an inappropriate one, because Hadoop does
have uses. It does have use cases. When you have really big data that doesn't fit across a single
machine or a single disk. I still think there are other, better solutions than Hadoop, even in those
cases. If you do need Hadoop, I say give Disco a try. I've seen a lot of people use Disco very
productively. It's got a lot simpler of an interface. It's not JVM centric. The fact that it's written
in Erlang is really hidden from you. It's not really front and center. You can write MapReducers
and whatever you like. It does do the MapReduce part. There's other solutions to HDFS, and
this is one thing I'm kind of interested in over the coming years, is Red Hat and Ubuntu, and I'm
sure you have a storage solution here, as well, and Amazon. They're sort of these key value
storage solutions that are emerging in the data centers, already, that really serve the same
purpose that HDFS does for the private clusters. Red Hat has GlusterFS they're promoting. I'm
actually a big fan of Ceph, CephFS from Ubuntu, it's got some really interesting technologies in
it. Swift is the OpenStack equivalent of S3, and I'd love to get more familiar with what
Windows has in this key value store, the Windows Storage Solution, Azure Storage Solution, so
I think that's a great thing to be thinking about and storing your data in. And then there's a lot
Python wrappers to HDFS, as well, in Hadoop, that can really take a lot of the pain away from
you with interfaces to MapReduce. Now, one thing Hadoop is doing well is it's mapping code to
data. Their distributed file system is connected to the scheduler, and so when you do a
MapReduce problem, it does try to move the portion of the code you want to run near to where
the data is, rather than pull the data around, which is what happens at a lot of organizations I go to -- for example,
I spend a lot of time on Wall Street, and one of the big problems I've been a part of solving is
credit risk. So people are out there, they're trying to understand, as you can imagine after the
2007, 2008 crises, what's my exposure to these companies that shouldn't be failing but actually
can or maybe do? Not even companies, but now countries, what's my exposure to Greece?
What's my exposure? These large investment banks actually do tens of thousands of trades every
single day that are over the counter, meaning there's no exchange. There's just a phone call with
a salesperson, saying, hey, I want to do a deal. And there's basic terms of those deals -- basic,
common terms. Those deals are all rolled up, and now you have this exposure, this partner,
counterparty, with whom you have a lot of deals. And you want to be able to, on a regular basis,
roll up, well, great, how much profit am I making from you, yee-haw, and then there's another
group going how much profit are we making and are we expecting them to pay us? And maybe
they won't, if they go out of business, and so what's our exposure to them? Those calculations
have to be done regularly -- in fact, as soon as possible and connected to the trader who's making
it, ideally. That's a lot of data. To really do it correctly, you've got to have all the firm's data
available, accessible and ready to go. In fact, I know how to solve that problem, basically, with a
single array-oriented solution that takes about 20 lines of code, and it's really quite simple. It's
really quite simple, if you can actually organize it all together, but they spend millions of dollars
-- actually, they can't solve it that way, because there is no -- it's about moving the data. There is
no place to store data like that and then do that expression on it. So they spend a lot of money
that effectively comes down to grabbing this encapsulation and serializing it over to that
encapsulation and this object and pulling it over to this object and this database until your head
spins and you're thinking, how is this even working? And it's very unstable, it doesn't work that
well, and so that's actually one of the motivations for some of the things we're trying to do, is
help build -- empower the domain experts to still think at a high level, but then we'll have the
system actually organize the data correctly and well. So that's -- and these folks aren't even
thinking about Hadoop. If you stick around the Silicon Valley crowd, you think Hadoop
has won and everybody's using Hadoop. You go to Wall Street, you go to the oil and gas
companies, you go to big engineering firms, they don't even really know what Hadoop is, still.
Do I care about that? And most of the time the answer is, no, you don't, because it's not going to
help you with your fundamental problems. So all of this really comes down to the idea that data
has mass. This is not new. A lot of people know this. But what are the implications of this, and
I think we're just still starting to understand the implications of this, and what does it mean for
programming paradigms and how we actually treat the way we write software. So you can't
move data around. I/O is not increasing at the same speed compute is. So that has
physical implications and limitations. Here's a perspective, I sometimes tongue-in-cheek call it
data covariance, this idea that let's stop thinking about it from the perspective of the workflow.
When I'm on the train station, watching data go by, and I'm building up my workflow and
thinking about how data is moving through me, that's the train station platform perspective. How
do I think about being on the train, being the data, and then what happens as code comes to me?
If I'm staying still and code comes to me, what does that look like? It's kind of just inverting,
flipping it around, thinking about it a little differently. What does that mean for programming,
what does that mean for compilers? What does that mean for the way I specify type systems and
concepts? I think it actually has some fundamental perspectives, and, in fact, a lot of these
perspectives are actually captured by a paper, one of whose principal authors was a Microsoft
gentleman, Jim Gray. Perhaps you're seeing this. I don't even know where Jim is. Maybe he's
still here, maybe he's not. But if he is here and he sees this, great paper. This has got to be my
favorite all-time best paper that I've seen, addressing this question of scientific data management
in the coming decades, and by scientific data management, you can just really say all technical
data management to do something useful with. But how do you put the scientist back in control
of his data? I won't go through all these quotes -- there's a bunch here, and if you see the slides
later, you can maybe read them, but I would just recommend going and getting this paper and
reading this paper. It's really phenomenal. He talks about science centers. He talks about
actually the fact that data is going to be sticky. It's going to stay where it is and people are going
to be coming to the data. That's been happening in the scientific world for a long time, but really
quite badly, when it comes down to the tools scientists have. They've got an SSH and they SSH
in and they maybe run a script, and that's the best they can do. Certainly, that would not have
made Microsoft popular as a platform, if that's the way you presented to DOS users back 30
years ago, right? Here's your prompt and just go do your word processing this way, with a single
terminal. It talks about metadata. You have to have metadata to enable access, and you have to
have metadata provenance storing all these things. It's amazing, actually, to read this paper and
realize we're just trying to catch up with him and the authors of this paper and trying to figure out
-- they've really laid the foundation already. Data independence; set-oriented data gives
parallelism. One of our main thesis concepts is actually that a lot of what scientists need to do is use
databases. Databases really are the preeminent example of data having mass: data sits inside
the database, you run stored procedures, you run SQL queries, and that actually runs processing
on the data. One challenge is that SQL really isn't powerful enough, so a lot of database
companies put their own special brand of secret sauce into stored procedures to make it
powerful enough. And how do you actually
take that idea of stored procedures and expose it to general computing?
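As a small, standard-library illustration of that idea -- general-purpose code running inside the database, next to the data -- Python's sqlite3 module lets you register a plain Python function for use in SQL queries. This is just an analogy for the stored-procedure concept being described here, not how SciDB or Postgres actually do it:

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (v REAL)")
conn.executemany("INSERT INTO readings VALUES (?)", [(1.0,), (4.0,), (9.0,)])

# Register a Python function as a SQL function, so the computation
# runs inside the query engine instead of pulling rows out first.
conn.create_function("py_sqrt", 1, math.sqrt)
roots = [row[0] for row in conn.execute("SELECT py_sqrt(v) FROM readings")]
# roots -> [1.0, 2.0, 3.0]
```

The point is the direction of travel: the query ships the computation to where the rows live, rather than serializing the rows out to the computation.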
Scientists don't use DBs mostly because of that, because their problem is they need full
programming languages. They need full power. SQL isn't enough. They need arrays. It's
actually only recently that Postgres added arrays as a fundamental type into their database. And
scientists have basically been asking, aching for an array-oriented database for years, for
decades. So there are a few now emerging, SciDB, Stonebraker's product that Paradigm4 is
promoting is the first that I've seen that actually gets closer to this, as well as what they've done
with array types in Postgres. But basically, they can't manipulate their data once they
load it, and for a scientist, that's the death knell. If I put my data somewhere, and then I can't do
what I want with it, I'm not going to put it there. Don't handcuff me. My data is everything. It's
really critical. So if I put it somewhere, I need my data. I need to be able to do whatever I want,
whatever I can possibly do with it. So how do you really provide that to folks? So if you take
this controversial view, perhaps -- I don't know that it's that controversial -- that the file
formats scientists are using, HDF, NetCDF, FITS, are nascent database systems that
provide this metadata and portability but don't sort of have the execution engine around them,
you can kind of see a fairly clear path that integrates these communities. It's a fantastic quote,
because it was written in 2005. I hadn't read it until we started Continuum, but essentially what
we're building with Blaze is exactly that. That's probably the best description of what we're
trying to do with Blaze that I've ever seen. It's a hard problem, but we're getting there. We're
making progress. I'll move on. The key question we're trying to answer is how do we move
code to data while avoiding data silos? I don't want just to tell folks, okay, great, go move your
data to this particular silo and then you're done, then we've got all the tools for you. Now, as a
platform provider, as a data center provider, maybe you do want to say that. I'm not arguing that
that's a bad business model. I think, in fact, the next decade's battle is about where people
are going to move their data and who's going to control the compute around those data. It's why
every platform provider says upload your data for free, no problem. We'll pay that bandwidth
cost, because it has real implications. Once the data is there, you're much more likely to use
compute around those data. That's going to happen, and that's great. I think who's going to win
is who provides the better tools around those data to provide scientists what they need to get their
work done, and by scientists -- I use the term scientist, but you could just replace business
intelligence user. You could replace any domain expert or advanced user who doesn't want to be
a programmer but needs access to programming-type tools to understand the data. All right,
switch gears a little bit. I've talked, so that's kind of my soapbox part of the story, I guess, and
kind of the perspective from 30,000 feet, the way I see the world, but that's led to us at
Continuum building a certain collection of tools, a certain collection of open-source tools that we
think are very valuable and are really excited about, and we've been so thrilled to be able to get
DARPA funding that's allowed us to do more of this in open source. That's why a lot of these
tools are open source and we can keep them open source and aren't just trying to keep the lights
on by selling something else.
So the five tools I'm going to talk about, and I have one here in the corner that I'm not going to
talk about, which is also interesting, I think, are Conda, Numba, Blaze, Bokeh and CDX. And
I'll basically talk about Conda, Numba and Blaze and a little bit about Bokeh and CDX, but I'll
kind of show Bokeh and CDX more. So Conda is a cross-platform package manager, with
environments. Fundamental thing about moving code to data is code is very flexible. So we do
have a story that has a particular kind of code we move to data, but Conda is about whatever
code you want. It's about whatever you've built, we can take that environment, reproduce it
reliably and repeatably on whatever platform, whatever environment you care about. It's written
in Python, but it's actually not Python-centric. It's a package manager. It can do
anything. It can do Node. It can do whatever you like. I'll talk a little bit about that. Numba is our
array-oriented Python compiler. It's about people writing expressions. The goal here, the
motivation for Numba, is really the fact that at Los Alamos and every other National Lab,
scientists are still using FORTRAN. You talked about well, how does MATLAB fit into this?
And we really should have also discussed how does FORTRAN fit into this, because if you
recall, FORTRAN was a high-level language, and still is. If you look at a vectorized FORTRAN
implementation of some of the scientific codes, I could have written the Game of Life in
vectorized FORTRAN, and it would have looked very, very similar, actually. The cool thing,
and the thing that keeps scientists still using FORTRAN is that's still how they get their fastest
code. The vectorized FORTRAN compiler still produced for them much faster code than any
other tool, so my motivation is to actually say, well, we have the same information in NumPy
expressions. We ought to be able to produce as fast a code as they're getting out of vectorized
FORTRAN and still stay in this high-level ecosystem and forget the compile step and the
iterative -- the setup step. So that's the motivation for Numba, is to take array-oriented
expressions and map them to not only CPUs, and this is where I think we can actually maybe
even beat FORTRAN in a few years, is if we take advantage of hardware faster. If by working
with NVIDIA and working with AMD and working with these companies and actually bringing
online GPUs, we can work quickly to get the compiler targeting those architectures. That's an
unanswered question, but it's one I think we can try to approach. Blaze is sort of the centerpiece,
and partly because this is the idea that really started us, trying to figure out how to help people
talk about their data, keep their data where it is. Two ways to talk about Blaze. One is, if you're
a NumPy or Pandas user, this is NumPy and Pandas for out-of-core distributed data. So if your
data is too big to fit in memory, you still want to do array-oriented calculations, that's what Blaze
is for. Now, if you're not, and you're coming more from the database
perspective, then Blaze is basically a general database execution engine that maps, through an
adapter, to an array server, so it can sit on top of any database and present a programming
environment and be able to take Python code and use that as a stored procedure for your data.
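Blaze's API was still settling at the time, but the out-of-core, array-oriented style it targets can be sketched in plain NumPy -- this is an illustration of the idea, not Blaze itself:

```python
import numpy as np

def chunked_mean(chunks):
    # Stream over chunks, accumulating sum and count, so the full
    # dataset never has to fit in memory at once.
    total = 0.0
    count = 0
    for c in chunks:
        total += c.sum()
        count += c.size
    return total / count

# The chunks could come from np.memmap slices, HDF5 datasets, a
# database cursor, etc. -- here a generator stands in for them.
chunks = (np.arange(i, i + 1000, dtype=float) for i in range(0, 10000, 1000))
result = chunked_mean(chunks)  # same as np.arange(10000).mean() -> 4999.5
```

A system like Blaze aims to let you write the array expression once and have this chunking, and the dispatch to wherever the data lives, handled for you.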
Bokeh is our browser-based interactive visualization tool for Python users. It's similar to D3.
We get a lot of questions about why not D3. There's reasons for that, but one of the fundamental
reasons is we want Python users to be first-class citizens in web interactive visualization. We
don't want to force you to have to be a JavaScript developer. A lot of reasons for that. I mean,
one of the reasons for the popularity of Node, obviously, is a lot of people use JavaScript, and if
they're using JavaScript on the front end, they want to use JavaScript on the back end, kind of
have this single -- well, the opposite effect is true, also. You're a scientist using Python on the
back end, you want to use Python on the front end, too. Now, we're not going to be able to
convince all the web browsers to use Python, and we don't really need to, actually, because you
can actually generate that. You can generate the JavaScript necessary to actually do the
interactive visualizations but let the developer not have to worry about that, much the same way
that most developers don't worry about the fact that Qt or Windows Presentation Foundation is C++
code, they still do it in Python and the code is generated for them that's needed to do that
binding. You can do the same thing in a browser. And then last one is a Continuum Data
Explorer, which kind of emerged from the tools we were writing as part of the XDATA program.
In the corner, here, is a thing I'm pretty excited about. This is in the same spirit of Bokeh as
interactive visualization in the browser. It's about building apps in the browser -- letting people
build full scientific data apps backed by a technical workflow, but only having to think about that
technical workflow, and maybe a little bit about the DOM elements that
you're updating, the document object model elements you're updating, you just write all that in
Python and the web app is generated for you automatically. That's a little project called Ashiba
that's just about ready to be released as open source. We've got some -- it's still very nascent,
very new, but I'm pretty excited about it.
Kind of all of these fit together as a single coherent story, actually, as our emerging platform, which we
call it a rapid app platform for subject matter experts or for domain experts. It's to enable people
like me, as a young scientist, people like my friends who are still scientists, to be able to build
full-scale apps quickly so that their brain can focus on the research, the science, and they don't
have to then go through the long process of translating that to even just get a demo up, just to get
something that shows what they want to do.
That platform, how do you build that? And we certainly have had that on the desktop for a long
time. It has been trivially easy to build desktop apps in Python with these kinds of tools. We want to
make that as trivially easy to do it in the cloud, in the data center, in the web browser as the app
tool. So that's -- it's got multiple components. Wakari, I'll show a little example of Wakari as a
data analytics engine, or excuse me, as the web browser component and the infrastructure on the
front end. Anaconda is our distribution of Python on the desktop and also on the server side.
Binstar I haven't talked about -- I'll briefly mention it. It's basically our artifact repository for binaries
so that you can easily update your Conda environments.
Okay, so I can tell that I'm going to run over time here, so I'm going to have to speed up a little
bit, but I think I've kind of set the stage exactly as I wanted to, and now I'll just talk about some
of the technologies and happy to answer questions or, especially afterwards, too, we'll have 20
minutes to talk after the talk. So Conda is our package manager, and it solves a fundamental
problem that we see everywhere we go. It's the tension between the developers, who are the people that
actually make things happen in an organization -- these are the quants at a Wall Street bank or a hedge
fund. These are the geophysicists at an oil and gas company. They're the engineers at Procter &
Gamble or Johnson & Johnson or at an aerospace company. They actually make things that
make the company work, and what they want is access to the latest and greatest. They love
Python, because it's full of a community of people that are active and do stuff, and that means
there are new packages and there's new versions of those packages and there's new things
coming out every single day, and they want to use the latest and the best in their next project.
And then, of course, you've got the people that have to produce this in production. They
typically call those IT guys or information technology. They have to reproduce this. They want
it to work and be repeatable. There's a natural tension that builds between those folks, and I
think that there's just a lack of tools that have let those folks cooperate more easily, and so Conda
exists as a tool to help bridge this gap between the people that want rapid development and the
people that want stability and reproducibility. So Conda is full package management. It's like
yum or apt-get for Linux, but it's cross-platform, works the same on Windows, Linux or OS X.
It also has this one thing that yum doesn't really do and should, actually, which is this control
over environments. In the Linux world, things like Docker.io and other kinds of lightweight
virtualization are sort of doing the same thing, but Conda essentially gives you lightweight
virtual environments easily. You can build one, build an app, have it centered in that environment
and you know, even though somebody in another environment can install a new version of
NumPy or a new version of scikit-learn, that's not going to affect your app. Your app still works,
it still happens. Most of the battles that happen in the IT organization -- it's really remarkable -- they're battling over, well, I need this version of NumPy for this to work, I need that version for this to work.
The other way they solve this is to actually collect all of it together into a single binary and then
you have 15 versions of Python. You don't do any sharing at all. Of course, this is the same
DLL problem that Microsoft has dealt with for many, many years. But in Python, we can do
some very interesting things. Conda's architected to be able to manage any packages. It could
manage R, Scala, Clojure, whatever. Obviously, those have their own packaging worlds. We're
not going to try to take over packaging. We're just trying to make it easy to build packages for
them. It uses a SAT solver, a satisfiability solver, to manage dependencies, and has user-definable repositories.
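As a toy illustration of why dependency resolution is a satisfiability problem -- Conda uses a real SAT solver; this brute-force search, with made-up package constraints, just shows the shape of the problem:

```python
from itertools import product

# Pick exactly one version per package so every constraint holds.
# The packages and constraints here are invented for illustration.
versions = {
    'numpy': ['1.6', '1.7'],
    'scipy': ['0.11', '0.12'],
    'pandas': ['0.12'],
}
constraints = [
    # scipy 0.12 would require numpy 1.7
    lambda pick: pick['scipy'] != '0.12' or pick['numpy'] == '1.7',
    # the app needs pandas 0.12, which needs numpy 1.7
    lambda pick: pick['pandas'] == '0.12' and pick['numpy'] == '1.7',
]

def solve():
    names = list(versions)
    for combo in product(*(versions[n] for n in names)):
        pick = dict(zip(names, combo))
        if all(c(pick) for c in constraints):
            return pick
    return None

solution = solve()  # a consistent assignment, here forcing numpy 1.7
```

Real solvers handle thousands of packages where this brute force would explode, which is why encoding the problem for a SAT solver pays off.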
It's really quite a mature product at this point, and we've even gotten some people that are
starting to use it for their own distributions. Pyzo is another distribution from German folks, and
they're using Conda and think it's great. Conda is associated with an online service called
Binstar. Binstar, basically, lets you build a Conda
package from a recipe. We have a lot of recipes up there on GitHub, publicly available. And
then you upload the recipes to Binstar. Presently, you upload a built package, but we're in the
middle of actually writing a build server, so that once a recipe is written, it can actually build a
package for Windows, Mac and Linux for you and then be available on Binstar for anybody to
download. And Binstar becomes a place you can actually have trusted repositories. Anaconda
will be a trusted repository. Any other organization could build their version of the distribution
they think they want to have somebody trust. And you can add multiple repositories to your
configuration file.
So it's all about connecting people to their packages and making sure that systems stay
configured. I'm sure I could talk to people here who have done similar kinds of things in
probably even much more sophisticated ways, but this is meant to be a free service to all the
open-source projects that are out there, as well as deeply connected with Conda and the
environment notion that we have. So I believe we actually solved the packaging distribution
problem. Python is getting better. Python has been a mess for a while. It's getting better, and I
see a lot of work in this direction, and probably a year from now, they're going to be where
Conda is today, maybe even less time, but maybe even more time, too. I talk to these folks.
I know people in the Python world. Mostly, they have a different set of use cases, and in fact,
during one conversation with Guido -- because we've been lamenting over the fact that distutils
didn't solve our problem, packaging doesn't solve our problem -- he basically said go write your
own. Don't wait for us to do it. Just go do it. And so we took him on his word and we went and
wrote our own. That does mean, though, there's Conda, there's Pip, okay, which one should I
use? And the answer is, use what works. It doesn't matter. You can use Pip inside of a Conda
environment. It's not either-or, it's about use what works and what helps you manage your pain.
So that's Conda. Happy to answer questions about that later. Anaconda is just a collection of
packages. It does have a single-click installer, very popular on Windows. A lot of our users of
Anaconda are Windows users who go to our page and single-click install Anaconda and get a
collection of all these packages they need for their distribution.
This is the one slide I do talk about some things we sell. We do sell some add-ons to Anaconda,
which are proprietary, and one is we take the compiler we have for CPUs and we actually
target GPUs. So Accelerate will actually take Python code and run it on the GPU. There are
some examples. It's the easiest way to program a GPU. It's awesome. I've programmed GPUs
the hard way, and you still have to sometimes. If you want to get everything for like a big matrix
multiply problem, but I'm super excited about Anaconda Accelerate and Numba Pro inside of it,
which targets the GPU with Python code.
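The decorator-driven GPU targeting described here looked roughly like the following in the NumbaPro of that era. The exact names and the target string are my best recollection, so treat them as assumptions; the sketch falls back to NumPy's pure-Python vectorize so it runs anywhere:

```python
import numpy as np

try:
    # NumbaPro-era API (the proprietary add-on being described);
    # the 'gpu' target compiled the kernel for CUDA devices.
    from numbapro import vectorize

    @vectorize(['float64(float64, float64)'], target='gpu')
    def scaled_add(a, x):
        return 2.0 * a + x
except ImportError:
    # Fallback so the example still works without NumbaPro installed.
    scaled_add = np.vectorize(lambda a, x: 2.0 * a + x)

result = scaled_add(np.ones(4), np.zeros(4))  # elementwise 2*a + x
```

Either way, the user writes a scalar expression in Python and the system broadcasts it over arrays -- the GPU version just changes where that broadcast executes.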
So IOPro, we basically sell speed and we sell connectivity, so IOPro is about connecting to your
Vertica database quickly. It's about -- most typical connectors to ODBC databases, they'll bring
everything into Python objects and then convert it to a data frame or a NumPy array, so you have
a lot more memory use and it's slower, so IOPro bypasses that, makes a very, very fast
connection to in-memory data structures that are going to be valuable for science. And then MKL
Optimizations, we just link against the MKL library for NumPy, SciPy and the rest, so we do
sell that.
Anaconda also comes with a very interesting thing called Launcher, so everybody who
downloads Anaconda has this one place they can go, and it will show them -- essentially, it will
show them all the packages. Every launcher points to one or more Conda repositories, and those
repositories, each package in the repository can have an entry point and an icon, and if it does, it
will show up on this launcher. So even if you don't have it installed, it will show you, hey, this
can be installed by just clicking this button, and now you have it on your desktop, as well.
>>: So what's missing on this slide?
>> Travis Oliphant: Python Tools for Visual Studio is missing from this slide. Totally agree.
This has been a project we've been trying to work on, and coming in here will certainly help
motivate to get this working within the next -- hopefully, not more than a few weeks.
>>: Just a related question. What do most people in the community currently use as their
development environment?
>> Travis Oliphant: That's a great question. It's pretty diverse, actually. Spyder is one that is
quite popular. Compared to Python Tools for Visual Studio, it's not close, right? It's okay. A lot
of people are going to IPython Notebook, recently. Though it's different, it's not apples and
apples, it's apples and oranges. So a lot of people prefer to have an IDE still. A lot of Wing.
But again, a lot of Python users use Wing, but scientific users won't, because it doesn't embed
an IPython console. It doesn't have a console there that's IPython aware. So there's a few new
ones. Ninja IDE has come up. Enthought has a tool called Canopy that was recently released
this spring and it's getting usage.
>>: There's also PyCharm and PyDev.
>> Travis Oliphant: PyCharm and PyDev. And actually, PyDev is very, very popular,
especially in industry, and PyCharm is another one from JetBrains.
>>: So not so much Eclipse?
>> Travis Oliphant: Well, PyDev is an Eclipse plug-in.
>>: But both PyDev and Spyder now are like one-man projects that are not very, very active. I
mean, somebody's maintaining it, but not the way we're pouring calories into PTVS.
>> Travis Oliphant: No, PTVS has become -- I think -- well, I don't think as many people are
aware of it. And as they become aware of it, they are very excited. So we're very excited,
actually, about trying to promote PTVS to the Python community. I think you'll get a lot of
people very excited about it. So that's kind of our distribution story. A lot of bringing code to
data is honestly just nuts and bolts of packaging and distribution and kind of boring things like
that -- yet it's what scientists care about. In fact, when I wrote SciPy, most of the effort was actually
building the Windows installer, which was also most of the benefit, right, because once you have
a Windows installer for SciPy, then everybody used it. But before you did, you got 10 percent of the
people using it. Windows is a very popular platform for scientists.
>>: Can we quote you on that? No, seriously.
>> Travis Oliphant: I'm happy to have you quote me, and particularly scientists in industry, who
are very quiet. That's the other thing about scientists in industry, they don't go to conferences.
They don't stand up and talk about what they're doing, but you go and you look and you see,
wow, there's 5,000 developers who are all deeply embedded in Windows and have big
applications written on Windows. One project you definitely should be aware of is enaml,
nucleic/enaml, which is currently the way JPMorgan writes all of their GUIs, and it's a very
simple QML-inspired declarative syntax for writing GUIs with Python, very easy to use. I have
some anecdotes to share later if you're interested.
>>: And the good thing is, like, they often have money and are willing to pay.
>> Travis Oliphant: Correct, correct. Yes, exactly. All right, so next, shifting gears a little bit
now to kind of some of the other technology we're building, and I'll only have time to talk briefly
about it, and you can go online and learn a lot more about Numba. Numba is actually fairly
mature at this point, even though it's pre-1.0 and will be for a few more months, because we're
re-architecting it. There's a next-generation rewrite. Well, essentially, what happens is you write
Numba and you realize we've actually written a little language here and we should just formally
specify the little sublanguage and actually have that be -- essentially, the way I see it, it's
Julia in Python. It's sort of the same ideas that Julia's promoting, but it's just in Python and
connected to Python very deeply. Numba was all about getting the low-hanging fruit that was
out there, and there's a ton of low-hanging fruit for scientists who are using NumPy, therefore
using typed arrays, and then starting to write for loops over NumPy arrays in Python. It
was really, really slow. And with the LLVM project, it wasn't that difficult to create essentially a
source-to-source translator from a certain class of Python problems, Python use cases,
particularly ones where they're using NumPy arrays or other typed containers and create fast
code for it. It was not meant to be a JIT, directly. In fact, we do use the term JIT, but technically
it's closer to an import-time compiler or something like that. I guess it's not a tracing JIT. It
doesn't watch your code and then try to speed it up. Actually, you just say which code do you
want to compile and which do you not want to compile -- very explicit about it. So taking those
high-level arrays and typed containers and creating fast code has always been
possible, but before LLVM existed, you would have to do a lot of the code generation yourself,
and like I like to say, writing a compiler is easy if you don't have to worry about the parser or
the code generator. It's what we had here, basically. We didn't have to worry about either one,
so it was really nice. Numba comes from NumPy and mamba -- a mamba, as in black mamba, is fast.
A little corny, but it's easy to say and easy to remember, so it kind of stuck. It's built on top of
the LLVM library, like I said. That's really where it gets its code generation from. We leverage
it heavily and so the new versions that come out, and it's also why we can target the GPU,
because essentially, NVIDIA, their whole compiler chain, NVCC, actually uses LLVM to
generate PTX. That's what they're using. Apple, of course, has put a lot of money into
CLANG, and it still works on Windows, which is great. Enough people have put effort into
making CLANG work on Windows that LLVM works on Windows, and you can build Windows
binaries, machine code, from LLVM, as well.
I just heard AMD also has a big effort in heterogeneous computing, and they have a translator
from LLVM IR to their intermediate language, compiler chain. It's basically the equivalent of
their PTX, if you know anything about GPUs. And then ARM support is embedded in LLVM,
as well, so what I like to say, it's a great cooperation venue for hardware vendors. I think it's
phenomenal this has emerged. It should have emerged a long time ago but didn't because people
basically used C as that intermediate representation. Every person who wrote a piece of
hardware wrote a C compiler, and then C became this -- if you could write C language, that
became the way we cooperated, even though C was certainly not designed for that purpose. It
just sort of fell into that use case.
LLVM IR is designed for that purpose [indiscernible], but certainly as a portable assembly it
makes a lot of sense, and it's a way to have this really nice separation of concerns that the
hardware vendors optimize their platform and let software writers target it and then have really
true, open standards. I tend to say to people who talk about CUDA versus OpenCL that you're
asking the wrong question. You're still stuck in the API world and you're asking the wrong
question. That's not the concern. I don't have that concern, because I'm just going to write
LLVM IR and then have AMD generate code for their hardware and NVIDIA generate code for
their hardware from that, and I think Python can play a strong role for the intrinsics on top that
you still need, because each one will have their own intrinsics in terms of the instructions they
support, and you want to normalize that at a higher level, but at a language level, not an API
level.
Here's an example of what Numba does. Numba takes simple Python code, which if you look at
that and compare it to the C code, it's not that different, and it generates IR, which is this
intermediate representation that has no loops. It's just static single assignment (SSA). You
basically have an instruction and then a label for that instruction, and once a label is formed, you
don't make it again. You just basically build these blocks of instructions and then connect them
in a DAG. The optimizations are done on this IR. All optimizations basically just take this, read
it and recycle it and create better versions. I'm sure folks in here know far more about compilers
than I do, and you have your intermediate representations. Every compiler has something that
plays this same role, and so I'm sure there's ways to leverage it inside of Microsoft, as well. But
this is what we're doing, and then we used the LLVM project to -- in memory, we don't actually
emit that string. We just in memory create the equivalent C++ objects that then, from that
infrastructure, they can build machine code. So we can get ridiculous speedups. Some of this is
a little bit of a lie, in the sense that nobody writes Python code like that. Of course, they
couldn't, because they would never wait to do image processing with for loops in Python, but
with Numba you absolutely can. You can write for loops in Python. As long as your arrays are
typed, I can write this four-dimensional for loop and have it happen instantaneously, equivalent
to as if I'd written it in C. Very exciting to me as a person who has done a lot of extending
Python: here's my NumPy array, and now I want to do some function, so I've got to pull out C.
Cython has emerged recently as a way to do that, sort of, with again, just adding type
information. With Numba, we do more type inference and we use the type information that's
already present in the NumPy arrays that it's called with. There are two ways to invoke it: jit
and autojit. With jit, you tell it what the input types are and the output type you expect, and then it compiles it
right there and replaces your Python function with an optimized version, one with machine code.
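As a rough sketch of what that looks like in practice -- using the modern `@jit(nopython=True)` form rather than the explicit type-string form described here, and falling back to plain Python if Numba isn't installed -- consider:

```python
import numpy as np

try:
    from numba import jit                  # compiles the function to machine code
except ImportError:                        # fallback: run as plain (slow) Python
    jit = lambda **kw: (lambda f: f)

@jit(nopython=True)
def sum2d(arr):
    # Explicit loops over a typed NumPy array -- slow in the interpreter,
    # but compiled to C-like machine code by Numba.
    m, n = arr.shape
    total = 0.0
    for i in range(m):
        for j in range(n):
            total += arr[i, j]
    return total

a = np.arange(12, dtype=np.float64).reshape(3, 4)
print(sum2d(a))   # 66.0
```

The decorated function is replaced by a compiled version the first time it runs with these types.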
I think Numba changes the game, because it essentially makes Python on par, or I say Python,
but it's really a subset of Python, some typed container version of Python with a few instructions
removed -- makes that a compilable language, equivalent to as if you had written
C++, C or Fortran, too, with an asterisk: minus some of the optimized Fortran compilers.
We don't quite get there yet, but I believe we can. You don't have to reach for C anymore, and
for NumPy users, that's a huge deal, because even though we have a lot of optimized libraries,
sometimes you just know how to write a for loop to do your problem, and you just want to write
a for loop to do your problem, and now you can do it, and you don't have to kind of go through
any motions or learn other language or try to figure things out. I have multiple examples of this.
This is adapted from something from Prabhu, Performance Python, where it's just solving a
Laplacian equation, del-squared equals zero. You have some boundary conditions and it's an
update mechanism. You just find the average at every point and iterate and keep doing that, and
here's the update formula. There's two versions here shown. One is sort of the raw -- you write
all the for loops out and use index expressions, indexing, to get to the elements of the array
object.
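A minimal sketch of that loop-based update, in plain NumPy (the function name and grid are illustrative; this is exactly the kind of loop Numba can compile):

```python
import numpy as np

def laplace_step_loops(u):
    # One Jacobi iteration with explicit loops: each interior point becomes
    # the average of its four neighbors; boundary values are left untouched.
    n, m = u.shape
    out = u.copy()
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i, j] = 0.25 * (u[i - 1, j] + u[i + 1, j]
                                + u[i, j - 1] + u[i, j + 1])
    return out

u = np.arange(25, dtype=np.float64).reshape(5, 5)
print(laplace_step_loops(u)[1, 1])   # 0.25 * (1 + 11 + 5 + 7) = 6.0
```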
Here's the way people would have done it in NumPy without doing this for loop, and there's
benefit to this. And so part of my --initially, we wrote Numba, and then there's this pain in my
heart that goes, wait a minute, I'm just telling people to now unroll their loops. And, in fact,
there's a Julia blog post where that's what they say, too. They say de-vectorize your code. But
you're losing something there. Sometimes it's a great idea, but once you learn the syntax of
slicing, this is easy to read and easy to understand, and it's at a high level and there's more
information there, I think, that we can use to optimize it. I don't want to kind of force people to de-vectorize their code. So in Numba, we actually support
array expressions, and you can write this, and then we generate the code for you. Instead of
using NumPy slicing, we actually write the LLVM code that does the equivalent of that slicing.
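For contrast, here is the same Jacobi update written as a NumPy array expression with slicing -- the high-level form the talk argues shouldn't have to be unrolled (same illustrative grid as above):

```python
import numpy as np

def laplace_step_sliced(u):
    # Each slice is a shifted view of the grid, so the whole interior
    # updates in one vectorized statement -- no explicit loops.
    out = u.copy()
    out[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1]
                              + u[1:-1, :-2] + u[1:-1, 2:])
    return out

u = np.arange(25, dtype=np.float64).reshape(5, 5)
print(laplace_step_sliced(u)[1, 1])   # 6.0, same result as the loop version
```

In plain NumPy each of those slice additions allocates a temporary array; generating fused code for the whole expression, as Numba does, is what avoids that cost.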
It has the benefit of no temporaries, too, because that's a big problem with NumPy right now, is
an expression like this, you create a lot of intermediate memory, and that ends up slowing you
down. Most of the performance problems come from that. Most of the speed up shown here
actually comes from just the memory allocation differences. So the results, this just gives you an
idea. The point of this slide is to communicate that with Numba, we're getting to the same speed
as any of the other technologies that are out there, Weave, Cython, writing C, and even looped
Fortran, Fortran where you actually write the loops. Now, what we're not beating is the array
expression Fortran. Compilers will take an array expression, much like the lower point here.
Basically, if you change the syntax slightly, this is how you can write Fortran 90 today, and the
Fortran compilers will still be faster. We haven't done a lot, and again, Numba is a new project
still. It's not that heavily funded. We're a startup, and we've got a few bright guys on it, but I
believe we can get faster, and to me, that's the target. It's that vectorized Fortran is the target
code we want to generate. Now, aside from Numba, llvm-py is worth looking at, too, because
Numba is one particular entry point: taking a certain kind of Python code and
writing machine code. Llvm-py is just a Python wrapper over the LLVM project, all of the LLVM
APIs, very useful. Take a look at it. You can basically write a compiler in an afternoon. I took
Dave Beazley's compiler course, which he teaches in Chicago, very enlightening, very
instructive. He had just converted to start using llvm-py as his back-end code generator, and it
was amazing to me how simple it was to build a compiler for whatever language I wanted. By
just using Ply and llvm-py, I could define a whole language and get a machine-code-generating
compiler out of it in an afternoon, basically, so it's very helpful just to look at as a tool.
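To give a feel for that exercise without requiring llvm-py itself, here is a toy expression compiler that uses Python's stdlib `ast` module in place of Ply for parsing, and lowers to a flat stack-machine instruction list in place of LLVM IR (everything here is a stand-in, not the llvm-py API):

```python
import ast
import operator

OPS = {ast.Add: 'add', ast.Sub: 'sub', ast.Mult: 'mul', ast.Div: 'div'}

def compile_expr(src):
    """Parse an arithmetic expression and emit stack-machine instructions."""
    code = []
    def emit(node):
        if isinstance(node, ast.Expression):
            emit(node.body)
        elif isinstance(node, ast.Constant):
            code.append(('push', node.value))
        elif isinstance(node, ast.BinOp):
            emit(node.left)
            emit(node.right)
            code.append((OPS[type(node.op)],))
        else:
            raise SyntaxError(f'unsupported node: {node!r}')
    emit(ast.parse(src, mode='eval'))
    return code

def run(code):
    """Interpret the instruction list on a simple operand stack."""
    fns = {'add': operator.add, 'sub': operator.sub,
           'mul': operator.mul, 'div': operator.truediv}
    stack = []
    for instr in code:
        if instr[0] == 'push':
            stack.append(instr[1])
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(fns[instr[0]](a, b))
    return stack[0]

print(run(compile_expr('2 + 3 * 4')))   # 14
```

With llvm-py the back end would emit real LLVM IR instead of tuples, and LLVM would turn it into machine code, but the front-end shape of the afternoon project is the same.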
All right, so this brings me to Blaze. I'm really short of time, and I'm sorry, I really can
talk for a long, long time. I apologize. I've got a lot of material, perhaps. There are lots of
limitations in NumPy, and so we started Blaze. Probably the fact that it didn't work on distributed data was the
biggest one. The objectives for Blaze are to basically create more flexible array objects, having
variable-length dimensions, having missing data as a more fundamental value. Type
heterogeneity, not having it have to be exactly the same thing all throughout the data. Probably
the most important thing is that with Blaze, and why it kind of has to be a new project, as well, is
NumPy has a large user base, and people expect immediate mode in NumPy. You make an
array, A plus B, they expect to get an array out. They expect something to happen immediately,
but to do the kind of work we're talking about, you can't get something out immediately. You
have to build a deferred expression graph. So you have to use expressions, NumPy expressions,
as a syntax for building up an expression graph. So with Blaze, everything is deferred, basically.
All your operations are deferred until you say eval or until you actually need the data out. So we
build an expression graph. The other thing we generalize is the type system. We have many,
many more data types, variable-length strings being one of the biggest ones, enums. The other
thing we do is we actually have a C++ library that's the foundation, and so it could actually be
used anywhere, instead of just for Python. So actually, NoJS integration with that C++ library is
entirely feasible, so you can get array-oriented computing in whatever language. And then to
handle heterogeneity, we merge the type and the shape so that they're literally the same thing.
We call that data shape. And then I'll mention briefly a project called PADS from GE Research.
I was really thrilled to find it this summer, because what happens is, as you start thinking about
moving code to data is you think a lot more about data description languages and a lot less about
type code. You think about I need to describe the data that's there, so a computer can create code
for it, I can pick the right instructions for it, and PADS basically was exactly that. It had a
slightly different user story than what we were contemplating, but they'd fleshed out kind of an
extension to -- based it on C notions and they'd kind of built out a data description language.
Super-excited about that. We're going to incorporate that into Blaze data shape, which is already
a -- I wouldn't say complete data description language, but a fairly significant data description
language for anybody doing algorithm development on most data. Certainly more broad than
Thrift -- Apache Thrift -- or protocol buffers or even Cap'n Proto, the next generation.
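The deferred-evaluation idea described earlier -- expressions build a graph, and nothing executes until you ask for a result -- can be sketched in a few lines of plain Python. This is a toy illustration of the concept only, not Blaze's actual API:

```python
class Expr:
    # Operators build graph nodes instead of computing immediately.
    def __add__(self, other): return Op('+', self, other)
    def __mul__(self, other): return Op('*', self, other)

class Array(Expr):
    """A leaf node wrapping concrete data."""
    def __init__(self, data): self.data = data
    def eval(self): return self.data

class Op(Expr):
    """An interior node recording an operation on two sub-expressions."""
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right
    def eval(self):
        # Nothing runs until eval(); this is where a real system would
        # compile and optimize the whole graph at once (e.g. via LLVM).
        l, r = self.left.eval(), self.right.eval()
        if self.op == '+':
            return [a + b for a, b in zip(l, r)]
        return [a * b for a, b in zip(l, r)]

a = Array([1, 2, 3])
b = Array([4, 5, 6])
expr = a + b * b          # builds a graph; no arithmetic has happened yet
print(expr.eval())        # [17, 27, 39]
```

Seeing the entire expression before executing any of it is what lets a system eliminate temporaries and move the computation to where the data lives.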
The big story of Blaze, though, is it's trying to unify multiple data sources, so the user stories
have been -- I have a directory of images that I want to understand as a single array. I have a big
directory of JSON files that I want to see as a single array of data, but I don't want to suck it in. I
just want to layer understanding over the top of this raw data. Understanding so that I can still
write high-level expressions and have results come out, and so obviously you have to actually
pull data in to do those results, but only when I want to do them, not all the time, just to even
understand and slice the data. So I need a synthesized view on a client sitting on top of multiple
data sources. So we have this notion of an array server that sits next to the actual data, whether
it's a database, a collection of files or even a GPU node. Data on a GPU node, of course, the
arrayed server would be on the host, but telling you what's stored there, so you can actually move
the code there. Progress on Blaze, we spent a long time trying to understand the space, trying to
do a lot of experiments -- 0.1 was released in June, 0.3 is supposed to be released in a week or
two, and that's the one I'm telling people is the first usable release, and by usable I mean you still
have to be kind of brave and willing to explore with us. If you're a NumPy user, I'm not saying
all NumPy users come start using Blaze. That's going to be a few months from now, at least, six
to eight months, I would say. Basic calculation work out of the box, we generate universal
functions or the equivalent with Numba. We actually generate an expression graph and compile
that using the optimizers in LLVM and get really fast results. It's nice. We do have a hard
dependency on this underlying C++ library we're calling DyND, and dynd-python is a
wrapper on top of that. There is some discussion about potentially NumPy 2.0 actually being
DyND, but that will be a community discussion and we'll see where that goes. It could go a
lot of directions. And then we have a persistent layer called BLZ that we spent a bit of time on,
not because we want to have our own file format, but because it guarantees us a columnar storage
tool in case somebody hasn't chosen one. And it gives us something to test on. So BLZ is
basically a columnar, persistent store for Blaze arrays so that they can back these large-scale arrays. And you can query it and do operations. Opening it doesn't mean
you read the data. You just do a query and that query happens out of core. It does it in
streaming chunks, it only pulls in the data it needs and tries to keep the cache hot as much as
possible.
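That out-of-core access pattern can be illustrated with a memory-mapped NumPy file -- the file name and chunk size below are made up for the example, and BLZ's actual machinery is more sophisticated, but the principle is the same: only one chunk is ever resident in memory.

```python
import os
import tempfile
import numpy as np

# Write a million float64 values to disk so we have something "out of core".
path = os.path.join(tempfile.gettempdir(), "blz_demo.npy")
np.save(path, np.arange(1_000_000, dtype=np.float64))

mm = np.load(path, mmap_mode="r")      # memory-mapped: nothing is read yet
chunk = 65_536
total = 0.0
for start in range(0, mm.shape[0], chunk):
    # Each slice pulls in only that chunk's pages from disk.
    total += mm[start:start + chunk].sum()

print(total)                           # sum of 0..999999
del mm
os.remove(path)
```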
But the one I'm really -- this demo kind of actually really illustrates what we're trying to do with
Blaze, and that is this. Basically, I have here a directory of JSON files, which are describing
Kiva loans. Kiva is a microlending set of tools, and it generates a lot of data and they're very
disparate, and it's not really uniform. You can see that this data shape -- data shape is our syntax
for the size and the type together. It's basically like your struct syntax. You can see that this has
1,000 -- excuse me, just over 1,000 of this, and the type is this huge, very nested, structured
thing, but it's laid out there, and I can go in and I can click on one of these, like the loans, and it
dives in. It drills into all that data, and I'm not pulling all the data in. I'm simply just going to
that section, and now I have a variable number of dimensions of this, and you can see all the data
type for it. So it illustrates this, once I specify the data description of what's stored, I can quickly
slice and dice and pull and grab and then do operations on those.
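A toy version of that idea -- walking raw JSON once to synthesize a data-shape-like description, without pulling the data into a computation -- might look like the following. The output syntax is only loosely modeled on Blaze's data shape:

```python
import json

def infer_shape(value):
    """Very rough data-shape inference for nested JSON-like values."""
    if isinstance(value, list):
        # Describe a list by its length and the shape of its first element.
        inner = infer_shape(value[0]) if value else 'void'
        return f'{len(value)} * {inner}'
    if isinstance(value, dict):
        fields = '; '.join(f'{k}: {infer_shape(v)}' for k, v in value.items())
        return '{ ' + fields + ' }'
    return type(value).__name__        # scalar leaf: just name its type

doc = json.loads(
    '{"loans": [{"id": 1, "amount": 250.0}, {"id": 2, "amount": 500.0}]}'
)
print(infer_shape(doc))   # { loans: 2 * { id: int; amount: float } }
```

Once a description like this exists, a client can slice into the structure and decide what to fetch, which is the point of the Kiva demo.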
So as the slide indicates, very quickly, from a data shape and raw JSON, I have a web service on
top of those data. Very easy to manipulate and read. Like I said, DARPA is providing help for
that. So the last part of my talk, I just want to talk about a few of the visualization tools, because
like everybody knows, nobody really cares until they see what you can show. Everybody wants
to see pretty graphics and pretty viz, and so we -- part of the DARPA grant, actually, one side of
it's the analytics. The other side of it is visualization and how do you present that to users, and
Bokeh is our plotting library. The best way to show Bokeh is just to go to the gallery here. If
you go to the Continuum repository and look at bokehjs -- it's basically CoffeeScript,
and therefore JavaScript, on the back end -- it will have a link to this page. And you just
go in here and you can click on one of these graphics, and it's an interactive graph. Click on
zoom, and you can zoom in and out of the graph. You can preview. I can resize, shrink the
whole graph or change it, click on pan. Select doesn't really make sense for this one. So that's
what I mean by interactive. It's got pan, zoom tools sort of attached to the graph.
But more than that, bokehjs also has -- it sort of has this ability -- we have this demo showing
just the fact that it is really interactive. And what I have here is a server running a sound
spectrum demo, so I'm sampling the microphone right now, and real time it's able to do the
Fourier transform on the server side, give the data to the browser and show the result. It's
showing several things here, actually. It's showing also this radial plot of the spectrum. There's
a few sliders here, and all these are doing -- when I say parameterized technical workflow, the
technical workflow is taking the sample card and doing the FFT, but parameterized by a few
variables, and I'm adjusting those variables via the browser, changing the frequency range so I
see the different data. I'm changing the gain so I can not clip, and then doing a 2D plot and
doing many, many updates at the same time. All this is bokehjs in Python. Yes, question.
>>: Based on Canvas?
>> Travis Oliphant: It's based on Canvas right now; HTML5 Canvas is what bokehjs uses. WebGL
integration obviously is of interest going forward. That's one of the reasons we've written this.
Now, Bokeh itself -- this is bokehjs, which is just a JavaScript library. In fact, could have
bindings to Ruby or whatever you wanted to use. It's a JavaScript library. Our purpose is to
actually make that very, very accessible to Python developers, so that they don't have to write
bokehjs code. They can write Python code and just do plotting and have it show up in the
browser. And I'm really running out of time, so I'm going to have a hard time showing you all
that story. But I can take you to Wakari. Wakari is kind of the way we bring this all together as
a platform. You sign in as a developer, or -- there are free accounts. Free accounts don't give you
very many compute resources, but it gives you something. Basically, it gives you 512 megabytes
of disk -- or excuse me, of RAM -- and basically one, two gigabytes of disk space.
And what it is, it presents to you an environment for writing data analytics code in the browser.
So it comes up and it gives you an environment, and presently, the default is actually an IPython
Notebook with a file manager, so you can upload and download data, although the intent is really
to use this to handle data that's already in the cloud, not to be really moving data to it, but you
may have some data that you need to move up and down. And then you write a notebook, and
then you can share these notebooks, which is the big feature. I can go to my account, and you
can see notebooks I've already shared. Anybody can go to my account, and you can see I've
shared a lot of these Numba notebooks, which show kind of how to use Numba to do the
equivalent of, say, writing special functions in SciPy. A lot of my work in SciPy was wrapping
code written in C and Fortran, and here I'm showing how that same code could have been written
in Python and give you the same speed. Anybody sees this, they click download this notebook
or run/edit this notebook, it will open in their Wakari environment. They can instantly reproduce
what I've just done. And part of the story there is not just the code, but also the environment that
it runs in, and that's the story of Conda, is we capture the whole stack of what is needed to run
that notebook, not just here's the notebook, but also it uses these packages, so you can share that
whole thing and somebody can quickly download it and install it. They don't even think about
installing it. It's just all of a sudden they can run your code, and they have an environment set up
that runs your code. So that's Wakari. Its relation to Bokeh is in the plotting, so there's a web
plot tool, and you can basically -- from the command line, you can build plots. And you can see
some of those plots here. These show up because of Wakari. I don't want Tweet Deck. So
finally, I'm going to show one more demo, which is CDX, which is running already. Just have to
go to the right place.
And CDX is our Continuum Data Explorer, and it's basically bringing table views and plot views
together into a single box, and I just have to figure out where to go to see it. Port 5030, that's
right. So if I go to local host, demo, then this brings up the CDX Data Explorer, which you have
data here that's stored. You have a table view in which you can do group by operations and you
have plot views, and I can bring up some of the plots I've already made. And these are Bokeh
plots. And they're interactive in the sense that I can select on these plots and have them update
in the table, although it's not working for me right now, explicitly. I think my server -- I can
debug that later. CDX is also a 0.1 product. It's very new, but it is available on GitHub. You
can download and install it. There's instructions on how to get it running. Yes, question.
>>: Can I add computed columns to this?
>> Travis Oliphant: Yes, you can add a computed column.
>>: So I don't need Excel anymore, basically.
>> Travis Oliphant: I wouldn't say that in this audience, but certainly one of the motivations -- but I'm sure you're thinking about this. I mean, obviously, a lot of science is done with Excel. A
lot of people use Excel, and the question about how do you take Excel to the next level is one
that's of relevance, and I think there's a lot of ways you can merge what you're doing with Excel
with what you can do with Python and kind of have the best of both worlds. I think an Excel
front end would be excellent.
>>: In our group, we actually did a Python to Excel bridge, so it's called Pyvot, it's open
sourced, available on the CodePlex website, and essentially it's a live two-way bridge between
Visual Studio, between Python and Excel, so it ties them all together.
>> Travis Oliphant: I'm excited about that, too. We'll advertise that one, too.
>>: And DataNitro, we believe, took that and did a startup on it and they're doing very well,
from what we hear. They just got funded $5 million. They're selling like hotcakes.
>> Travis Oliphant: Nice. Yes, question.
>>: Do you have any tools that support large data simulation?
>> Travis Oliphant: Large data simulation?
>>: Like if I wanted to pretend I had a database up there with 17 trillion records, but I only have --
>> Travis Oliphant: Oh, I see. Not specifically. I mean, you could certainly write something
like that with Python in a day or two, but no, not specifically.
>>: And could you say something about your licensing?
>> Travis Oliphant: Oh, sure. Everything I've talked about here is basically BSD licensed.
We're very careful about that. We have commercial clients. We do have available GPL
packages, obviously. We try to keep those in repositories that are specifically labeled GPL so
that someone can add those to their repository index if they would like them but also can exclude
them, if they'd like, as well. We use a BSD or Apache. Either one, but BSD mostly, just
because that's been the tradition for a long time in the Python science world.
So that's basically the story. I didn't finish all of the slides, but it just goes through some of these
other things. But I just wanted to show you kind of some of that and then answer any questions
you have.
>>: You mentioned the hype around big data. Is big data even defined, or how would you
define it? When you remove the hype?
>> Travis Oliphant: It depends on who you are. To many, big data is it doesn't fit in Excel.
That's probably the 85% power law distribution, that's what they think of big data, is does it fit in
Excel? So what do I do now? I don't have a better definition than that, other than typically my
tools -- more generally than that, the tools I'm used to don't work with this data. For a NumPy
SciPy user, it might be much bigger, let's say, because I'm used to dealing with gigabytes of data
as a NumPy, SciPy user, and it might mean depending on the machine I'm on, because a lot of
people can just buy a bigger machine, put a terabyte of RAM in. They can do big data very
easily.
>>: Ten, 100 and 1,000 gigs are usually the --
>> Travis Oliphant: 10, 100 and 1,000 gig is what you all use?
>>: It's like different pockets of people say that's big data. And for some people, that's chump change.
>> Travis Oliphant: I've seen 10 terabytes as kind of a boundary. Petabyte, certainly at this
point, I think everyone would agree is big data. In Texas, we just call it data.
>>: Does CDX have support for doing something when you have, say, a billion records and it's
too big to fit in a traditional Canvas?
>> Travis Oliphant: Right, so not directly now, but that's really essentially the effort of Blaze.
Blaze is an execution engine underneath, and CDX is the front end for that, because CDX uses
Pandas, leverages Pandas, leverages NumPy, and those are in-memory sort of tools. But the way
it's architected with references, you look at the CDX, you're plotting actual strings, you're
plotting references to these tools. So I can put a computed column in here that divides -- HDIV
A is one I like to use, which is HP divided by Accelerator, and there are still some bugs here.
And I can plot that directly and the plot shows up. Great, cool. It was supposed to show up
before.
>>: It's less of a technical question. It's more kind of a visualization question. It's how do you
look at big data?
>> Travis Oliphant: How do you look at big data. Well, you would be interested in these slides,
which I didn't go over, which are about abstract rendering. Probably the most interesting thing
that came out of the XDATA work this summer, which is a way to actually talk about visualizing
huge data that doesn't fit into memory by doing this abstract rendering pipeline. Yes, you can
find out about that online, and we can also talk about that, if you're interested in that particular
aspect. This is the work of one of our subcontractors at Indiana University, as well as
Peter Wang. Thank you. It's been great to be here. I'm happy to answer questions for as long as
we have the room.