>> Wenming Ye: Hello. My name is Wenming Ye. I'm a program manager here at Microsoft Research. Today, we've invited Travis Oliphant. He is the Founder and Chairman of NumFOCUS, an umbrella nonprofit foundation for a growing set of open-source Python software for numerics and analytics, and he is also the author of NumPy and SciPy. He has 15 years of experience working with scientific computing and Python. He has also just founded a company called Continuum Analytics, where he is the CEO, and it focuses on data analytics tools and services with Python. And welcome, Travis. >> Travis Oliphant: Thank you, Wenming. Appreciate it. It's a real treat to be here. Glad you're here, those in the room and those who are joining us remotely and time lagged, as well, on the DVR, I guess. I may be speaking to you in the future. Happy to be here to talk basically about the role that Python has to play in "big data" analytics, and I put it in quotes intentionally, because it is a term that's used, and so we have to use it, but like most of you here, I know there's a lot of hype around it and some truth. So we're not going to explore that entirely. We'll cover it a little bit, but mostly I'm here to talk about where I think Python is going to play a major role and some of the tools we're building to try to make that happen. First, just a little bit more about me to connect -- maybe some of you have similar backgrounds. I'm really a scientist at heart. I started with a master's degree in electrical and computer engineering. I studied satellite remote sensing. Here is just an image of a satellite with big beams, returning scatterometry data over the Earth's oceans, and from that you can find the wind speed. That was surprising to me at the time. It probably isn't surprising to many of you that the waves adjust based on the wind speed, and then the scatter that comes back to the satellite changes. 
But it's sort of the first example and my first taste of big data problems, and back in those days, it was a tape, and we had a big tape coming from JPL, and we just plugged it into this reader, and I had some Perl scripts that drove MATLAB codes. It was the first and only time I wrote Perl. I have an anecdote about Perl and why I love Python so much, because the Perl scripts I wrote I didn't understand after just six months, going back to the same codes. I had no idea what I'd done -- sort of write once, read never. My background really is science. This is the kind of image you can produce based on that satellite's data. You have just global Earth wind speed images. This same satellite could also measure ice over the Antarctic. It's just a scatterometer, backscatter, and you just adjust what you're inverting for and you can get different images. So I really enjoyed that. I stayed for graduate work, did a PhD at the Mayo Clinic, and there I studied wave problems again, but now these are waves of a different sort, vibrating inside the human body, and with both MRI and ultrasound, you could take a picture, a snapshot, and get the full three-dimensional -- actually, five-dimensional, if you think of time as one dimension and then the polarization as another. I can get five-dimensional data inside the body of a wave propagating, maybe at 1,000 hertz, or 400 to 1,000 hertz. So just attach a speaker to somebody, push and wiggle them, and you can see this wave propagating. It was fantastic. And my job was to actually do the inversion. I was trying to estimate -- I put that up in front of people at the Mayo Clinic and they get scared of all the indexes. I'm sure here people aren't scared of some of those indexes. That image is credited to the folks there at the Mayo Clinic, but this is the kind of data I dealt with, and then I would have to essentially get this five-dimensional data set, and I started to want to find derivatives of those data sets, so how to do that. 
And that's really what led me to Python. I was using MATLAB at the time, and I really liked the way MATLAB let me think at a high level. I didn't have to become a programmer. At the time, I had already done C, I had done Pascal. I did know how to program, but when I'm thinking about science, I didn't want to have to be thinking about pointers and abstract objects. I wanted to think about science and not get my brain full of other things. Most things about software engineering are really about neuroscience. They're really about us as humans and how little we can keep in our head or how much we can see on the screen. One day, I think there will be a department of neuroscience, computer science, so there will be actually programs where people study software engineering as a social or as a neuroscientific enterprise, understanding what it is about languages and why are some popular versus others not. It has to do with short-term memory and how much you have to keep in your head, versus how much you can understand. At the time, I wanted to think about the science I was doing, trying to invert that wave equation. I didn't want to think about all of the memory pointer chasing. And so I loved MATLAB, but it was not -- it wasn't hanging together with the big data I had. I had five-dimensional data, it was filling up RAM, filling up disk. I found Python, and so I started to use Python at that point back in '97, '98, and I haven't really turned back, so that was 15 years ago. At the time, Numeric existed, and it allowed me to do those same kind of high-level operations in a scripting language or a high-level language. So over the years, though, all the work -- I ended up doing a lot of work on just improving the ecosystem of tools around Python, to make it more accessible to even more people who weren't as familiar with programming as perhaps I had been. 
So I ended up starting the SciPy project, spent a lot of time developing that project, and in the process realized that Numeric and Numarray had to be merged, that there needed to be some changes made, so I wrote NumPy in 2006. All that work essentially turned me into a software developer. I still feel like I'm a scientist at heart. I was just at Los Alamos last week, and I always feel at home. I go back there and there are FORTRAN compilers and they're doing big MPI parallel runs on their supercomputers, and I just feel right at home. I love the conversations. But I can sometimes speak to the software developers among us, as well, although I've been playing catch-up a little bit most of my career as a software developer, trying to understand all the things you all learned in computer science while I was busy studying electromagnetics and MRI. Continuum started as a company -- Peter Wang and I founded the company in January of 2012, basically after watching NumPy and SciPy being used at a lot of large organizations -- in oil and gas, on Wall Street, by engineers at places like Procter & Gamble and Johnson & Johnson -- and seeing how they were using it and realizing a lot of the same reasons were the reasons I used it, but they were running into trouble, too. They were running into trouble as their data sets grew, their volumes grew, and there were these other initiatives taking place outside in the big data world. And we said, you know -- there's a classic Strata conference story. Peter Wang went to Strata, and at the Strata conference, Hadoop was everywhere, and everyone was talking about Hadoop. There was almost no mention of Python at all in the whole conference, at least visibly, publicly. However, you go to the actual sessions where people are talking about what they do with those data, and every one of them was using Python as kind of the end result, as the back end, for the in-memory data analytics they were doing. 
And we thought, at the time -- I realized I'd watched people use NumPy, realized where it was falling short, the changes that needed to be made, and thought we could make some fundamental changes to the way people are using array-oriented computing and actually map that to big data problems. We can actually do the same kind of things that people are doing MapReduce or large-scale operations for, so we had this vision, had this idea, and started the company to really, basically, bring NumPy and SciPy to big data -- that's one way to think about it. Now, it's evolved since then. I'll just give a brief shot of the team as it's grown since then. We're now at about 34 people -- developers and scientists. That's one of the things we love to do: get people who not only have a developer background but have a scientific background or an engineering background or a domain expert background. You'll hear me use the term domain expert. You use that here, I know -- SME, subject matter expert. These are the terms that computer scientists applied to people like me 10 years ago, when I was doing more science. Our big picture is to really empower people like I used to be, or am -- I would still love to think I am sometimes -- a scientist, somebody who has a real problem to solve. They don't want to spend their time thinking about software and development and pointers and development environments. They want to spend their time thinking about math and science and the big problems they're trying to solve, but they have to use computers. It's a big part of their problem, and so how do you build a platform or an experience for them that lets you take their expertise and move it to the big data that's available and then the hardware that's available? The other fascinating thing that's emerged over the past 15 years is hardware has gotten more parallel, more dense, and we're not taking advantage of it very well. 
Even at a time when that kind of hardware could absolutely help our scientists and our experts, we're still not taking advantage of it very well. I know I don't have to tell this crowd. You guys could probably teach me a thing or two about what you're doing to make that possible, but our goal is to build a platform that makes this happen. So part of that is we're big backers of NumFOCUS. At the time we started Continuum, I also, with a bunch of other open-source participants -- people like Fernando Perez of IPython, the late John Hunter of Matplotlib, Perry Greenfield and Jarrod Millman -- organized the NumFOCUS Foundation. A lot of this activity in the open-source world around NumPy and SciPy had been developed basically by grad students in their spare time, and maybe an assistant professor or two who were willing to give up their academic career in order to promote open-source software and tools. But this had been happening kind of under those individuals, as a very organic, community-driven thing. We decided it was really valuable to have an organization, and we applied for and got 501(c)(3) status for the organization, so we're a public charity whose whole purpose is to promote reproducible computing and accessible science, and we do other things -- we have a technical fellowship, really to kind of promote the grad students that are going to make the next generation of great tools, and also to promote women in science and technology, to make it more diverse. So NumFOCUS supports these tools, and we're big backers of NumFOCUS, so definitely check it out. I'm giving a talk later tonight, a PyData talk. One of the things NumFOCUS does is promote the PyData conference series, and all of the proceeds for PyData go to NumFOCUS, so we try to get people to sponsor PyData, to sponsor NumFOCUS, and all the proceeds go to building the tools and making them better. 
Now, as a company, Continuum -- you'll see me talk today about a lot of stuff, and all of what I'm going to talk about today is open source. Well, there's a tiny slide that talks about some things we sell. We do a lot of open source, but we have to pay the bills. What we do as a company is products, and we'll see a couple of examples of that, but not much. I won't go into much detail about those. We do training and then support and consulting around Python for science, analytics and technical computing. So that's a little bit about me, a little bit about how I got to where I am. I want to talk about big data now and a little bit about the hype associated with big data. This is kind of a curve. It's from Gartner, actually, from July of 2012. Many may have seen it. It shows the hype cycle and kind of where some of these big data technologies are -- an artist's rendition of what a hype cycle looks like. You've got the technology trigger, the peak of inflated expectations, the trough of disillusionment, the slope of enlightenment and the plateau of productivity. And depending on what you're talking about, you're sort of all over the place there. So there's certainly a lot of that around big data. In fact, there's a great blog post -- I don't think I included it here. I did, later. I'll come to it later -- about how sometimes people hear all the hype and just think they need to use big data tools when they don't. They really can get away with the traditional tools they're used to, but everyone is trying to do it right, or do it the way the other guy is doing it. As you all know, there's a lot of misinformation, and a lot of what you have to do as a software developer is try to educate and help people understand what they can do and what's available. 
But one thing that is not hype, and what is happening, is there is this collision course between sort of the traditional HPC high-performance supercomputing world and the big data business analytics world, where people from those communities are wanting access to advanced analytics. That's the term they use. They want advanced analytics, and they know that's essentially linear algebra at scale, linear algebra across multiple machines. So there's this collision happening, and it's interesting to watch as that happens -- technologies get absorbed or not, and confusion reigns most of the time. What's happening? Because a lot of the problems they're solving have been solved at the big data centers before, but under different circumstances -- maybe HPC, high-performance computing, whereas a lot of the big data discussion is around high-scalability computing. It's about fault tolerance. It's about, can one machine disappear and another one show up, whereas HPC supercomputer centers have always been about, no, if it goes down, it goes down. We're going to make this thing work. It's going to be stable, and that's why it costs millions of dollars. But there's an emergence of tools happening, and our belief, our strong belief, is that Python actually plays a strong role in the bridge between these two worlds. We're part of an XDATA program that DARPA is sponsoring, and actually the role we're playing in that program, which is a collection of 24 different groups around the country, all doing big data -- some MPI based and some Hadoop based, or Spark, really, or Shark or MapReduce based -- is kind of bridging the gap between those and being the Python story in that space. So I know what I'm talking about a bit here, but I'm also trying to sell a certain story, and obviously the space of big data is big enough that a lot of stories can be told. The story I'm telling is that Python has a big role to play in unifying a lot of these technologies. 
So why Python? A couple of slides here -- maybe some are familiar with Python, maybe some aren't. The biggest reason is really my story. I was a scientist, a graduate student. I wanted to solve problems. I didn't want to pull out my C programming experience and be a developer. I went to Python because it was easy to learn, the same reason I went to MATLAB originally. Easy to learn, with a lot of libraries associated with it. So domain experts can learn it, but yet, at the same time -- and Python has this where other domain-specific languages don't -- it's powerful enough for software developers to actually build systems. So Python is this very interesting place where domain experts and software developers kind of merge and come together in a very productive and useful, collaborative way. I've seen firsthand examples of that, both me interacting with the software devs of the Python story -- I'm actually still a Python committer -- and the kinds of changes they would make to satisfy the needs of scientists. Watching that emerge over the past 15 years has been really inspiring. The other aspect of Python -- you could say that of several other languages; other languages perhaps could also fill that role -- is that Python sort of gained critical mass. It's got a mature library, an ecosystem that's very, very large. There are over 30,000 packages on the Python Package Index. I'm sure it's a power-law curve as far as how many of those packages are actually something you'd care about, but there are hundreds and even thousands of packages that people actually use every day. Very large community of users in any domain that you're looking for, and I underlined "syntax matters," because I say it over and over again, and it's coupled to the first bullet point, which is that it's easy for people to learn. And syntax does matter. 
There are a lot of really great languages out there, and I agree, as a software developer, I think you should learn more than one. I'm not saying everybody who's a software developer should only learn one language. Haskell is a great language. Clojure, Lisp -- these are great languages, but the syntax is a little less accessible to the domain expert, and for that reason they kind of go, okay, I'll hand that over to my programming buddy and make him figure it out. But Python is one where they'll actually experiment and say, I can get my head around this. I can do this. And it's because it leverages their English language centers. There are some constraints there, of course, but it is a significant thing. So Python is being used in a lot of places, not just by scientists. I thank Charok [ph] for showing me this slide. I have other slides that show other users, and there are users that aren't shown here that I'm very, very aware of, but you can see big names, like YouTube -- big systems. In fact, we teach a class with Raymond Hettinger. We teach a lot of Python classes, and one of the things that happens is, you show up, you go there, and people kind of question, well, can Python scale? It's this little scripting language. And you just have to show them YouTube, Dropbox. These are written in Python. These have scaled. You can scale. As many of you know, scaling has less to do with the syntax of the language and more to do with how you connect the pieces, how you actually set up the architecture of the system. Very large organizations, certainly NASA, Google -- now, none of Wall Street is on this list, and yet I can walk down New York City and actually walk into most places and be recognized, actually. It's a little bit unnerving, honestly. You walk in places and they go, oh, yes, I know you. We're using NumPy and SciPy all over the place. Oh, really? Okay, sorry, sometimes. 
But it's been really exciting to see big investment banks adopt it -- 5,000 developers -- and actually those investment banks are also some of your customers. They use Windows tremendously -- huge Windows users. So two of the biggest investment banks are using Python. JPMorgan and Bank of America have huge programs with Python. Most hedge funds -- and they don't want me to ever tell you who they are -- but I would say almost 80% of the hedge funds are using Python internally. A few interesting Python stats, also from Charok [ph], except the second one. CPython.exe -- and this sort of illustrates the power that Python has for Windows users. There's a very, very large group of people who are both Windows users and Python users. So it's a great way to build community as a Windows platform, because a lot of Python users -- it's much different than, say, other communities that are very, very Linux centric. There are a lot of Windows Python users. 21 million downloads of just CPython.exe from Python.org. That doesn't include all the distributors that also have Python. Enthought has a distribution, ActiveState has a distribution. We have a distribution that's newer than those, but it's been around for about a year now, called Anaconda, and it has had 180,000 downloads at this point, even though it's only a year old. 65% of those are Windows, so a lot of Windows folks downloading and using Python. So, Python in science. My particular emphasis is on how Python tells a story about science, and by science, I'm pretty inclusive with that term. I think about data science. I think about anybody who's building models and trying to make predictions -- trying to get data, build models, make predictions and follow up with changes to those models. That's a lot of people, actually. I say Python is the language of science, and a lot of people back me up there nowadays. Lots of R users might disagree. 
There are a lot of folks using R. What I like to say, though, is that R sort of has the ear of the statistics department, as well as the scientists whose analytics is actually done by somebody they grab from the statistics department -- so some of the biology scientists and so forth. A lot of attention to R in that group. Python has the attention of all the other departments in the university -- physics, engineering, computer science. All of those folks are using Python. The IPython Notebook, which you all know here and are using productively, has really taken off in the past year, year and a half, as just a tool for reporting, showing, talking about your scientific work, no matter what your language is, actually -- and there are R hooks for the IPython Notebook. Then the other new development over the past two or three years is Pandas, and Pandas is a library built on top of NumPy that makes data processing more accessible. I'll talk a little bit about it later, and it's even started to convert some R users to Python. I've actually had a lot of conversations with the R developers over the past 10 years, and it's interesting. Some of them have come and said, we need to get people off of R, because R is just not a language that can scale very well. It was a really nice research tool, but then people are using it way past that cycle, and some of them are quite adamant: how about we just get people using Python instead? Of course, that's easier said than done, obviously. People invest a lot of effort in their scripts, but it is interesting to see -- there are a lot of people who recognize the benefits of having a general-purpose language that can actually grow in an open-source community, beyond being only a DSL. For those who haven't seen Python or NumPy -- we can actually take questions. Does that work for the video recording, if we take questions? Happy to answer them, actually. 
>>: Just wanted to ask how you saw MATLAB fitting in? >> Travis Oliphant: Sure, sure. >> Wenming Ye: Travis, can you repeat the questions? >> Travis Oliphant: Yes, thank you. Appreciate the reminder. The question was, how do I see MATLAB fitting into Python and science? So MATLAB has a strong story to play here, as well. A lot of user base, a lot of folks using MATLAB. There is a strong migration from MATLAB happening right now, especially among sort of non-Simulink users. MATLAB still has a very strong, and sort of the only, story when it comes to embedded digital signal processing and embedded systems. They have a very, very nice product called Simulink. And then, there are a lot of users of MATLAB, so my perspective, of course, is biased, but everyone I talk to is just migrating from MATLAB. A lot of reasons for that. MATLAB is still a great set of libraries. In fact, I've been talking to the MathWorks about just supporting Python and selling a library product into the Python ecosystem. I think they'd do very well, still, and there's some movement in that direction, actually. So I think MATLAB is going to be around for a while, just like SAS is going to be around for a while, just like SPSS. There are a lot of tools in this space, but in terms of the default -- the way science is done and published -- five years from now, I see a lot of Python and a lot less MATLAB. But great question. It's hard to predict the future, of course, and it really comes down to what people can use and what's accessible. It might take longer than five years, because it really takes as long as the professors -- what they learned and what they're used to. People don't change their habits very fast. Once they've learned a language, once they've learned a way of doing things, it doesn't change much. So young people are moving to Python, while some of their professors are still using MATLAB. Examples. If you haven't seen Python -- sort of how it works -- I borrowed this. 
There are two links there, actually, and I was going to go to them, but I think I won't, for lack of time. If you see these later, you can go to those links. It just illustrates some syntax of how you do certain operations with Pandas and NumPy. Here's Pandas. babynames is a data frame collected from a whole directory of all the baby names that were listed in every year from 1880 to 1990, and it has the name, the gender and the number of people given that name, of that gender, that year. You collect all those into something called babynames, and then you want to basically add a column to the data frame that is a probability, a frequency-generated probability: what's the percentage of people that were given this name in this year? And so this is how you group by -- babynames grouped by the year and the sex -- and apply this function to the group-by result, and the function is pretty straightforward. It's just a simple array-oriented divide, so dcount is a whole column of data, and if you divide that whole column by a single float, which is the sum of the result, what you're building is another vector of data that is that list of probabilities. Here's an example of NumPy usage. This is a very simple example of just getting linearly spaced data, 20 data points, then calculating the sine, and then maybe another 500 data points, because I only have 20 samples and I want to interpolate to 500 samples. So you pull interp1d out of scipy.interpolate, do a cubic interpolation, and that returns you a function. You then call it on the new data set, X, and you get back samples on that interpolated grid. And then here's a plot command, and the thing I'll point out here is this is how you select out just the positive numbers. So this particular plot will show a sine wave and dots where the sine wave is positive. This other code here is actually the Game of Life, implemented in NumPy. 
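[Editor's note: the slide code itself isn't reproduced in the transcript, so here is a minimal sketch of the two snippets being described. The tiny babynames frame is an invented stand-in for the real Social Security data, and the slide's version used groupby(...).apply(add_prob); the transform form below computes the same per-group divide.]

```python
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

# --- Pandas: add a per-year/sex frequency column, as described above ---
# Invented stand-in for the real Social Security data.
babynames = pd.DataFrame({
    "year":   [1880, 1880, 1880, 1881, 1881],
    "sex":    ["F", "F", "M", "F", "M"],
    "name":   ["Mary", "Anna", "John", "Mary", "John"],
    "dcount": [7065, 2604, 9655, 6919, 8769],
})

def add_prob(dcount):
    # Array-oriented divide: a whole column divided by a single float.
    return dcount / dcount.sum()

babynames["prob"] = babynames.groupby(["year", "sex"])["dcount"].transform(add_prob)

# --- NumPy/SciPy: interpolate 20 sine samples up to 500 ---
x = np.linspace(0, 2 * np.pi, 20)
y = np.sin(x)
f = interp1d(x, y, kind="cubic")       # returns a callable
xnew = np.linspace(0, 2 * np.pi, 500)
ynew = f(xnew)

# Boolean indexing selects just the positive part of the wave, e.g.
# plt.plot(xnew[ynew > 0], ynew[ynew > 0], '.')
positive = ynew[ynew > 0]
```

Within each (year, sex) group, the prob column sums to 1, which is the frequency interpretation the talk describes.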
I talk often about array-oriented computing, to try to illustrate to people how, when you think about things as an array, it often simplifies the code, and the second corollary -- which we're still working on making true -- is that you can speed up the code. Well, you certainly can speed it up with NumPy, but once you get a compiler on top of that, you can create even more optimized code, because the expression gives you a lot more information when somebody hasn't hand-written the for loops. They've just given the expression they want evaluated, and many of you here are aware of those abilities and techniques, generally. But it illustrates that with NumPy and Pandas, you can write high-level code, do high-level things with very little code and quickly, and NumPy is really a library of precompiled loops that do it all in vectorized form -- very similar to MATLAB, actually. It gives you much the same result. Now, in the big data space, I call it the problem of Hadoop. I'll show my biases here just a little bit. Hadoop definitely has some positive things, but in the current hype cycle and the amount of press it gets, it's far oversold based on what it can do versus what people think it should be doing. I hear a lot that Hadoop wants to be the OS for big data. I'm not even sure what that means, actually, unless all of our OSs are going to be JVM based. But the part I know quite a bit about is that advanced analytics and Hadoop don't blend very well. A lot of people just count stuff with Hadoop, and they're trying to add advanced analytics, and it's a lot of work. I think there are better solutions. I think there's a better approach. And then what's happening right now is a lot of people are using Hadoop and they don't need it, because they're led by, well, that's what everyone says I need, and so I use Hadoop to do data -- and they have 600 megabytes of data. 
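[Editor's note: the Game of Life code on the slide isn't in the transcript; this is one common array-oriented sketch of it, not the slide's version. Neighbor counting is done with shifted copies of the whole board instead of explicit for loops, which is exactly the "give the expression, not the loops" point made above. np.roll makes the grid wrap around (toroidal), which is an assumption of this sketch.]

```python
import numpy as np

def life_step(board):
    """One generation of Conway's Game of Life on a toroidal (wrap-around) grid."""
    # Count the eight neighbors by summing shifted copies of the board.
    nbrs = sum(np.roll(np.roll(board, i, axis=0), j, axis=1)
               for i in (-1, 0, 1) for j in (-1, 0, 1)
               if (i, j) != (0, 0))
    # Birth on exactly 3 neighbors; survival on 2 or 3.
    return (nbrs == 3) | (board & (nbrs == 2))

# A glider on a 6x6 grid.
glider = np.zeros((6, 6), dtype=bool)
glider[0, 1] = glider[1, 2] = True
glider[2, 0] = glider[2, 1] = glider[2, 2] = True

board = life_step(glider)   # one generation; the glider still has 5 live cells
```

After four generations the glider reappears shifted one cell down and right, which makes a handy correctness check for the vectorized rules.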
There's a blog post here that was just recently published, and it got a lot of hits, and I've seen this also in practice -- a lot of people don't know, and someone tells them, and so they go, I've got big data. I've got a gigabyte of data. What am I going to do? It doesn't fit in Excel. There's a whole space of "doesn't fit in Excel but I don't need Hadoop," and that's not being communicated very well, generally, and so there are a lot of people going down incorrect roads. I think there will be a backlash to that, and probably an inappropriate one, because Hadoop does have uses. It does have use cases -- when you have really big data that doesn't fit on a single machine or a single disk. I still think there are other, better solutions than Hadoop, even in those cases. If you do need Hadoop, I say give Disco a try. I've seen a lot of people use Disco very productively. It's got a much simpler interface. It's not JVM centric. The fact that it's written in Erlang is really hidden from you. It's not really front and center. You can write MapReducers in whatever you like. It does do the MapReduce part. There are other solutions to HDFS, and this is one thing I'm kind of interested in over the coming years: Red Hat and Ubuntu -- and I'm sure you have a storage solution here, as well -- and Amazon. There are these key-value storage solutions emerging in the data centers already that really serve the same purpose that HDFS does for private clusters. Red Hat has GlusterFS, which they're promoting. I'm actually a big fan of Ceph -- CephFS, from Ubuntu; it's got some really interesting technologies in it. Swift is the OpenStack equivalent of S3, and I'd love to get more familiar with what Windows has in this key-value store space, the Azure Storage solution. So I think that's a great thing to be thinking about and storing your data in. 
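[Editor's note: to make the MapReduce model that Disco and Hadoop implement concrete, here is a plain-Python word-count sketch. The map_fn/reduce_fn/mapreduce names are invented for illustration; a real Disco job passes similar map and reduce functions to the framework, which then distributes the per-key reduce calls across machines.]

```python
from collections import defaultdict

def map_fn(line):
    # Map phase: emit (key, value) pairs -- one count per word.
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    # Reduce phase: combine all values emitted for one key.
    return key, sum(values)

def mapreduce(lines):
    # Shuffle phase: group every emitted value by its key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # One reduce call per key; these independent calls are what a
    # framework like Disco or Hadoop parallelizes.
    return dict(reduce_fn(k, v) for k, v in groups.items())

counts = mapreduce(["big data big hype", "big deal"])
```

Because each reduce call only sees its own key's values, the framework is free to run them anywhere, which is the whole point of the model.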
And then there are a lot of Python wrappers to HDFS and Hadoop, as well, that can really take a lot of the pain away, with interfaces to MapReduce. Now, one thing Hadoop is doing well is mapping code to data. Their distributed file system is connected to the scheduler, and so when you do a MapReduce problem, it does try to move the portion of the code you want to run near to where the data is, rather than pull the data around. For example, I spend a lot of time on Wall Street, and one of the big problems I've been a part of solving is credit risk. So people are out there trying to understand -- as you can imagine, after the 2007-2008 crisis -- what's my exposure to these companies that shouldn't be failing but actually can, or maybe do? Not even companies, but now countries: what's my exposure to Greece? What's my exposure? These large investment banks actually do tens of thousands of trades every single day that are over the counter, meaning there's no exchange. There's just a phone call with a salesperson saying, hey, I want to do a deal. And there are basic terms of those deals -- basic, common terms. Those deals are all rolled up, and now you have this exposure to this partner, this counterparty, with whom you have a lot of deals. And you want to be able to, on a regular basis, roll up -- well, great, how much profit am I making from you, yee-haw -- and then there's another group asking, how much profit are we making, and are we expecting them to pay us? And maybe they won't, if they go out of business, and so what's our exposure to them? Those calculations have to be done regularly -- in fact, as soon as possible, and connected to the trader who's making the trade, ideally. That's a lot of data. To really do it correctly, you've got to have all the firm's data available, accessible and ready to go. 
In fact, I know how to solve that problem, basically, with a single array-oriented solution that takes about 20 lines of code, and it's really quite simple. It's really quite simple, if you can actually organize it all together, but they spend millions of dollars -- actually, they can't solve it that way, because it's about moving the data. There is no place to store data like that and then run that expression on it. So they spend a lot of money that effectively comes down to grabbing data out of this encapsulation and serializing it over to that encapsulation and this object, and pulling it over to this object and this database, until your head spins and you're thinking, how is this even working? And it's very unstable, it doesn't work that well, and so that's actually one of the motivations for some of the things we're trying to do, which is to help empower the domain experts to still think at a high level, but then have the system actually organize the data correctly and well. And these folks aren't even thinking about Hadoop. If you stick around the Silicon Valley crowd, you think Hadoop has won and everybody's using Hadoop. You go to Wall Street, you go to the oil and gas companies, you go to big engineering firms, and they don't even really know what Hadoop is, still. Do I care about that? And most of the time the answer is, no, you don't, because it's not going to help you with your fundamental problems. So all of this really comes down to the idea that data has mass. This is not new. A lot of people know this. But what are the implications? I think we're just starting to understand them -- what they mean for programming paradigms and how we actually treat the way we write software. You can't move data around. IO is not increasing at the same speed compute is increasing. That has physical implications and limitations.
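That 20-line array-oriented roll-up can be sketched in even fewer lines of pandas. The trade blotter, counterparty names and mark-to-market values below are entirely made up for illustration:

```python
import pandas as pd

# Hypothetical over-the-counter trade blotter: one row per deal, with the
# counterparty it was struck with and its current mark-to-market value.
trades = pd.DataFrame({
    "counterparty": ["BankA", "BankB", "BankA", "FundC", "BankB"],
    "mtm":          [120.0,  -45.0,   30.0,   200.0,   10.0],
})

# Net exposure per counterparty is just a groupby-sum over all of the
# firm's trades -- provided the data is organized in one place.
exposure = trades.groupby("counterparty")["mtm"].sum()
print(exposure["BankA"])  # 150.0
```

The hard part, as described above, isn't the expression itself -- it's getting all the firm's data somewhere this expression can actually run.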
Here's a perspective -- I sometimes tongue-in-cheek call it data covariance -- this idea that we stop thinking about it from the perspective of the workflow. When I'm on the train station platform, watching data go by, and I'm building up my workflow and thinking about how data is moving through me, that's the train-station-platform perspective. How do I think about being on the train, being the data, and then what happens as code comes to me? If I'm staying still and code comes to me, what does that look like? It's kind of just inverting it, flipping it around, thinking about it a little differently. What does that mean for programming? What does that mean for compilers? What does that mean for the way I specify type systems and concepts? I think it actually has some fundamental implications, and, in fact, a lot of these perspectives are captured by a paper, one of whose principal authors was a Microsoft gentleman, Jim Gray. Perhaps you've seen this. I don't even know where Jim is. Maybe he's still here, maybe he's not. But if he is here and he sees this -- great paper. This has got to be my all-time favorite paper, addressing this question of scientific data management in the coming decades, and by scientific data management, you can really just say all technical data management -- managing data to do something useful with it. How do you put the scientist back in control of his data? I won't go through all these quotes -- there's a bunch here, and if you see the slides later, you can maybe read them, but I would just recommend going and getting this paper and reading it. It's really phenomenal. He talks about science centers. He talks about the fact that data is going to be sticky. It's going to stay where it is, and people are going to be coming to the data. That's been happening in the scientific world for a long time, but really quite badly, when it comes down to the tools scientists have.
They've got SSH, and they SSH in and maybe run a script, and that's the best they can do. Certainly, that would not have made Microsoft popular as a platform, if that's the way you presented to DOS users back 30 years ago, right? Here's your prompt, just go do your word processing this way, with a single terminal. The paper talks about metadata. You have to have metadata to enable access, and you have to have metadata and provenance storing all these things. It's amazing, actually, to read this paper and realize we're just trying to catch up with him and the other authors of this paper -- they've really laid the foundation already. Data independence; set-oriented data gives parallelism. One of our main thesis concepts is that a lot of what scientists need to do is use databases. Databases really are the preeminent "data has mass" solution: data sits inside the database, you run stored procedures, you run SQL queries, and that actually runs processing on the data. One challenge is that SQL really isn't powerful enough, and a lot of database companies add their own special brand of secret sauce to stored procedures to make it powerful enough -- so how do you actually take that idea of stored procedures and expose it to general computing? Scientists don't use DBs mostly because of that, because their problem is they need full programming languages. They need full power. SQL isn't enough. They need arrays. It's actually only recently that Postgres added arrays as a fundamental type in the database. And scientists have basically been asking, aching for an array-oriented database for years, for decades. So there are a few now emerging. SciDB, Stonebraker's product that Paradigm4 is promoting, is the first that I've seen that actually gets closer to this, along with what they've done with array types in Postgres. But basically, scientists can't manipulate their data once they load it, and for a scientist, that's the death knell.
If I put my data somewhere, and then I can't do what I want with it, I'm not going to put it there. Don't handcuff me. My data is everything. It's really critical. So if I put it somewhere, I need my data. I need to be able to do whatever I want, whatever I can possibly do with it. So how do you really provide that to folks? If you take this controversial view, perhaps -- I don't know that it's that controversial -- that the file formats scientists are using, HDF, NetCDF, FITS, are nascent database systems that provide this metadata and portability but don't have the execution engine around them, you can kind of see a fairly clear path that integrates these communities. It's a fantastic quote, because it was written in 2005. I hadn't read it until we started Continuum, but essentially what we're building with Blaze is exactly that. That's probably the best description of what we're trying to do with Blaze that I've ever seen. It's a hard problem, but we're getting there. We're making progress. I'll move on. The key question we're trying to answer is how do we move code to data while avoiding data silos? I don't want to just tell folks, okay, great, go move your data to this particular silo and then you're done, then we've got all the tools for you. Now, as a platform provider, as a data center provider, maybe you do want to say that. I'm not arguing that that's a bad business model. I think, in fact, the next decade's battle is about where people are going to move their data and who's going to control the compute around those data. It's why every platform provider says upload your data for free, no problem. We'll pay that bandwidth cost, because it has real implications. Once the data is there, people are much more likely to use compute around those data. That's going to happen, and that's great.
I think who's going to win is whoever provides the better tools around those data to give scientists what they need to get their work done -- and I use the term scientist, but you could just as well say business intelligence user, or any domain expert or advanced user who doesn't want to be a programmer but needs access to programming-type tools to understand the data. All right, let me switch gears a little bit. That's kind of the soapbox part of the story, I guess, the perspective from 30,000 feet, the way I see the world, but that's led us at Continuum to build a certain collection of open-source tools that we think are very valuable and are really excited about, and we've been thrilled to be able to get DARPA funding that's allowed us to do more of this in open source. That's why a lot of these tools are open source and we can keep them open source, and we aren't just trying to keep the lights on by selling something else. So the five tools I'm going to talk about, and I have one here in the corner that I'm not going to talk about, which is also interesting, I think, are Conda, Numba, Blaze, Bokeh and CDX. And I'll basically talk about Conda, Numba and Blaze and a little bit about Bokeh and CDX, but I'll kind of show Bokeh and CDX more. So Conda is a cross-platform package manager, with environments. The fundamental thing about moving code to data is that code is very flexible. We do have a story with a particular kind of code we move to data, but Conda is about whatever code you want. Whatever you've built, we can take that environment and reproduce it reliably and repeatably on whatever platform, whatever environment you care about. It's written in Python, but it's actually not Python-centric. It's a package manager. It can do anything. It can do Node. It can do whatever you like. I'll talk a little bit about that. Numba is our array-oriented Python compiler.
It's about people writing expressions. The motivation for Numba is really the fact that at Los Alamos and every other national lab, scientists are still using FORTRAN. We talked about how MATLAB fits into this, and we really should have also discussed how FORTRAN fits into this, because if you recall, FORTRAN was a high-level language, and still is. If you look at a vectorized FORTRAN implementation of some of these scientific codes -- I could have written the Game of Life in vectorized FORTRAN, and it would have looked very, very similar, actually. The cool thing, and the thing that keeps scientists still using FORTRAN, is that's still how they get their fastest code. The vectorizing FORTRAN compiler still produces much faster code for them than any other tool, so my motivation is to say, well, we have the same information in NumPy expressions. We ought to be able to produce code as fast as they're getting out of vectorized FORTRAN and still stay in this high-level ecosystem and forget the compile step, the setup step. So that's the motivation for Numba: to take array-oriented expressions and map them to not only CPUs -- and this is where I think we can actually maybe even beat FORTRAN in a few years, if we take advantage of hardware faster. By working with NVIDIA and AMD and these companies and actually bringing GPUs online, we can move quickly to get the compiler targeting those architectures. That's an unanswered question, but it's one I think we can try to approach. Blaze is sort of the centerpiece, partly because this is the idea that really started us: trying to figure out how to help people talk about their data and keep their data where it is. There are two ways to talk about Blaze. One is, if you're a NumPy or Pandas user, this is NumPy and Pandas for out-of-core, distributed data.
So if your data is too big to fit in memory, and you still want to do array-oriented calculations, that's what Blaze is for. Now, if you're coming more from the database perspective, then Blaze is basically a general database execution engine that maps, through an adapter, to an array server, so it can sit on top of any database and present a programming environment, and it's able to take Python code and use that as a stored procedure for your data. Bokeh is our browser-based interactive visualization tool for Python users. It's similar to D3. We get a lot of questions about why not D3. There are reasons for that, but one of the fundamental ones is we want Python users to be first-class citizens in web interactive visualization. We don't want to force you to have to be a JavaScript developer. A lot of reasons for that. I mean, one of the reasons for the popularity of Node, obviously, is that a lot of people use JavaScript, and if they're using JavaScript on the front end, they want to use JavaScript on the back end, kind of have this single -- well, the opposite effect is true, also. If you're a scientist using Python on the back end, you want to use Python on the front end, too. Now, we're not going to be able to convince all the web browsers to run Python, and we don't really need to, actually, because you can generate that. You can generate the JavaScript necessary to actually do the interactive visualizations and let the developer not have to worry about that, much the same way that most developers don't worry about the fact that Qt or Windows Presentation Foundation is C++ code underneath -- they still do it in Python, and the code needed to do that binding is generated for them. You can do the same thing in a browser. And then the last one is the Continuum Data Explorer, which kind of emerged from the tools we were writing as part of the XDATA program. In the corner, here, is a thing I'm pretty excited about.
This is in the same spirit as Bokeh's interactive visualization in the browser: it's building apps in the browser. It lets people build full scientific data apps backed by a technical workflow, while only having to think about that technical workflow -- and maybe a little bit about the DOM elements, the document object model elements, you're updating. You just write all that in Python, and the web app is generated for you automatically. That's a little project called Ashiba that's just about ready to be released as open source. It's still very nascent, very new, but I'm pretty excited about it. All of these kind of fit together as a single coherent story, actually, as our emerging platform, which we call a rapid app platform for subject matter experts or domain experts. It's to enable people like me, as a young scientist, people like my friends who are still scientists, to be able to build full-scale apps quickly so that their brain can focus on the research, the science, and they don't have to go through the long process of translating that just to get a demo up, just to get something that shows what they want to do. That platform -- how do you build that? We certainly have had that on the desktop for a long time. With Python, it has been trivially easy to build desktop apps with these kinds of tools. We want to make it just as trivially easy to do in the cloud, in the data center, with the web browser as the app tool. So it's got multiple components. Wakari -- I'll show a little example of Wakari as a data analytics engine, or excuse me, as the web browser component and the infrastructure on the front end. Anaconda is our distribution of Python on the desktop and also on the server side. Binstar I haven't talked about; I'll briefly mention it. It's basically our artifact repository for binaries, so that you can easily update your Conda environments.
Okay, so I can tell that I'm going to run over time here, so I'm going to have to speed up a little bit, but I think I've set the stage exactly as I wanted to, and now I'll just talk about some of the technologies -- and I'm happy to answer questions now, or especially afterwards, too; we'll have 20 minutes to talk after the talk. So Conda is our package manager, and it solves a fundamental problem that we see everywhere we go. It's the tension between the developers, who are the people that actually make things happen in an organization -- these are the quants at a Wall Street bank or a hedge fund, the geophysicists at an oil and gas company, the engineers at Procter & Gamble or Johnson & Johnson or at an aerospace company. They actually make the things that make the company work, and what they want is access to the latest and greatest. They love Python, because it's full of a community of people that are active and do stuff, and that means there are new packages, new versions of those packages, new things coming out every single day, and they want to use the latest and the best in their next project. And then, of course, you've got the people that have to put this into production. They typically call those the IT guys, information technology. They have to reproduce this. They want it to work and be repeatable. There's a natural tension that builds between those folks, and I think there's just been a lack of tools that let those folks cooperate more easily, and so Conda exists as a tool to help bridge this gap between the people that want rapid development and the people that want stability and reproducibility. So Conda is full package management. It's like yum or apt-get for Linux, but it's cross-platform -- it works the same on Windows, Linux or OS X. It also has this one thing that yum doesn't really do and should, actually, which is control over environments.
In the Linux world, things like Docker.io and other kinds of lightweight virtualization are sort of doing the same thing, but Conda essentially gives you lightweight virtual environments easily. You can build one, build an app, have it centered in that environment, and you know that even though somebody in another environment can install a new version of NumPy or a new version of scikit-learn, that's not going to affect your app. Your app still works. Most of the battles that happen in an IT organization -- it's really remarkable -- they're battling over, well, I need this version of NumPy for this to work, I need that version for that to work. The other way they solve this is to collect all of it together into a single binary, and then you have 15 versions of Python and you don't do any sharing at all. Of course, this is the same DLL problem that Microsoft has dealt with for many, many years. But in Python, we can do some very interesting things. Conda is architected to be able to manage any packages. It could manage R, Scala, Clojure, whatever. Obviously, those have their own packaging worlds. We're not going to try to take over packaging. We're just trying to make it easy to build packages for them. It uses a SAT solver -- a satisfiability solver -- to manage dependencies, and it has user-definable repositories. It's really quite a mature product at this point, and we've even got some people starting to use it for their own distributions. Pyzo is another distribution, from folks in Germany, and they're using Conda and think it's great. Conda is associated with an online service called Binstar. With Binstar, basically, you can build a Conda package from a recipe. We have a lot of recipes up there on GitHub, publicly available. And then you upload the recipes to Binstar.
Presently, you upload a built package, but we're in the middle of writing a build server, so that once a recipe is written, it can actually build a package for Windows, Mac and Linux for you and then be available on Binstar for anybody to download. And Binstar becomes a place where you can have trusted repositories. Anaconda will be a trusted repository. Any other organization could build their version of the distribution they think they want people to trust. And you can add multiple repositories to your configuration file. So it's all about connecting people to their packages and making sure that systems stay configured. I'm sure I could talk to people here who have done similar kinds of things in probably even much more sophisticated ways, but this is meant to be a free service to all the open-source projects that are out there, as well as being deeply connected with Conda and the environment notion that we have. So I believe we've actually solved the packaging and distribution problem. Python is getting better. Packaging in Python has been a mess for a while. It's getting better, and I see a lot of work in this direction, and probably a year from now they'll be where Conda is today -- maybe even less time, but maybe even more time, too. I talk to these folks; I know people in the Python world. Mostly, they have a different set of use cases, and in fact, during one conversation with Guido, when we were lamenting the fact that distutils didn't solve our problem, packaging doesn't solve our problem, he basically said, go write your own. Don't wait for us to do it. Just go do it. And so we took him at his word and we went and wrote our own. That does mean, though, there's Conda, there's pip -- okay, which one should I use? And the answer is, use what works. It doesn't matter. You can use pip inside of a Conda environment. It's not either-or; it's about using what works and what helps you manage your pain. So that's Conda.
Happy to answer questions about that later. Anaconda is just a collection of packages. It does have a single-click installer, very popular on Windows. A lot of our users of Anaconda are Windows users who go to our page, single-click install Anaconda and get a collection of all these packages they need for their distribution. This is the one slide where I do talk about some things we sell. We do sell some add-ons to Anaconda, which are proprietary, and one is that we take the compiler we have for CPUs and actually target GPUs. So Accelerate will take Python code and run it on the GPU. There are some examples. It's the easiest way to program a GPU. It's awesome. I've programmed GPUs the hard way, and you still have to sometimes, if you want to get everything out of, like, a big matrix-multiply problem, but I'm super excited about Anaconda Accelerate and NumbaPro inside of it, which targets the GPU with Python code. With IOPro, we basically sell speed and we sell connectivity, so IOPro is about connecting to your Vertica database quickly. Most typical connectors to ODBC databases will bring everything into Python objects and then convert it to a data frame or a NumPy array, so you have a lot more memory use and it's slower. IOPro bypasses that and makes a very, very fast connection to in-memory data structures that are going to be valuable for science. And then MKL Optimizations -- we just link against the MKL library for NumPy, SciPy and the rest, so we do sell that. Anaconda also comes with a very interesting thing called Launcher, so everybody who downloads Anaconda has this one place they can go, and it will show them -- essentially, it will show them all the packages. Every launcher points to one or more Conda repositories, and each package in those repositories can have an entry point and an icon, and if it does, it will show up in this launcher.
So even if you don't have it installed, it will show you, hey, this can be installed by just clicking this button, and now you have it on your desktop, as well. >>: So what's missing on this slide? >> Travis Oliphant: Python Tools for Visual Studio is missing from this slide. Totally agree. This has been a project we've been trying to work on, and coming in here will certainly help motivate us to get this working within the next -- hopefully, not more than a few weeks. >>: Just a related question. What do most people in the community currently use as their development environment? >> Travis Oliphant: That's a great question. It's pretty diverse, actually. Spyder is one that is quite popular. Compared to Python Tools for Visual Studio, it's not close, right? It's okay. A lot of people are going to the IPython Notebook recently, though it's different -- it's not apples to apples, it's apples and oranges. So a lot of people prefer to have an IDE still. A lot of Python users use Wing, but scientific users won't, because it doesn't embed an IPython console. It doesn't have a console there that's IPython-aware. There are a few new ones. Ninja-IDE has come up. Enthought has a tool called Canopy that was recently released this spring and is getting usage. >>: There's also PyCharm and PyDev. >> Travis Oliphant: PyCharm and PyDev. And actually, PyDev is very, very popular, especially in industry, and PyCharm is another one from JetBrains. >>: So not so much Eclipse? >> Travis Oliphant: Well, PyDev is an Eclipse plug-in. >>: But both PyDev and Spyder now are like one-man projects that are not very, very active. I mean, somebody's maintaining it, but not the way we're pouring calories into PTVS. >> Travis Oliphant: No, PTVS has become -- I think -- well, I don't think as many people are aware of it. And as they become aware of it, they are very excited. So we're very excited, actually, about trying to promote PTVS to the Python community.
I think you'll get a lot of people very excited about it. So that's kind of our distribution story. A lot of bringing code to data is honestly just the nuts and bolts of packaging and distribution and kind of boring things like that -- yet it's what scientists care about. In fact, when I wrote SciPy, most of the effort was actually building the Windows installer, which was also most of the benefit, right? Because once you had a Windows installer for SciPy, everybody used it. Before that, you only got 10 percent of the group using it. Windows is a very popular platform for scientists. >>: Can we quote you on that? No, seriously. >> Travis Oliphant: I'm happy to have you quote me, and particularly scientists in industry, who are very quiet. That's the other thing about scientists in industry: they don't go to conferences. They don't stand up and talk about what they're doing, but you go and you look and you see, wow, there are 5,000 developers who are all deeply embedded in Windows and have big applications written on Windows. One project you definitely should be aware of is enaml, nucleic/enaml, which is currently the way JPMorgan writes all of their GUIs, and it's a very simple QML-inspired declarative syntax for writing GUIs with Python, very easy to use. I have some anecdotes to share later if you're interested. >>: And the good thing is, like, they often have money and are willing to pay. >> Travis Oliphant: Correct, correct. Yes, exactly. All right, so next, shifting gears a little bit now to some of the other technology we're building, and I'll only have time to talk briefly about it, but you can go online and learn a lot more about Numba. Numba is actually fairly mature at this point, even though it's pre-1.0 and will be for a few more months, because we're re-architecting it. There's a next-generation rewrite.
Well, essentially, what happens is you write Numba and you realize we've actually written a little language here, and we should just formally specify that little sublanguage and actually have that be -- essentially, the way I see it, it's Julia in Python. It's sort of the same ideas that Julia is promoting, but in Python and connected to Python very deeply. Numba was all about getting the low-hanging fruit that was out there, and there's a ton of low-hanging fruit for scientists who are using NumPy, and therefore using typed arrays, and then starting to write for loops over NumPy arrays, but in Python, which was really, really slow. And with the LLVM project, it wasn't that difficult to create essentially a translator for a certain class of Python problems, Python use cases, particularly ones where they're using NumPy arrays or other typed containers, and create fast code for them. It was not meant to be a JIT, directly. In fact, we do use the term JIT, but technically it's closer to an import-time compiler or something like that. It's not a tracing JIT. It doesn't watch your code and then try to speed it up. Actually, you just say which code you want to compile and which you don't want to compile -- very explicit about it. So you want to take those high-level arrays, people using typed containers, and create fast code. That's always been possible, but before LLVM existed, you would have to do a lot of the code generation yourself, and as I like to say, writing a compiler is easy if you don't have to worry about the parser or the code generator. That's what we had here, basically. We didn't have to worry about either one, so it was really nice. Numba comes from NumPy and Mamba. Mamba as in black mamba -- fast. A little corny, but it's easy to say and easy to remember, so it kind of stuck. It's built on top of the LLVM library, like I said. That's really where it gets its code generation from.
We leverage it heavily, so we benefit from the new versions that come out, and it's also why we can target the GPU, because essentially NVIDIA's whole compiler chain, NVCC, actually uses LLVM to generate PTX. That's what they're using. Apple, of course, has put a lot of money into Clang, and it still works on Windows, which is great. Enough people have put effort into making Clang work on Windows that LLVM works on Windows, and you can build Windows binaries, machine code, from LLVM, as well. I just heard AMD also has a big effort in heterogeneous computing, and they have a translator from LLVM IR to their intermediate-language compiler chain -- basically their equivalent of PTX, if you know anything about GPUs. And then ARM support is embedded in LLVM, as well, so as I like to say, it's a great cooperation venue for hardware vendors. I think it's phenomenal this has emerged. It should have emerged a long time ago, but it didn't, because people basically used C as that intermediate representation. Every person who built a piece of hardware wrote a C compiler, and then, if you could write the C language, that became the way we cooperated, even though C was certainly not designed for that purpose. It just sort of fell into that use case. LLVM IR is designed for that purpose [indiscernible], but certainly as a portable assembly it makes a lot of sense, and it's a way to have this really nice separation of concerns: the hardware vendors optimize their platform, software writers target it, and then you have really true, open standards. I tend to say that people who talk about CUDA versus OpenCL are asking the wrong question. You're still stuck in the API world, and you're asking the wrong question. That's not the concern.
I don't have that concern, because I'm just going to write LLVM IR and then have AMD generate code for their hardware and NVIDIA generate code for their hardware from that, and I think Python can play a strong role for the intrinsics on top that you still need, because each one will have its own intrinsics in terms of the instructions it supports, and you want to normalize that at a higher level -- at a language level, not an API level. Here's an example of what Numba does. Numba takes simple Python code, which, if you look at it and compare it to the C code, is not that different, and it generates IR, this intermediate representation that has no loops. It's just static single assignment. You basically have an instruction and then a label for that instruction, and once a label is formed, you don't make it again. You just basically build these blocks of instructions and then connect them in a graph. The optimizations are done on this IR. All optimizations basically just take this, read it and rewrite it and create better versions. I'm sure folks in here know far more about compilers than I do, and you have your intermediate representations. Every compiler has something that plays this same role, and so I'm sure there are ways to leverage it inside of Microsoft, as well. But this is what we're doing, and then we use the LLVM project to -- in memory, we don't actually emit that string. We just create the equivalent C++ objects in memory, and from that infrastructure, LLVM can build machine code. So we can get ridiculous speedups. Some of this is a little bit of a lie, in the sense that nobody writes Python code like that. Of course, they couldn't, because they would never wait to do image processing with for loops in Python -- but with Numba you absolutely can. You can write for loops in Python. As long as your arrays are typed, I can write this four-dimensional for loop and have it happen instantaneously, equivalent to as if I'd written it in C.
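Here's a minimal sketch of that pattern: explicit loops over a typed NumPy array, compiled by Numba's jit decorator. The function itself is a made-up example, and the try/except fallback just lets the snippet run as plain (slow) Python if Numba isn't installed:

```python
import numpy as np

try:
    from numba import jit  # compiles the decorated function via LLVM
except ImportError:
    def jit(func):  # fallback: run as ordinary Python if Numba is absent
        return func

@jit
def sum2d(arr):
    # Explicit for loops over a typed array -- slow in plain Python,
    # but compiled to machine code when Numba is available.
    m, n = arr.shape
    total = 0.0
    for i in range(m):
        for j in range(n):
            total += arr[i, j]
    return total

a = np.arange(12, dtype=np.float64).reshape(3, 4)
print(sum2d(a))  # 66.0
```

With the bare decorator, the types are inferred from the arguments on the first call, which is the lazy style the talk describes; the alternative is to spell out an explicit signature up front.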
Very exciting to me as a person who has done a lot of extending Python -- here's my NumPy array, and now I want to do some function, I've got to pull out C. Cython has emerged recently as a way to do that, sort of, again by just adding type information. With Numba, we do more type inference, and we use the type information that's already present in the NumPy arrays it's called with. There are two ways to call it: jit and autojit. With jit, you tell it what the input types are and the output type you expect, and then it compiles it right there and replaces your Python function with an optimized version, one with machine code. I think Numba changes the game, because it essentially makes Python -- or I say Python, but it's really a subset of Python, a typed-container version of Python with a few constructs removed -- makes that a compilable language, equivalent to as if you had written C++, C or Fortran, with an asterisk: minus some of the optimized Fortran compilers. We don't quite get there yet, but I believe we can. You don't have to reach for C anymore, and for NumPy users, that's a huge deal, because even though we have a lot of optimized libraries, sometimes you just know how to write a for loop to do your problem, and you just want to write a for loop to do your problem, and now you can do it, and you don't have to go through any extra motions or learn another language or try to figure things out. I have multiple examples of this. This one is adapted from Prabhu's Performance Python, where it's just solving the Laplace equation, del-squared u equals zero. You have some boundary conditions and an update mechanism: you just find the average at every point and iterate and keep doing that, and here's the update formula. There are two versions shown here. One is sort of the raw version -- you write all the for loops out and use index expressions, indexing, to get to the elements of the array object. 
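[Editorial sketch, not the speaker's slide code.] The looped update described above can be written in plain Python as follows -- a Jacobi-style pass where each interior point becomes the average of its four neighbors. This is exactly the kind of loop-heavy code that is slow in CPython but that Numba can compile; the grid and iteration count here are made up for illustration.

```python
# Jacobi-style update for the Laplace equation: each interior point
# becomes the average of its four neighbors. Plain Python loops --
# slow in CPython, but exactly the style of code Numba compiles.

def laplace_step(u):
    """One update pass over a 2D grid (list of lists); returns a new grid."""
    n, m = len(u), len(u[0])
    new = [row[:] for row in u]  # copy, so boundary values stay fixed
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                u[i][j - 1] + u[i][j + 1])
    return new

# Tiny example: a 3x3 grid with a hot top edge; iterate to steady state.
grid = [[100.0, 100.0, 100.0],
        [0.0,   0.0,   0.0],
        [0.0,   0.0,   0.0]]
for _ in range(50):
    grid = laplace_step(grid)
```

With Numba, decorating `laplace_step` with `@jit` would compile these loops to machine code; the pure-Python version runs unchanged, just slowly.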
Here's the way people would have done it in NumPy without the for loops, and there's benefit to this. And so part of me -- initially, we wrote Numba, and then there's this pain in my heart that goes, wait a minute, I'm just telling people to unroll their loops now. And, in fact, there's a Julia blog post where that's what they say, too. They say devectorize your code. But you're losing something there. Sometimes it's a great idea, but once you learn the syntax of slicing, this is easy to read and easy to understand, and it's at a high level, and there's more information there, I think, that we can use to optimize it. So I don't want to force people to devectorize their code. So in Numba, we actually support array expressions, and you can write this, and then we generate the code for you. Instead of using NumPy slicing, we actually write the LLVM code that does the equivalent of that slicing. It has the benefit of no temporaries, too, because that's a big problem with NumPy right now: with an expression like this, you create a lot of intermediate memory, and that ends up slowing you down. Most of the performance problems come from that. Most of the speedup shown here actually comes from just the memory-allocation differences. So the results -- this just gives you an idea. The point of this slide is to communicate that with Numba, we're getting to the same speed as any of the other technologies out there: Weave, Cython, writing C, even looped Fortran, Fortran where you actually write the loops. Now, what we're not beating is the array-expression Fortran. Compilers will take an array expression, much like the lower one here -- basically, if you change the syntax slightly, this is how you can write Fortran 90 today -- and the Fortran compilers will still be faster. We haven't done a lot of optimization work yet, and again, Numba is still a new project. It's not that heavily funded. 
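[Editorial sketch.] The vectorized form of the same update, using NumPy slicing instead of explicit loops -- one slice per neighbor. This is the style the speaker doesn't want to force people away from; note that in stock NumPy each slice sum allocates a temporary array, which is the overhead Numba's array-expression support is meant to avoid. Assumes NumPy is available.

```python
import numpy as np

# The same Laplace update as a NumPy array expression: one slice per
# neighbor, no explicit loops. In stock NumPy, each intermediate sum
# allocates a temporary; Numba's array-expression compilation emits
# equivalent LLVM code without those temporaries.

def laplace_step_sliced(u):
    """One update pass over a 2D array; boundary rows/columns stay fixed."""
    new = u.copy()
    new[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                              u[1:-1, :-2] + u[1:-1, 2:])
    return new

u = np.zeros((3, 3))
u[0, :] = 100.0                 # hot top edge
u = laplace_step_sliced(u)      # the single interior point becomes 25.0
```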
We're a startup, and we've got a few bright guys on it, but I believe we can get faster, and to me, that's the target. Vectorized Fortran is the target code we want to generate. Now, aside from Numba, llvm-py is worth looking at, too, because Numba is a particular entry point for taking Python code -- a certain kind of Python code -- and producing machine code. Llvm-py is just a Python wrapper over the LLVM project, all of the LLVM APIs, very useful. Take a look at it. You can basically write a compiler in an afternoon. I took Dave Beazley's compiler course, which he teaches in Chicago, very enlightening, very instructive. He had just converted to using llvm-py as his back-end code generator, and it was amazing to me how simple it was to build a compiler for whatever language I wanted: just using PLY and llvm-py, I could get a machine-code-generating compiler for the final language in an afternoon, basically, so very helpful just to look at as a tool. All right, so this brings me to Blaze, and I'm really short of time, so I'm sorry -- I really can talk for a long, long time. I apologize. I've got a lot of material, perhaps. There are lots of limitations in NumPy, and so we started Blaze. Probably the fact that it didn't work on distributed data was the biggest one. The objectives for Blaze are basically to create more flexible array objects: having variable-length dimensions, having missing data as a more fundamental value, type heterogeneity -- not having everything have to be exactly the same type all throughout the data. Probably the most important thing about Blaze, and why it kind of has to be a new project as well, is that NumPy has a large user base, and people expect immediate mode in NumPy. You make an array, A plus B, they expect to get an array out. They expect something to happen immediately, but to do the kind of work we're talking about, you can't get something out immediately. 
You have to build a deferred expression graph. So you have to use expressions, NumPy expressions, as a syntax for building up an expression graph. So with Blaze, everything is deferred, basically. All your operations are deferred until you say eval or until you actually need the data out. So we build an expression graph. The other thing we generalize is the type system. We have many, many more data types, variable-length strings being one of the biggest ones, enums. The other thing we do is we actually have a C++ library as the foundation, and so it could actually be used anywhere, not just from Python. So Node.js integration with that C++ library is entirely feasible, so you can get array-oriented computing in whatever language. And then to handle heterogeneity, we merge the type and the shape so that they're literally the same thing. We call that data shape. And then I'll mention briefly a project called PADS from GE Research. I was really thrilled to find it this summer, because what happens as you start thinking about moving code to data is that you think a lot more about data description languages and a lot less about type codes. You think, I need to describe the data that's there so a computer can create code for it, so I can pick the right instructions for it, and PADS was exactly that. It had a slightly different user story than what we were contemplating, but they'd fleshed out kind of an extension -- based it on C notions -- and they'd built out a data description language. Super-excited about that. We're going to incorporate that into Blaze's data shape, which is already a -- I wouldn't say complete data description language, but a fairly significant data description language for anybody doing algorithm development on most data. Certainly broader than Thrift -- Apache Thrift -- or protocol buffers or even Cap'n Proto, the next generation. 
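[Editorial sketch; every name here is invented for illustration and is not Blaze's actual API.] The deferred-evaluation idea described above can be shown in a few lines: arithmetic operators build nodes in an expression graph instead of computing anything, and work happens only when you call eval.

```python
# Toy deferred expression graph: '+' and '*' build graph nodes instead
# of computing, and nothing runs until eval() walks the graph.
# Illustrative only -- this is not Blaze's API.

class Expr:
    def __add__(self, other):
        return Op('add', self, other)
    def __mul__(self, other):
        return Op('mul', self, other)

class Array(Expr):
    """A leaf node holding concrete data."""
    def __init__(self, data):
        self.data = list(data)
    def eval(self):
        return self.data

class Op(Expr):
    """An interior node recording an operation on two sub-expressions."""
    def __init__(self, name, left, right):
        self.name, self.left, self.right = name, left, right
    def eval(self):
        f = {'add': lambda a, b: a + b,
             'mul': lambda a, b: a * b}[self.name]
        return [f(a, b) for a, b in zip(self.left.eval(), self.right.eval())]

a = Array([1, 2, 3])
b = Array([10, 20, 30])
expr = a + b * b        # builds a graph; no arithmetic has happened yet
result = expr.eval()    # only now is the whole expression computed
```

In Blaze, a graph like this is what gets handed to Numba and LLVM for compilation, rather than being interpreted node by node as this toy does.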
The big story of Blaze, though, is that it's trying to unify multiple data sources, so the user stories have been: I have a directory of images that I want to understand as a single array. I have a big directory of JSON files that I want to see as a single array of data, but I don't want to suck it in. I just want to layer understanding over the top of this raw data -- understanding so that I can still write high-level expressions and have results come out. Obviously, you have to actually pull data in to get those results, but only when I want them, not all the time, just to even understand and slice the data. So I need a synthesized view on a client sitting on top of multiple data sources. So we have this notion of an array server that sits next to the actual data, whether it's a database, a collection of files or even a GPU node. For data on a GPU node, of course, the array server would be on the host, but it tells you what's stored there, so you can actually move the code there. As for progress on Blaze, we spent a long time trying to understand the space and doing a lot of experiments -- 0.1 was released in June, 0.3 is supposed to be released in a week or two, and that's the one I'm telling people is the first usable release, and by usable I mean you still have to be kind of brave and willing to explore with us. If you're a NumPy user, I'm not saying all NumPy users should come start using Blaze. That's going to be a few months from now at least -- six to eight months, I would say. Basic calculations work out of the box. We generate universal functions, or the equivalent, with Numba. We actually generate an expression graph and compile it using the optimizers in LLVM and get really fast results. It's nice. We do have a hard dependency on this underlying C++ library we're calling DyND, and dynd-python is a wrapper on top of that. 
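[Editorial sketch, standard library only; the class and its interface are invented for illustration.] The "layer understanding over raw data" user story above amounts to a lazy view: listing a directory of JSON files gives you the shape of the collection without parsing anything, and only indexing into it reads the one file needed.

```python
import json
import os
import tempfile

# Toy lazy "array view" over a directory of JSON files: metadata
# operations touch no file contents; indexing parses one file on
# demand. This is the spirit of Blaze's synthesized views, not its API.

class JSONDirView:
    def __init__(self, directory):
        self.paths = sorted(
            os.path.join(directory, name)
            for name in os.listdir(directory) if name.endswith('.json'))

    def __len__(self):            # metadata only: no file is parsed
        return len(self.paths)

    def __getitem__(self, i):     # parse exactly one file, on demand
        with open(self.paths[i]) as f:
            return json.load(f)

# Demo: write three small JSON records, then view them as one "array".
d = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(d, 'rec%d.json' % i), 'w') as f:
        json.dump({'id': i, 'amount': 100 * i}, f)

view = JSONDirView(d)
record = view[1]   # only rec1.json is read here
```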
There is some discussion about potentially NumPy 2.0 actually being DyND, but that will be a community discussion, and we'll see where that goes. It could go a lot of directions. And then we have a persistence layer called BLZ that we spent a bit of time on, not because we want to have our own file format, but because it guarantees us a columnar storage tool in case somebody hasn't chosen one. And it gives us something to test on. So BLZ is basically a columnar, persistent store for Blaze arrays, so that you can have these large-scale arrays on disk. And you can query on it and do operations. Opening it doesn't mean you read the data. You just do a query, and that query happens out of core. It works in streaming chunks; it only pulls in the data it needs and tries to keep the cache hot as much as possible. But the one I'm really -- this demo actually really illustrates what we're trying to do with Blaze, and it's this. Basically, I have here a directory of JSON files describing Kiva loans. Kiva is a microlending platform, and it generates a lot of data, and it's very disparate -- it's not really uniform. You can see this data shape -- data shape is our syntax for the size and the type together. It's basically like your struct syntax. You can see that this has 1,000 -- excuse me, just over 1,000 -- of this, and the type is this huge, very nested, structured thing, but it's laid out there, and I can go in and click on one of these, like the loans, and it dives in. It drills into all that data, and I'm not pulling all the data in. I'm simply going to that section, and now I have a variable number of dimensions of this, and you can see all the data types for it. So it illustrates that once I specify the data description of what's stored, I can quickly slice and dice and pull and grab and then do operations on those. 
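[Editorial sketch; a generator stands in for columnar chunks on disk, and none of these names come from BLZ itself.] The out-of-core query pattern just described -- stream over chunks, keep only one chunk resident, accumulate a result -- looks like this:

```python
# Toy out-of-core reduction: the data "lives" in chunks (a generator
# standing in for columnar files on disk), and the query streams over
# them, holding one chunk in memory at a time. This is the pattern
# BLZ uses for out-of-core queries, though not its actual interface.

def chunks(n_chunks=4, chunk_size=1000):
    """Stand-in for reading successive columnar chunks off disk."""
    for c in range(n_chunks):
        yield list(range(c * chunk_size, (c + 1) * chunk_size))

def streaming_sum(chunk_iter):
    total = 0
    for chunk in chunk_iter:   # only one chunk resident at a time
        total += sum(chunk)
    return total

result = streaming_sum(chunks())   # sums 0..3999 without materializing them
```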
So as the slide indicates, very quickly, from a data shape and raw JSON, I have a web service on top of those data, very easy to manipulate and read. Like I said, DARPA is providing help for that. So for the last part of my talk, I just want to talk about a few of the visualization tools, because, as everybody knows, nobody really cares until they see what you can show. Everybody wants to see pretty graphics and pretty viz, and so part of the DARPA grant -- one side of it is the analytics; the other side is visualization and how you present that to users -- and Bokeh is our plotting library. The best way to show Bokeh is just to go to the gallery here. If you go to the Continuum repository and look at bokehjs -- it's basically CoffeeScript, and therefore JavaScript, on the back end -- it will have a link to this page. And you just go in here and you can click on one of these graphics, and it's an interactive graph. Click on zoom, and you can zoom in and out of the graph. You can preview. I can resize, shrink the whole graph or change it, click on pan. Select doesn't really make sense for this one. So that's what I mean by interactive. It's got pan and zoom tools sort of attached to the graph. But more than that, bokehjs also has -- we have this demo showing just the fact that it is really interactive. What I have here is a server running a sound spectrum demo, so I'm sampling the microphone right now, and in real time it's able to do the Fourier transform on the server side, give the data to the browser and show the result. It's showing several things here, actually. It's also showing this radial plot of the spectrum. 
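[Editorial sketch, standard library only.] The server-side core of that spectrum demo -- turn a block of samples into a spectrum -- can be sketched with a direct DFT; the real demo uses an FFT on live microphone input, and the block size and tone here are made up for illustration.

```python
import cmath
import math

# Toy version of the spectrum server's core computation: magnitude
# spectrum of one block of samples via a direct DFT (O(n^2), fine for
# a sketch; a real server would use an FFT on live audio).

def dft_magnitudes(samples):
    n = len(samples)
    return [abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

# A pure cosine tone sitting exactly on bin 3 of a 32-sample block...
n, k = 32, 3
tone = [math.cos(2 * math.pi * k * t / n) for t in range(n)]
spectrum = dft_magnitudes(tone)
# ...shows up as a sharp peak at bin 3 (with its mirror at bin n - 3),
# which is what gets shipped to the browser for plotting.
```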
There's a few sliders here, and what all these are doing -- when I say parameterized technical workflow, the technical workflow is taking the sound card input and doing the FFT, but parameterized by a few variables, and I'm adjusting those variables via the browser: changing the frequency range so I see different data, changing the gain so I don't clip, and then doing a 2D plot and doing many, many updates at the same time. All this is bokehjs and Python. Yes, question. >>: Based on Canvas? >> Travis Oliphant: It's based on Canvas right now -- HTML5 and Canvas is bokehjs. WebGL integration obviously is of interest going forward. That's one of the reasons we've written this. Now, Bokeh itself -- this is bokehjs, which is just a JavaScript library. In fact, it could have bindings to Ruby or whatever you wanted to use. It's a JavaScript library. Our purpose is to make that very, very accessible to Python developers, so that they don't have to write bokehjs code. They can write Python code and just do plotting and have it show up in the browser. And I'm really running out of time, so I'm going to have a hard time showing you all of that story. But I can take you to Wakari. Wakari is kind of the way we bring this all together as a platform. You sign in as a developer -- they're free accounts. Free accounts don't give you very many compute resources, but they give you something. Basically, you get 512 megabytes of disk -- or excuse me, of RAM -- and basically one to two gigabytes of disk space. And what it is, it presents to you an environment for writing data analytics code in the browser. So it comes up and gives you an environment, and presently the default is actually an IPython Notebook with a file manager, so you can upload and download data, although the intent is really to use this to handle data that's already in the cloud, not to be moving data to it, but you may have some data that you need to move up and down. 
And then you write a notebook, and then you can share these notebooks, which is the big feature. I can go to my account, and you can see notebooks I've already shared. Anybody can go to my account, and you can see I've shared a lot of these Numba notebooks, which show how to use Numba to do the equivalent of, say, writing special functions in SciPy. A lot of my work in SciPy was wrapping code written in C and Fortran, and here I'm showing how that same code could have been written in Python and give you the same speed. Anybody who sees this can click download this notebook, or run/edit this notebook, and it will open in their Wakari environment. They can instantly reproduce what I've just done. And part of the story there is not just the code, but also the environment that it runs in, and that's the story of Conda: we capture the whole stack of what is needed to run that notebook -- not just here's the notebook, but also that it uses these packages -- so you can share that whole thing, and somebody can quickly download and install it. They don't even think about installing it. All of a sudden they can just run your code, and they have an environment set up that runs it. So that's Wakari. Its relation to Bokeh is in the plotting, so there's a web plot tool, and you can basically build plots from the command line. And you can see some of those plots here. These show up because of Wakari. I don't want TweetDeck. So finally, I'm going to show one more demo, which is CDX, which is running already. I just have to go to the right place. CDX is our Continuum Data Explorer, and it's basically bringing table views and plot views together into a single box, and I just have to figure out where to go to see it. Port 5030, that's right. So if I go to localhost, demo, this brings up the CDX Data Explorer, where you have data here that's stored. 
You have a table view in which you can do group-by operations, and you have plot views, and I can bring up some of the plots I've already made. And these are Bokeh plots. And they're interactive in the sense that I can select on these plots and have them update in the table, although it's not working for me right now. I think my server -- I can debug that later. CDX is also a 0.1 product. It's very new, but it is available on GitHub. You can download and install it. There are instructions on how to get it running. Yes, question. >>: Can I add computed columns to this? >> Travis Oliphant: Yes, you can add a computed column. >>: So I don't need Excel anymore, basically. >> Travis Oliphant: I wouldn't say that in this audience, but certainly one of the motivations -- but I'm sure you're thinking about this. I mean, obviously, a lot of science is done with Excel. A lot of people use Excel, and the question of how you take Excel to the next level is one of relevance, and I think there's a lot of ways you can merge what you're doing with Excel with what you can do with Python and kind of have the best of both worlds. I think an Excel front end would be excellent. >>: In our group, we actually did a Python-to-Excel bridge. It's called Pyvot, it's open sourced, available on the CodePlex website, and essentially it's a live two-way bridge -- in Visual Studio -- between Python and Excel, so it ties them all together. >> Travis Oliphant: I'm excited about that, too. We'll advertise that one, too. >>: And DataNitro, we believe, took that and did a startup on it, and they're doing very well, from what we hear. They just got funded $5 million. They're selling like hotcakes. >> Travis Oliphant: Nice. Yes, question. >>: Do you have any tools that support large data simulation? >> Travis Oliphant: Large data simulation? >>: Like if I wanted to pretend I had a database up there with 17 trillion records, but I only have -- >> Travis Oliphant: Oh, I see. 
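[Editorial sketch; the table and column names are invented, echoing the HP-divided-by-Accelerator column from the CDX demo.] The computed-column idea from the exchange above is simple to show on a toy table -- a list of dicts here, whereas CDX itself builds on Pandas:

```python
# Toy computed column: evaluate a function per row and store the
# result under a new column name. Table is a plain list of dicts;
# CDX does the equivalent on top of Pandas.

def add_computed_column(table, name, func):
    for row in table:
        row[name] = func(row)
    return table

cars = [{'hp': 130.0, 'accel': 13.0},
        {'hp': 165.0, 'accel': 11.0}]

# An "hdiv"-style column: horsepower divided by acceleration.
add_computed_column(cars, 'hdiv', lambda r: r['hp'] / r['accel'])
```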
Not specifically. I mean, you could certainly write something like that with Python in a day or two, but no, not specifically. >>: And could you say something about your licensing? >> Travis Oliphant: Oh, sure. Everything I've talked about here is basically BSD licensed. We're very careful about that. We have commercial clients. We do have GPL packages available, obviously. We try to keep those in repositories that are specifically labeled GPL, so that someone can add those to their repository index if they would like them, but can also exclude them if they'd like. We use BSD or Apache -- either one, but mostly BSD, just because that's been the tradition for a long time in the Python science world. So that's basically the story. I didn't finish all of the slides, but the rest just goes through some of these other things. I just wanted to show you some of that and then answer any questions you have. >>: You mentioned the hype around big data. Is big data even defined, or how would you define it, when you remove the hype? >> Travis Oliphant: It depends on who you are. To many, big data means it doesn't fit in Excel. That's probably 85 percent of the power-law distribution -- that's what they think of as big data: does it fit in Excel, and what do I do now? I don't have a better definition than that, other than, more generally, the tools I'm used to don't work with this data. For a NumPy, SciPy user, it might be much bigger, because I'm used to dealing with gigabytes of data as a NumPy, SciPy user, and it might depend on the machine I'm on, because a lot of people can just buy a bigger machine and put a terabyte of RAM in. They can do big data very easily. >>: Ten, 100 and 1,000 gigs are usually the -- >> Travis Oliphant: 10, 100 and 1,000 gigs is what you all use? >>: It's like different pockets of people say that's big data. And for some people, that's chump change. 
>> Travis Oliphant: I've seen 10 terabytes as kind of a boundary. A petabyte, certainly, at this point, I think everyone would agree is big data. In Texas, we just call it data. >>: Does CDX have support for doing something when you have, say, a billion records and it's too big to fit in a traditional Canvas? >> Travis Oliphant: Right, so not directly now, but that's really, essentially, the effort of Blaze. Blaze is an execution engine underneath, and CDX is the front end for that, because CDX uses Pandas, leverages Pandas, leverages NumPy, and those are in-memory sort of tools. But the way it's architected, with references -- you look at CDX, you're plotting actual strings, you're plotting references to these tools. So I can put a computed column in here that divides -- HDIV A is one I like to use, which is HP divided by Accelerator, and there are still some bugs here. And I can plot that directly, and the plot shows up. Great, cool. It was supposed to show up before. >>: It's less of a technical question. It's more kind of a visualization question. It's how do you look at big data? >> Travis Oliphant: How do you look at big data. Well, you would be interested in these slides, which I didn't go over, which are about abstract rendering -- probably the most interesting thing that came out of the XDATA work this summer, which is a way to actually talk about visualizing huge data that doesn't fit into memory, by doing this abstract rendering pipeline. Yes, you can find out about that online, and we can also talk about that if you're interested in that particular aspect. This is the work of one of our subcontractors at Indiana University, as well as Peter Wang. Thank you. It's been great to be here. I'm happy to answer questions for as long as we have the room.