>> Andrew Begel: All right, everybody, I'd like to welcome Larry Votta to Microsoft to give a talk today about productivity in scientific computing. Larry is a consultant for software fault tolerance and productivity. He has a bachelor's from University of Maryland, College Park, and a Ph.D. from MIT in physics. And Larry worked for Bell Labs for about 20 years, 10 years working on realtime software development and about 10 years working on software engineering research. He worked in Peter Weinberger's (phonetic) lab there.
And then he retired in 1999 and then went to work for Sun Microsystems working on high-performance
supercomputing systems and worked on one of Sun's entries to the DARPA supercomputing project in the
last couple years.
He's authored more than 60 papers and several chapters in books on software engineering. So I'd like you
to welcome Larry Votta.
>> Larry Votta: Thank you, Andy.
Okay. A little bit of historical perspective here. Back in about '99 the Japanese brought on line what was
called the Earth Simulator as a significant technological event to highlight their maturity in being able to
build supercomputers and also to highlight a little bit the second millennium coming of age about being able
to model the climate changes that are suspected to be induced by human activity, and it was called the
Earth Simulator.
And it sent a shockwave actually through the Department of Defense and through the weapons industry in
the United States because of the fact that it claimed the lead in supercomputers.
And that charged off several people in the DARPA, which is the DOD's research programs community, to
start thinking about what was going on and what had happened. And this also followed -- this history may not be too familiar to people, so I'm going to get into a little bit of a historical thing and then I'll get into the remarks
about my talk, which is understanding the productivity gridlock in scientific computing.
The '90s were very, very tumultuous for high-performance computing. People were building a lot of
clusters and having a lot of trouble programming them. There were some major failures of programs.
The physicist community, for instance, undertook a tremendously ambitious program to be able to fully
simulate nuclear weapons, and that was called the ASCI program. Maybe a lot of people didn't see it, but it
did affect everybody because it was part of the ending of the Cold War, in the sense that the ban on nuclear tests, the Nuclear Test Ban Treaty, started in '91. And the reason the U.S. even signed up to it was because physicists said they didn't need to blow any more up, they could basically blow them up in computers.
And, of course, that was a bit of wishful thinking, and the '90s came and went and found us still struggling computationally to even be able to reasonably simulate tokamak fusion machines and things that would even have some societal value, although ending the Cold War certainly had a lot of value to society.
In any case, the DOD, going back to these three people, Bob Graybill, Bill Carlson and John Grosh, sent a letter to Congress basically out of the Department of Defense saying we're in a lot of trouble, actually. Bob Graybill was a program manager, John Grosh was DOD, one of the computing resource people, and Bill Carlson was a three-letter agency top computer architect.
And they said, basically, we have a crisis. We have a crisis because they saw all these programs failing.
We'll actually see where some or all of this is coming from.
And so they basically convinced Congress to actually fund what is called the High Productivity Computing Systems program, which was supposed to be the next generation supercomputer system that would in essence create a leapfrog in the ability of scientists and engineers to use the machine to accomplish work or add value. That's called High Productivity Computing Systems, or HPCS.
The program is in phase three now. There are two vendors doing it, Cray and IBM. Sun was in part of phase two. There were six that were down-selected to three in phase two and then two in phase three. Phase one was from 2001 to 2003, phase two from 2003 to 2006, and phase three from 2006 to 2011.
And those are rough dates, by the way. That's not exactly -- nothing ever falls exactly on December 31st or January 1st, whichever. Depends which side of the interval you want to go on.
Anyway, needless to say, Sun did not make phase three, but I think we learned a lot. And I was very fortunate to be one of the principal investigators in that program. And I also had the advantage that I was one of the few vendor lead PIs who was also a computational scientist in a former life. I actually wrote
one of the first computational fluid dynamic programs at MIT as a high-energy physicist.
Actually, I was modelling a neutrino detector, which I really wanted to build, and at the time, in 1971, it seemed to me to be a very modest price of $10 million. And it would have been stuck out at Fermilab
about two miles from the accelerator because you need a lot of earth to remove everything except the
neutrinos out of the beam. And, of course, the beam then goes through west Chicago and a few other
places, but neutrinos go through everything so we don't have any problem with that.
Anyway, so I had a lot of fun doing that. I also ended up doing some other computational theoretical work.
So it was a very, very unique position for me and actually seduced me to stay at Sun Microsystems to
actually go do this.
But in the end where I came out was that people talked about a productivity crisis, they talked about -- productivity gridlock is my word for it. And it's because, like any interdependent system of workflow, anybody who's studied project management for anything of significance, like any of the major operating systems or Office or the 5ESS switching software, knows that any of those things is a huge set of interdependent relationships among people and machines doing things.
And anybody who understands those things understands that bottlenecks do occur and gridlock does
happen, even when everyone and all entities involved have the right priorities and everything. And so in
some sense this is a little bit of that story as well.
I will also say this is kind of an interesting story for Microsoft as it was for Lucent Technologies and AT&T
Bell Telephone Laboratories and Bell Labs in that I think one of the most interesting problems to
understand as we move forward, because of where the wealth in our society has been captured and how it
is actually provided to all members of society, is how to maintain and evolve large legacy software systems.
And so part of this story is about that as well, because what you're going to see is the value of these things
is really in the fact that I can computationally do things rather than build them. And what I need to be able
to do then is say with some certainty that the computational results represent what would happen in the
real world.
So in any case, let me go through my remarks and then we can chat about a lot of different things. And
Andy was very, very kind in actually setting me up with a calendar for today, and hopefully I'll get a chance
to talk with people later on, and it should be a lot of fun.
So I'm always a believer that the answer is yes. Well, of course it's yes. The computational scientists need
to realize that their greatest strengths may be their greatest barriers to improvement. They were able to
take some of the earliest machines and come up with some very, very great results. And, of course, they
did some extraordinary things to be able to get enough computation there in the first place.
I'll tell you one little story about this. Remember I told you about that simulation I had written. And so I was
working for a high energy physics group called the Kendall, Rosensen & Friedman Group, and Henry
Kendall and Jerry Friedman are Nobel Prize winners in physics.
And I asked Henry Kendall, who had started his career as a computational high energy physicist, what did
he do. And this is, of course, working back on one of the first machines. He wouldn't tell me what it was
because he wasn't sure it was declassified yet. But he had talked to von Neumann and everything and had
actually been working as a graduate student with one of the people at Los Alamos and things. And he
eventually admitted to me that he had numerically inverted a five-by-five matrix to do some (inaudible)
theory calculation. This is, of course, 1952.
But it gives you an idea, that's like -- I mean, you could have done that on a VAX 11/780 with your eyes closed in 1980, or probably even with a Radio Shack Trash-80 if you just made sure that you had the floating point arithmetic correct on that. I never remember if I had the right floating point arithmetic or not.
But in any case, the point is that what computational scientists have done as they've grown up with the computer industry is they've actually worked on bare metal a lot of times, and they've actually done more tricks and more things than you would ever, as a software engineer, even want to look at.
On the other hand, software engineers need to realize that past successes solved similar but not identical problems to those confronting computational scientists, and therefore certain solutions may need to be reevaluated and reformulated. I mean, that seems to be relative. The cost benefits are different in the computational world than they are in the business world and in the automation of supply chains, marketing, all the management of information that we use to actually make our economy more efficient. And that's a different problem than what's confronting the computational scientists or design engineers as they use their tools.
Okay. So I'm going to talk about motivation and the productivity problem -- this is really kind of
fascinating -- what is it, the communication gap between scientific programmers and software engineers.
I'm going to give a little methodological sidebar, and I actually stuck this in because Andy and I have been
talking over the last six weeks or so.
And one of the contributions, I think, of our group to the HPCS community has been to bring about a
credibility in our studies of how do we really study computational scientists and how do we extract what
they're doing in a way that allows us to be scientists in a qualitative and quantitative sense. How do you
actually manage that information and, in essence, reduce it down to some set of models that are predictive
and allow you to understand what you're doing in design and in the building of these systems.
I'm going to talk a little bit about the expertise gap -- I'll actually see if I can manage to burn out somebody's retina -- the expertise gap. And one of the most significant things -- this turns out to be the subject of a 2005
Physics Today article I wrote with Doug Post, who was one of the computational physicists that I met doing
this DARPA High Productivity Computing System program.
The problem is, is that it's exactly like what it was in 5ESS after I got through all the terminology and
everything else, which is every time you make a change in a computational code, you don't know if you
broke something. Put a different way, how do I trust the results?
And anybody who has ever worked on change management and configuration management systems -- and I know several people at Microsoft Research spent some time at Bell Labs and everything else, and we had tremendous configuration management systems and we could tell what was changed and so on and so forth.
The reality, though, is you don't know if you ever broke it once you've made a change, and that's actually a
significant problem.
So how about breaking the productivity gridlock. One of the characteristics of the research I like to do is I
like to have empiricism embedded in it. So not only do you do observational studies to see what is
happening, and experiments to identify mechanisms so you could build theories and therefore end up
predicting, I also like to do experimental studies to see if the ideas that I have or that my team has actually
work in general, in vitro initially -- in other words, in the laboratory does it really work -- and then in vivo, in
life, meaning put these back into a team of physicists doing the next generation of weather forecasting
models or whatever and seeing if it really does improve their productivity and stuff.
And, finally, some concluding remarks and then I'm going to mention my collaborators, because there are
many and they're fun. And then for those insomniacs in the audience, I have a bibliography, and they can
certainly go at it. I'm only going to go highlighting through some of these, and feel free to ask me questions
at any time either live or in etherspace there and we'll try to address them in realtime.
So I don't think I need in this audience to talk about motivation for why high-performance computing is a big
deal. I will recognize that some of the things that I mention to people in the general population, and they
don't even think about it until you talk about it, is if you looked at the number of people on our highways and
the number of traffic deaths, it's going down. And a lot of it's because of social policies around dealing with
people driving under the influence of alcohol. Others are better safety laws, and so on and so forth, and
methods to operate a car.
But another whole set of things is that, to be perfectly honest, crashes still happen and not as many people
die, and auto crash models are a big part of that. And we don't have to crash them into walls anymore.
Every automobile manufacturer now can sit there and model and run through 10,000 crash sequences of
different configurations and different intersections under different weather conditions and get some idea
about how to tune and move safety features around in a car to actually provide the highest probability of
survival.
Another one which I found very interesting is Procter & Gamble packaging. This is one that gets sold to
Congress a lot, which I find -- it used to be -- anybody know the pressed potato chips, the Flangles
(phonetic) that people make? Pringles! That's it. That's it. Pringles.
So anyway, their production, what was happening is that they were flying off the conveyor belt as they were
being baked and everything. So what they did was they changed the shape to reduce the lift so they could
run it faster and actually produce more Pringles per minute off the assembly line. These are neat things,
but it's a computational fluid dynamics program, and they just had too much lift in the Pringles.
But the reason I mention it is not all of these computational things have to be, my gosh, kind of, you know,
big whiz-bang things saving people's lives, but every one has a little bit of interest.
Weather modelling and prediction is obviously one. And one that people didn't really recognize, and
actually I know Motorola spent tens of millions of dollars renting supercomputers, was to actually do cell
phone layouts of the initial networks. And, of course, there were different techniques for doing that and all
kinds of things. But it's a science unto itself. Trust me, when you have scattering of electromagnetic
radiation off of water towers and hills and iron deposits in those hills, things get interesting real quick.
I don't know how many people are aware that there's a whole set of experiments continually being done on
you every time you walk into a Costco or WalMart. They put stuff in certain places to see if they can -- I
don't want to sound too cynical -- make it easy for you to buy it, I guess is the right thing. And so they
actually talk about yield per square foot in certain parts of the store and everything else. I find it very, very
fascinating.
But, literally, when I was doing some of this phase two computer stuff, obviously DARPA is always
concerned about looking for commercial applications. And one set that I wasn't even aware of was that just
the realtime capture of information at all the WalMart stores or Walgreen stores or Costco stores and
figuring out and redistributing their inventories, that that's one set of problems.
Another set of problems is just how do I -- where is the right place to put some of these things? What do
people buy when they only buy three things and are only stopping for five minutes versus stopping for half
an hour and buying 10 things? There's all these kind of things. Obviously Pixar and stuff like that, the
entertainment motion picture industry.
One that I ran into last night which was a lot of fun, and I had to put that in, was swimsuit design. Did
anybody see NBC Evening News last night? So I used to be a swimmer in my high school life, and one of
the things that's very, very interesting is this year before the Olympics everybody goes into the trials, and
swimmers are usually the last ones to decide who's going to be on their Olympic swimming team kind of thing.
And it turns out that if you look at the number of world records broken it's, like, dramatic. And it turns out
that NASA has been working with the elite swimmers in the United States and they've designed suits that
actually reduce the drag so significantly that more world records have been broken in the last trials of the
last three weeks than has ever been seen in the sport before, and people are now wondering what's going
on kind of thing. And it's not broken by two hundredths of a second kind of broken. These world records
are being broken by a second and a half. And at 45-second --
>>: (Inaudible).
>> Larry Votta: My guess is they probably -- but I mean --
>>: (Inaudible).
>> Larry Votta: Speedos. Yeah, Speedo is the one.
But I only point it out because this is kind of a, hey, if you've got to swim fast away from the shark, it might
be a really useful thing to have a Speedo suit on.
In any case, so what is the productivity problem? Well, you know, this is a slide that was put up by the
DARPA program managers and everything, and I kind of -- it's visually stunning. It sort of captures some of
the information that, gee, today we have a hundred to a thousand processors and the future is 10 to 100K,
and it really does feel like -- if you've ever done any of these computations, it really does feel like -- I'm not
sure if a dragon is sitting over the cliff. I usually think more of a snake coming up to bite me kind of thing
and tripping me up whenever I thought I could do one of these things.
But in any case, the point of this picture is, this is one visualization that I think plays well to a non-technical
audience, but it really doesn't help you at all understand what the real problem is. We don't have dragons
sitting on the other side of precipices and we, as naive software engineers, are walking right off the bridge
right into the dragon's mouth kind of thing. That's not what it really is for computational scientists.
So that's why I put it up, by the way, just to be a little bit humorous. But let's go back.
So here's kind of an interesting, another approach to the productivity problem: a first-principles kind of thing. Being a physicist, I'm a first-principles kind of person. You can't take the physicist out of the person,
you know? I'm sorry.
The productivity is equal to the utility over the cost. This is the standard economic definition. And it's
conceptually a great idea: as value per unit cost increases, per capita wealth increases.
The only problem is that things always get renormalized, okay? Inflation. You know, 1971 dollars are a lot
different than 2008 dollars.
And definition of value. I mean, there are technologies that have grown up just in the last 50 years that
completely invalidate any of the value propositions of organizations and the way things were done in the
'50s and '60s.
So the bottom line is that it's always a continuously renormalizable kind of relative ratio measure. And what
it would mean for computational science is kind of a good question.
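Just to write down the definition I'm using -- this is only a restatement of that standard economic ratio, with the renormalization caveat made explicit:

    Productivity(t) = U(t) / C(t)

where U(t) is the utility or value delivered and C(t) is the cost, both taken at the same time t, because inflation and shifting definitions of value continuously renormalize both the numerator and the denominator.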
And so from these first two slides what I'm trying to motivate is, the first thing that really we have to do is we
really have to get underneath the hood and figure out -- push the metaphor of an automobile -- but to figure
out really what the problem is. And so we're going to do that a little bit and I'm going to sketch out how we
went through some of this.
Okay. So if you come at it from exactly a computational scientist's viewpoint, this is what is -- sorry about
that for people online there. I think I was moving my beard around -- the first is, if you look at the problem,
you have a performance challenge. Designing and building high-performance computers is a hard thing to
do. You have a programming challenge. You want to do rapid code development. If you're doing
application development like, for instance, either investigating new science or building new application
programs for doing new kinds of computations, you have to optimize these codes for good performance.
And then you have the prediction challenge, developing predictive codes with complex scientific models,
develop codes that are reliable and have predictive capability.
Now, the problem is, is that -- I can tell you right now that in 5ESS, any major release of 5ESS software,
over 50 percent of the effort was spent in verification and validation, all right, just to give you an idea. In
other words, did it work all the time, was the system doing what it was designed to do.
The computational science, depending on the domain I'm in, has, you know, different types of cost benefits.
For instance, one of the biggest growth industries, I think, besides entertainment, which I think is one where
high-performance computing is really getting into and really making headway, is what I call computational
medicine where actually some genomic marker information of what types of protein is being expressed
from your DNA is actually input into some set of computational models that allow you, for instance, then to
actually adjust treatment.
And you're starting to see some of this in some of the big -- and this is really interesting for people who
want to read about this, I'm not sure there's a book out about it yet, but the war on cancer has been a very
interesting kind of thing because there people wanted to cure cancer, and it turns out to be sort of similar to
a lot of other problems. Well, it's a great conceptual goal. The problem is we don't even know what cancer
is, so curing it is kind of a hard thing.
So President Nixon, in '71 was it, almost 40 years ago now, declared the war
on cancer. And it was a great war, but unfortunately we didn't even have the science there yet to
understand what was going on.
But bottom line there, verification and validation is going to be kind of important. And actually you'll see, we
actually went in and studied a computational biochemist to see how they would rewrite an Amber code. An Amber code is an electrostatic potential code of complex proteins. The bioactivity of a protein is dependent
on how it folds in a three-dimensional space when it's in equilibrium with the radiation field. And actually it's
not one shape but it's many shapes.
And the reality is, is you need to have -- you need to do an optimization problem because the principle of
least action sort of indicates that the shapes that it will fold into are the minimum electrostatic potential
configurations. And we were actually working with one of the top people writing that particular code.
But that's sort of the basic science that has to be done before you can start writing an application that might
simulate, and then, of course, eventually get to this vision of a computational medicine kind of thing.
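To make the optimization idea concrete, here is a toy sketch in Python of the kind of minimization being described -- this is not the Amber code or its force field, just a hypothetical bead-chain model of my own, where a "conformation" is scored by Coulomb repulsion plus harmonic bonds and we search for a minimum-energy shape; the charges, spring constant and use of scipy are illustrative assumptions only:

    # Toy illustration only: find a low-energy "fold" of a small charged bead chain.
    # Not the Amber force field -- just pairwise Coulomb repulsion plus harmonic bonds.
    import numpy as np
    from scipy.optimize import minimize

    N = 6                      # number of beads in the toy "protein"
    charges = np.ones(N)       # assumed unit charges
    k_bond, r0 = 10.0, 1.0     # assumed spring constant and rest length for adjacent beads

    def energy(x):
        pos = x.reshape(N, 3)
        e = 0.0
        for i in range(N):
            for j in range(i + 1, N):
                r = np.linalg.norm(pos[i] - pos[j]) + 1e-9
                e += charges[i] * charges[j] / r          # Coulomb-like repulsion
                if j == i + 1:
                    e += 0.5 * k_bond * (r - r0) ** 2     # bond keeps the chain connected
        return e

    x0 = np.random.default_rng(0).normal(size=3 * N)      # random starting conformation
    result = minimize(energy, x0, method="BFGS")          # search for a minimum-energy shape
    print("minimum energy found:", result.fun)

The real codes do this with vastly more sophisticated potentials and sampling, which is exactly why the verification and validation question looms so large.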
It also highlights the kind of problem and evolution you're looking at at any given time when you go in and look at the high-performance computing community: I don't think there's any problem that doesn't have multiple scales involved in both time and space, where things have to be approximated.
One of my collaborators, Doug Post, tells a story about being around, and he worked at Los Alamos and
Lawrence Livermore, and he was talking about they would have a computational science meeting once a
week, and they would be simulating how nuclear weapons worked and everything. This is back in the '50s,
and Edward Teller was one of the physicists who was doing that and was for many, many years the director, and Edward Teller has a lot of history in this whole arena.
But one of the interesting things that happens and that you see even in the '50s, and they realized this, is
that you couldn't do the computation exactly. And so what you were always doing was finding good
approximations.
And the story that's told by one of my friends is that he was a young graduate student up there for the first
time trying to defend this set of approximations, and Edward Teller is in the back of the room and says,
"Approximation has no physics. Throw it out of the code."
But the problem is, is it seemed to work because what they had done is they had experimentally identified
it, which brings about a lot of interesting questions about verification and validation.
So I talked about it from the scientist's point of view, but what does it really mean? I mean, I put up those
three things, and yeah, we have this continuously changing set of platforms, we have the continually
evolving and maintaining code base and everything.
The reality is, some of the symptoms are that you're seeing long and troubled software developments, you're seeing dysfunctional markets for support tools, you're seeing scientific results based on software that are sometimes not correct. In fact, some of these have been tremendously tragic, and gosh knows how
much it's really going to end up costing the U.S. economy at some point.
Scientific programmers have some interesting views: Computer scientists don't address our needs, there
isn't enough money. And really, as we understood and started studying this community, we understood a
couple of things. So we actually -- and you'll see my sidebar in a minute, which is kind of interesting how
we study this kind of thing. But what we ended up observing was a communication gap. And so let me talk
about the communication gap.
Yeah, Tom?
>>: (Inaudible)?
>> Larry Votta: Okay. So Tom's question is the scientific results based on software.
I'm on slide nine, so let me go and I'll show you exactly that. Okay?
>>: (Inaudible)?
>> Larry Votta: Yeah. The scientific results based on software is the problem of the verification and
validation. Okay? In other words, what happens, you have a code that tells you this, this, and this and you
go out in nature and that's the way this, this and this turns out to be so the code's correct. It then makes
another prediction that isn't correct. Okay?
However, what happens in scientific results based on software is that it not only becomes a tool for an
engineer to use, it also is a tool for scientists to use to figure out what the laws of nature are. Okay? And
sometimes you're wrong.
And a lot of times, like in experimental methodologies, we have hypothesis tests and we say, okay, we
have a confidence level and so we set up basically a statistical game and then we can probabilistically
reason about whether we can reject the null hypothesis and so on and so forth. Okay?
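As a minimal sketch of that statistical game -- with made-up numbers and an assumed 95 percent confidence level, not data from any real validation study -- the reasoning looks something like this in Python:

    # Hypothetical validation check: do simulated values differ from measured ones?
    from scipy import stats

    simulated = [101.2, 99.8, 100.5, 100.9, 99.6]   # made-up code outputs
    measured  = [100.1, 100.4, 99.9, 100.2, 100.0]  # made-up experimental values

    res = stats.ttest_ind(simulated, measured)       # two-sample t-test
    alpha = 0.05                                     # 95 percent confidence level
    if res.pvalue < alpha:
        print("reject the null hypothesis: simulation and experiment disagree")
    else:
        print("cannot reject the null hypothesis at this confidence level")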
>>: (Inaudible)?
>> Larry Votta: Okay. The symptoms of the productivity crisis are -- basically the worst case in productivity
is you get the wrong answer because no work actually got done and zero value got added, right, because
the numerator --
>>: (Inaudible)?
>> Larry Votta: The wrong results, correct. Incorrect result, right? Got it? Yeah. Sorry. That's a good
point. Yeah.
Okay. By the way, at the end of this talk people have access to the PDF online and here. I picked out two
cases of exactly that. One is a plasma physics result that was widely believed that stopped the United
States from entering into the ITER program, which is the international collaboration to build a
demonstration fusion toroid power generation device, and the second one was the Columbia Shuttle, which had a program called Crater, which looked at the impact of micrometeorites on the heat shield of the Space Shuttle wing leading edges and made certain predictions.
And, of course, we know that one was very tragic. Crater indicated that there wasn't going to be a problem for the Columbia to come back down. And so, consequently, nobody ever thought, well, what if
the computation is wrong, because the computation had always been correct up until that point. So it was
tragic at many different levels.
So let me talk about the communication gap. Two cultures separated by a common language. It's kind of
interesting. And in fact, one of my collaborators is a cultural anthropologist, and she loves to -- when she
gives this talk she does it much better than I do.
And what she talks about is that you have these people that speak the same language and can't figure out
why they aren't communicating. And that's exactly -- you know, the software engineer thinks they fixed the
problem already, the scientist just rolls their eyes in exasperation and tells them exactly what's wrong.
And so it's a conceptual disconnect even though the language is well understood by both parties. And it's
very fascinating. And it really gets to the point of exactly what you see when you see a meeting of software
engineers and computational scientists.
So scientific programming really -- what the software engineer really has to start thinking about a little bit is,
it's all about the science. That's what they think about. Or the engineering if you're talking about the
running of a crash code or a tidal surge program to see what's going to happen as a Category 4 hurricane
comes up the Mississippi River valley and so on and so forth.
Scientific and engineering codes are expensive. Codes live a long time, partially because the verification and validation of any code takes a very long time.
Performance really matters. Almost always, to get their value, those codes are pushed to the limit of the computational science.
For instance, current weather prediction models would love to be able to hit the tornados and other violent
storm events that occur in the western part of the United States. It turns out there just isn't enough
computational ability to give enough resolution to look for the formations of certain micro-storms. There
just is not.
And, trust me, it's been years of research and adaptive grids and adaptive time models, so on and so forth.
It's a real issue and will continue to be a real issue. It's also for high-energy physics. There's another
whole set of problems in that as well. Hardware platforms change often.
Think about the case where the hardware changes two or three times in the development cycle of your
software rather than the other way around. Okay? It gives a whole different set of project management
problems and portability problems and so on and so forth. It's all Fortran 77 and C++. That's what it is.
And this is partially because of the legacy codes.
And, in fact, this is very true of 5ESS and very true of any large legacy software system that lives a long
time. The tools get frozen in, so the ability to do abstraction and automation and any productivity that's
associated with building or modifying or evolving that software gets frozen in at the time those tools are
originally decided.
Because I basically ended up trying to put C++ in 5ESS, and I actually did some experiments on 5ESS,
and it was the most difficult thing because it was like first you had to get the runtime environments right,
then you had to get the debugging tools right, then you had to get the automated libraries that brought in
the pieces that it changed. I mean, it was just one thing after another, and you could spend years working
out all the problems.
Hardware costs dominate and portability really matters. If you're going through several generations of
hardware in the course of the development of the software application, portability is a big issue. Okay.
Yeah?
>>: So while I agree that the codes can live 10 to 15 years and so the hardware platform that you're
actually running on will change over that period, the class of the hardware doesn't really change. I mean,
we've had vector machines about 15 years and now we've had cluster machines about 15 years. So the
general class of the machine and, therefore, the way you have to program or write the programs, previously
it was vector-type loops and now it's all about communication and data distribution. That doesn't really
change. So, you know, I quibble there about saying it's really the platform changing. Because we've had
message passing on distributed memory codes for about 15 years now.
>> Larry Votta: Yeah. Yeah.
>>: And then the hardware costs dominating, I mean, sure, a certain computer can cost about a hundred
million dollars and it's bought maybe once every three or four years, but the programming costs at the
national labs, I mean, each of the budgets there is about a billion dollars per year, and there are about
three of them. So, you know, compare a $3 billion personnel cost per year versus a hundred million dollar
hardware cost every four years --
>> Larry Votta: So let me abstract two of the things because we could end up spending the rest of the time
discussing the two points you brought up.
The first point was that the style of supercomputer machines hasn't really changed, and actually their
architecture is relatively similar, and there are a lot of things around that that have allowed some amount of
abstraction and automation to be developed and, therefore, some productivity. And I agree with that. It
gets a bit more complicated because different applications do different things, and we can talk about that
offline.
The second comment has to do -- help me out. What was it?
>>: The hardware costs.
>> Larry Votta: The hardware costs dominate. It's a perfect observation and one that is very social
science and political oriented and not technologically oriented in the sense that the national budgets of the
national labs are very, very diffusely connected with the acquisitions of these systems and everything. In
other words, they all have to be approved by Congress, and there's influences there that don't allow for
direct feedback loops of improving productivity with the machine design.
And a perfect example is the new technologies that really need to be done on machines that are petascale
where you have a hundred thousand processors, you have to start talking about virtualization, runtime
virtualization, you have to start talking about -- the machine has to be able to keep track of itself in the
sense of the switching system, which could always tell if it was partially broken or not.
The technology has not come into normal computing play in general except very recently in certain types of
networking things and stuff. But those are very, very good comments.
Let me give you the idea -- first off, 35 years. We went in and studied several of the DOD/DOE benchmark
codes that are used for all kinds of things, modelling the deterioration of materials holding radioactive
waste, so on and so forth. And the bottom line is, is basically 10 years is about the development cycle of
these things.
And, in fact, right now this is the current projected profile of the Falcon, which is a new code -- I won't say
what it does because it's a DOD code -- but, in essence, this is its model, and this model is based on
historical, what they do with other codes.
So initial development typically takes five years. Serious testing by customers is another five years. You
then get into this maintenance and evolution kind of cycle from 10 to 25 years, so 15 years of real useful
stuff, and 25 to 35 years you get into a retirement phase. But, in essence, for 25 years you're getting useful
results out of these codes.
And so part of the problem is software engineers haven't seen this. Online transaction processing with
checks, so on and so forth, didn't last this long. It's been completely replaced with Java and other runtime
environments and transaction processing systems.
That's the problem. And why is that? Partially because for the scientists, the verification and validation
problem is a tough one, making sure that the code is really doing the right thing and doing it correctly.
And by the way, this is much more reminiscent of what 5ESS looks like, which started about 1978 and is
still being sold by Alcatel-Lucent today, and 1982 was the first ship, so we're at 2008, so 25,
26 years. So we're just about probably at this part. And 5ESS was probably the most successful digital
switch.
Anyway, let me continue on. The productivity gridlock then is a programming -- so we're still peeling away
this onion. We sort of see these things. Some of the assumptions don't match what software engineers
are doing.
But, in essence, what we actually ended up doing is going in, watching and observing scientists doing their
work and engineers actually using crash codes and other things to see what they were doing.
And the programming workflow has some bottlenecks in it. Developing the scientific programs, the serial
optimization and tuning, the code parallelization and optimization, and the porting. And these really turn
out to have two fundamental things. The manual program effort -- in other words, we still haven't gotten
enough automation here -- and the expertise.
And let me say a little bit more. Now, I want to talk about -- I'm going to come back to that in a minute, but I
want to talk about how we investigated this for the five years we were working on the problem.
What we did is embraced the broadest possible view of productivity, including all the human tasks, skills,
motivations, organizations and cultures -- therefore, your question about the national labs having a
$1 billion software engineering budget and only buying $100 million machines is a pretty valid model. The
problem is that it fits an organization in a set of cultures that don't create the feedback loops that are
needed for evolution of the general system -- put the investigation in the soundest possible scientific base,
both physical and social science methods, and this results in a three-stage research framework which we
found very useful.
So the stages are explore and discover, test and define, and evaluate and validate. And the goals are
develop hypotheses, test and refine models, replicate and validate findings.
And we use a series of methods: Qualitative, qualitative and quantitative, and quantitative. And so we go
the whole spectrum, everywhere from grounded theory all the way to quasi-experimental designs where we
tried to go through more of a biological metaphor of in vitro/in vivo, like what happens in software
engineering now. If you're a tool builder, you have a hypothesis that the abstraction offered by the tool and the automation offered by the execution of that tool are going to improve productivity. You don't first go and try to sell, let's say, the Office team on using this tool and convince their management to switch over to it in one day. What you do is you actually work on the tool and you
develop in vitro -- in other words, outside of the main development lines and everything else -- the fact that
it works and everything, and then eventually you have an adoption strategy and you move it in in a rational
way.
And, in essence, that's sort of a biological metaphor. You start with an in vitro version and you move to in
vivo, and you try to keep it quantitative, and sometimes you use qualitative methods because your
experimental designs assume certain things, and sometimes those assumptions are wrong and then you
have to go back. So this isn't a straight-line process as much as it's often iterative in its nature. And I
already mentioned that.
So let me go through this a little bit. So what we -- there are two major workflows that you see in scientific
computing that are troublesome. The first one is the scientists or set of scientists building the original code
and doing the science programming or investigating the science. So they're in this continual very rapid
evolution kind of thing of their application.
The other phase is where you end up having a code, like a crash code, and now you're just going to crash
the automobile, but now you want to do different automobiles and now you want to do different crash
situations, and so consequently the basic finite element model and metal deformations and all that stuff is
pretty much done. What you really need to do is just change the databases and everything.
So one is like developing the codes and the other is really running the codes, let me just say that. And we
tried to capture this a little bit. But this is the general model, meaning it also includes developing the code
originally.
And what you discover is that not only do you need the main scientists, you also need people who really
know how to write software, and then you need people who really know how to manage and maintain and
evolve software, and then you need to have people who know how to take an application and actually tune
it. There's a lot of different very highly specialized skills here.
So there are four distinct skill sets. The main science, the scientific computing -- just mapping it onto a
cluster and understanding the reliability performance characteristics of a cluster and how you might want to
organize your application on them is kind of an interesting set of challenges -- the scaling is also another
piece of how do I get that, and then the management.
So the skills are useful when they are synchronized through communication, collaboration, or exist in one
person. And I love this. What you observe, of course, is the teams that have good communication
protocols and everything and work together as a team can do this a lot better than a group that are very
disjointed and so on and so forth. And it even works best when you put the skills all in one person because
then we only have to assume that the person is sane and -- well, we made an assumption there.
But the reality is, is that what you find out is two skill sets are rare. Think of somebody who is at the lead in
scientific computing and also in the domain science at the same time and writing in both literature bodies.
Just go and look at people who are there, okay? Three skill sets are very rare, and four is like a moon
rock. Okay?
So that, in essence, is the problem. And the reality is, is if you think about the social institutional evolution
of the labs, the DOD labs, the Department of Defense labs, the Department of Energy labs and so on and
so forth, it's not one that really lends itself to great teamwork kinds of things, although that's what's been
asked of a lot of them. Okay?
So if you had to say, then, in a nutshell, what is the productivity gridlock, the productivity gridlock really
comes about because of the need for very, very highly specialized sets of skills being very, very tightly
coordinated and orchestrated to maintain and evolve either the software or to apply it to different situations
and run it for different cases.
And almost always, because these machines, the clusters that we talked about, the message passing, so
much of that stuff depends on the nature of the geometries of problems, for instance, in finite element
modeling and so on and so forth, that it's not just a simple reapplication, it's almost sometimes the whole
retuning of the application has to happen.
If you're going to change, for instance, the way the car crashes, you might want to take -- instead of doing car
crashes, think about just crashing 24-foot motorboats. Same kind of poles and, you know, more water in
there, but you can start seeing it's a complete redo of everything because of a lot of different pieces of the
application, although a lot of the finite elements and the basic science is all the same.
So I'm going to talk about -- so one other thing that we discovered is really kind of tragic, and what's going on is the scientists have tool complaints. And the reality is that the productivity in being able to either evolve and maintain an application or to run it in another situation, those abstractions and automations, which are sort of the bread and butter of software engineering, are, in essence, encapsulated in tools. And what we have in the scientific community is tools that are hard to learn, that don't scale, that differ across platforms, that are poorly supported and too expensive.
And, in essence, what we're looking at is probably to a certain extent the scientists not understanding, nor the phenomenon being generally understood, that once you use those tools to develop your initial application, they become frozen in, and the productivity proposition of those tools actually is frozen into that application forever, basically. And then you have to develop a business model that maintains those tools for 35 years. That's how long these codes live. Or at least have a maintenance and evolution strategy for them.
And what you see -- where does this come and get you? So what the software engineer doesn't see,
perhaps, is that in some fundamental sense the 35 years is a big deal. Okay? And that general computing
IDEs have different assumptions. The lifetime of the code slows the evolution of tools. The field is small and
specialized, so from the current business models that's a really bad thing because there's no scale
involved, so consequently these things are very, very expensive to build, and investment in tools is
insufficient.
Another important thing is -- this is really true of when we were studying some clusters and people were
using different Intel processors and then they ran over and were using AMD processors, and there were a
set of compilers that were being made available by Intel Research that were doing some very, very nice,
elegant kinds of parallelization detection and so on and so forth in old Fortran 77 codes and C++ codes.
And they actually got removed from the market because they didn't -- they were actually being maintained
by Intel Research, and they decided they weren't going to go that way architecturally so they just wanted to
take them off.
So sometimes even these things for business reasons get removed. And since they weren't part of the
open group community, it caused a really great problem in how clusters were being used for certain types
of computational science computations.
So this is revisiting that early point I made about the science coming out wrong because the computation
really wasn't valid. And, really, it comes down to -- here's what the issue really is, and scientists have to
think about this a lot harder. Trust in the validity of the computational outcomes is a key productivity issue.
Because if it's wrong, you've got no productivity. Okay? The value is zero. Okay? And you could have
spent whatever you spent.
How do scientists build confidence in their codes? They look right. These are real comments when you go and ask the scientists about it. Four or five years to get some confidence in the code. In other words, they run it on a lot of different problems they can figure out and everything, and then, you know, you sort of -- what you find out, of course, is that the domain scientists, the really good domain scientists, have a set of
ideas about the verification and validation sequence because they understand the weaknesses of
computational strategies and so they even have some ideas about how to do this. But no way is that
codified or shared in the community. So not even common knowledge like that gets shared.
Oh, and this comment was really great. It's sort of an inverse flip. We were talking to some people about,
well, what happened when you abandoned this code and everything? One guy was -- it was actually a
woman scientist. She was very upset because she had just finally gotten to understand and have some
confidence in the code and they decided not to fund it anymore. And now she had to start out at the
beginning again, which was pretty -- it was very hard.
So the point here is, the problem is that you always have to be careful about this just like you do in any
experimental methodology. But it's an area where I think scientists have to innovate and manage threats to validity, meaning threats to the interpretation of a particular experiment. They just have to get there, and work's needed here.
So breaking the gridlock in software engineering. So the answer is, is it's not like software engineering can
say, here, take two of these and come back and see me in a week. Okay? The reality is, is that the two
communities have to work together, and they have to sort of close that gap in communications that has
grown up between the two of them.
And in some sense software engineering has a general paradigm of automation, abstraction and
measurement to introduce, and the scientific programming community has to start thinking about
investment and modernization. And there are some common things that they have to work on together, like
how best to use computers to help with the verification and validation program.
Let me give you a little anecdotal story about how 5ESS used to do it. We built computers to generate load
on a 5ESS machine. We had switches that basically did nothing but make telephone calls on a 5ESS
machine so we could test it.
Well, you know, that's one way of doing it. Not the only way of doing it. But those are the kinds of things
that you have to start thinking about because, after all, there is no productivity proposition if the end results
are incorrect.
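In the same spirit, here is a minimal sketch of what an automated verification check on a numerical kernel can look like -- a generic illustration of mine, not anything 5ESS or any particular lab actually ran -- where a known analytic answer guards against a code change silently breaking the kernel:

    # Hypothetical automated verification check: trapezoidal integration of sin(x)
    # on [0, pi] has the known analytic answer 2, so any code change that breaks
    # the kernel shows up as a failed assertion in the regression suite.
    import math

    def trapezoid(f, a, b, n):
        h = (b - a) / n
        total = 0.5 * (f(a) + f(b))
        for i in range(1, n):
            total += f(a + i * h)
        return total * h

    def test_integration_kernel():
        numeric = trapezoid(math.sin, 0.0, math.pi, 10_000)
        assert abs(numeric - 2.0) < 1e-6, f"kernel regression: got {numeric}"

    if __name__ == "__main__":
        test_integration_kernel()
        print("verification check passed")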
So let me talk about some empirical studies. I like doing empirical studies. I am an empiricist at heart. I
put up these two myths, and I wanted to be a little bit controversial, and I'll tell you what the experiments
were and so on and so forth.
So what we did was in the computational science arena we actually identified some domain scientists who were building computational codes for a plasma fusion machine, for toroids, and in fact we have -- myself being a high-energy physicist, and Eugene Loh, who is one of the other collaborators here. Eugene has his Ph.D. in Computational Physics from Caltech. And so Eugene actually went and became an apprentice to the person at the Princeton Plasma Physics Laboratory who wrote the code on -- and I'm not even sure. I'm going to say gyrokinetic whatever. But anyway, it models the energy loss mechanism in a tokamak.
And so what Eugene undertook to do after he did the NAS Parallel Benchmarks is he wanted to go and see if he could rewrite the code to be more expressive and much more productive. And so we actually went in and started doing some of this. We also had another scientist who was doing the electrostatic potential codes. The net-out was that we were able to improve productivity by about a factor of 10, and I'll explain
that experiment in a little bit.
Another computer science myth is the prescription of implementing a serial version first, then parallelizing for multicore and clusters. Experiments with programming teams indicate this is the wrong strategy. Suboptimal
solutions are achieved. Walter Tichy at the University of Karlsruhe has actually started doing experiments
on this. And at the International Conference on Software Engineering there was a workshop before the
conference on multicore systems, and Walter showed me some of the results of some of the initial
experiments, and it certainly indicates that the teams that achieved the best parallelization -- that got the best running code with the fewest defects, and so on and so forth -- were the ones who tried directly to go to the parallel implementation.
Apparently the serialized version creates certain barriers because now it works and it works good enough
so you don't actually go back and redo certain things. So you make these engineering trade-offs, so you
end up with a mixture of serial and parallel. And it's kind of fascinating, but it's not -- it's past the scope of
this, but it's an interesting kind of investigation that software engineering can do to get to the heart of this,
which is actually one that I think is kind of fun.
So what we did was look at the NAS Parallel Benchmarks and port the Fortran 77 version to a more modern Fortran 90 version. We did these four things. The result: We reduced the source code by about a factor of 10, and because we reduced the source code by a factor of 10 we inferred that we reduced the cost by about a factor of 10, since any of the COCOMO cost models or anything for software development are all basically linearly related to the size of the software at these scales.
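As a back-of-the-envelope check on that inference -- assuming the textbook basic COCOMO organic-mode coefficients, which are my assumption here rather than numbers from our study -- the effort scaling works out as follows:

    E = 2.4 (KLOC)^{1.05} person-months
    E_{F77} / E_{F90} = (KLOC_{F77} / KLOC_{F90})^{1.05} = 10^{1.05} ≈ 11

so a tenfold reduction in source size maps to roughly a tenfold reduction in estimated effort.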
And something else surprising happened. Here's one of the NAS Parallel Benchmarks. Now, the NAS
Parallel Benchmarks takes some of the more successful computational fluid dynamics codes or other
kernels of the application and uses them to benchmark different architectures, and so on and so forth, and
they're all written in Fortran 77. And then they also have a specification, and they talk about some type of
iterative methods to solve things using multicomputers, and they have MPI implementations and so on and
so forth. But what scientists have been trying to do is, obviously, even keep that at a low level and just give
you a cleaner representation.
Here's a Fortran 90 high productivity version of this. The reality is, is that -- the point being is that Fortran
77 -- and you saw earlier that if you go and look at the NAS parallel benchmarks for Fortran 77, you see a
lot of configuration testing, machine testing. There are a lot of non-portable elements in it that have been
sort of (inaudible) out by testing certain machine configurations.
But the bottom line is, is that this is much harder to understand and match to this specification than
something like this. And the reality is, is that what software engineering knew already was that
representation is a big deal. And not retranslating the specifications over and over again is another big
deal. In other words, matching the actual representation that gets compiled to the formal representation.
And this is the sort of work that Parnas has done over many years, and John Gannon I think was
the one who did the thing just about simple comments in the mid '70s.
But the bottom line is that you can see that not only is the code much more compact and clear but that at
some point I don't have to worry about the verification and validation problem. I have to prove that my
compilers do the right things, but I don't have
to show -- but just by general inspection I can reach the conclusion that I've implemented what the
specification asked.
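If it helps to see the flavor of that in a language this audience writes every day, here is a loose analogy in Python/NumPy rather than Fortran -- entirely my own illustrative example, not the NAS benchmark code -- where the explicit-loop version plays the role of the Fortran 77 style and the one-line array expression plays the role of the Fortran 90 style, both computing the same residual r = b - A x:

    # Illustrative analogy only: loop-heavy style versus array-expression style.
    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.normal(size=(200, 200))
    x = rng.normal(size=200)
    b = rng.normal(size=200)

    # "Fortran 77 flavored": explicit indexing, easy to get subtly wrong.
    r_loop = np.empty_like(b)
    for i in range(len(b)):
        s = 0.0
        for j in range(len(x)):
            s += A[i, j] * x[j]
        r_loop[i] = b[i] - s

    # "Fortran 90 flavored": the code reads like the specification.
    r_array = b - A @ x

    print("max difference:", np.max(np.abs(r_loop - r_array)))

The point is not the language; it's that the second form stays close enough to the mathematical specification that much of the inspection burden goes away.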
So concluding remarks. Well, of course, yes is the answer, and computational scientists need to realize
that their greatest strengths may be their greatest barriers to improvement, and software engineers need to
rethink some of their solutions, and the two communities have to work together. And that's really the
bottom line of the talk.
And hopefully -- I wanted to mention some of my collaborators. The particular coauthors on this paper, which was written for the IEEE Software magazine that was just out this month, I think -- and, of course, we did get it accepted -- were Stuart Faulk and Susan Squires and Michael Van De Vanter. Susan, Michael and I formed the productivity team at Sun, and then Tom Nash is a physicist, Doug Post is a plasma physicist, Walter is at the University of Karlsruhe, and Eugene Loh, Declan Murphy, Chris Vick and Allan Wood are at Sun Microsystems still, and all were involved in the productivity work and some of this.
You can actually go and look up a lot of these. A lot of individual results are written up, and I would prefer,
rather than deep-diving into one, I just wanted to sort of give an overview, because to me there are two important elements to this research: the composition of the studies to see the whole and answer a big, significant question, and the second is how you do little studies in that framework to actually help yourself and make sure
your understanding is correct. So the bibliography talks about some of those important studies and case
studies we did, the different communities that get involved and this kind of stuff. I think I gave a pointer to
Walter's paper on multicore programming if people are interested in looking at that. And that's actually a
very new result. That's a result this year. But it is something that occurred to us to ask the question a few
years ago: Is that the right prescription, do the serial version and then rewrite it for parallel. And it doesn't
look like it is the best solution to the problem.
Let me see. Was there anything else I have? No, that's it.
Questions?
(Applause)