>> Andrew Begel: All right, everybody, I'd like to welcome Larry Votta to Microsoft to give a talk today about productivity in scientific computing. Larry is a consultant for software fault tolerance and productivity. He has a bachelor's from University of Maryland, College Park, and a Ph.D. from MIT in physics. And Larry worked for Bell Labs for about 20 years, 10 years working on realtime software development and about 10 years working on software engineering research. He worked in Peter Weinberger's (phonetic) lab there.
And then he retired in 1999 and then went to work for Sun Microsystems working on high-performance
supercomputing systems and worked on one of Sun's entries to the DARPA supercomputing project in the
last couple years.
He's authored more than 60 papers and several chapters in books on software engineering. So I'd like you
to welcome Larry Votta.
>> Larry Votta: Thank you, Andy.
Okay. A little bit of historical perspective here. Back in about '99 the Japanese brought on line what was
called the Earth Simulator as a significant technological event to highlight their maturity in being able to
build supercomputers and also to highlight a little bit the second millennium coming of age about being able
to model the climate changes that are suspected to be induced by human activity, and it was called the
Earth Simulator.
And it sent a shockwave actually through the Department of Defense and through the weapons industry in
the United States because of the fact that it claimed the lead in supercomputers.
And that charged off several people in the DARPA, which is the DOD's research programs community, to
start thinking about what was going on and what had happened. And this also followed -- this history may not be too familiar to people, so I'm going to get into a little bit of a historical thing and then I'll get into the remarks
about my talk, which is understanding the productivity gridlock in scientific computing.
The '90s were very, very tumultuous for high-performance computing. People were building a lot of
clusters and having a lot of trouble programming them. There were some major failures of programs.
The physicist community, for instance, undertook a tremendously ambitious program to be able to fully
simulate nuclear weapons, and that was called the ASCI program. Maybe a lot of people didn't see it, but it
did affect everybody because it was part of the ending of the Cold War, in the sense that the ban on nuclear tests, the Nuclear Test Ban Treaty, started in '91. And the reason the U.S. even signed up to it was because physicists said they didn't need to blow any more up, they could basically blow them up in computers.
And, of course, that was a bit of wishful thinking, and the '90s came and went and found us still struggling computationally to even be able to reasonably simulate tokamak fusion machines and things that would even have some societal value, although ending the Cold War certainly had a lot of value to society.
In any case, the DOD, going back to these three people, Bob Graybill, Bill Carlson and John Grosh, sent a letter to Congress basically out of the Department of Defense saying we're in a lot of trouble, actually. Bob Graybill was a program manager, John Grosh was DOD, one of the computing resource people, and Bill Carlson was a three-letter agency top computer architect.
And they said, basically, we have a crisis. We have a crisis because they saw all these programs failing.
We'll actually see where some or all of this is coming from.
And so they basically convinced Congress to actually fund what is called the High Productivity Computing Systems program, which was supposed to be the next generation supercomputer system that would in essence create a leapfrog in the ability of scientists and engineers to use the machine to accomplish work or add value. That's called High Productivity Computing Systems, or HPCS.
The program is in phase three now. There are two vendors doing it, Cray and IBM. Sun was in part of phase two. There were six that were down-selected to three in phase two and then two in phase three. Phase one was from 2001 to 2003, phase two from 2003 to 2006, and phase three from 2006 to 2011.
And those are rough dates, by the way. That's not exactly -- nothing ever falls exactly on December 31st or January 1st, whichever. Depends which side of the interval you want to go on.
Anyway, needless to say, Sun did not make phase three, but I think we learned a lot. And I was very fortunate to be one of the principal investigators in that program. And I also had the advantage that I was one of the few vendor lead PIs who was also a computational scientist in a former life. I actually wrote
one of the first computational fluid dynamic programs at MIT as a high-energy physicist.
Actually, I was modelling a neutrino detector, which I really wanted to build, and at the time, in 1971, it seemed to me to be a very modest price of $10 million. And it would have been stuck out at Fermilab
about two miles from the accelerator because you need a lot of earth to remove everything except the
neutrinos out of the beam. And, of course, the beam then goes through west Chicago and a few other
places, but neutrinos go through everything so we don't have any problem with that.
Anyway, so I had a lot of fun doing that. I also ended up doing some other computational theoretical work.
So it was a very, very unique position for me and actually seduced me to stay at Sun Microsystems to
actually go do this.
But in the end where I came out was that people talked about a productivity crisis, they talked about -- productivity gridlock is my word for it. And it's because, like any interdependent system of workflow, anybody who's studied project management for anything of significance, like any of the major operating systems or Office or the 5ESS switching software, knows that any of those things is a huge set of interdependent relationships among people and machines doing things.
And anybody who understands those things understands that bottlenecks do occur and gridlock does
happen, even when everyone and all entities involved have the right priorities and everything. And so in
some sense this is a little bit of that story as well.
I will also say this is kind of an interesting story for Microsoft as it was for Lucent Technologies and AT&T
Bell Telephone Laboratories and Bell Labs in that I think one of the most interesting problems to
understand as we move forward, because of where the wealth in our society has been captured and how it
is actually provided to all members of society, is how to maintain and evolve large legacy software systems.
And so part of this story is about that as well, because what you're going to see is the value of these things
is really in the fact that I can computationally do things rather than build them. And what I need to be able
to do then is say with some certainty that the computational results represent what would happen in the
real world.
So in any case, let me go through my remarks and then we can chat about a lot of different things. And
Andy was very, very kind in actually setting me up with a calendar for today, and hopefully I'll get a chance
to talk with people later on, and it should be a lot of fun.
So I'm always a believer that the answer is yes. Well, of course it's yes. The computational scientists need
to realize that their greatest strengths may be their greatest barriers to improvement. They were able to
take some of the earliest machines and come up with some very, very great results. And, of course, they
did some extraordinary things to be able to get enough computation there in the first place.
I'll tell you one little story about this. Remember I told you about that simulation I had written. And so I was
working for a high energy physics group called the Kendall, Rosensen & Friedman Group, and Henry
Kendall and Jerry Friedman are Nobel Prize winners in physics.
And I asked Henry Kendall, who had started his career as a computational high energy physicist, what did
he do. And this is, of course, working back on one of the first machines. He wouldn't tell me what it was
because he wasn't sure it was declassified yet. But he had talked to von Neumann and everything and had
actually been working as a graduate student with one of the people at Los Alamos and things. And he
eventually admitted to me that he had numerically inverted a five-by-five matrix to do some (inaudible)
theory calculation. This is, of course, 1952.
But it gives you an idea, that's like -- I mean, you could have done that on a VAX 11/780 with your eyes closed in 1980, or probably even with a Radio Shack Trash-80 if you just made sure that you had the floating point arithmetic correct on that. I never remember if I had the right floating point arithmetic or not.
But in any case, the point is that what computational scientists have done as they've grown up with the computer industry is they've actually worked on bare metal a lot of times, and they've actually done more tricks and more things than you would ever, as a software engineer, even want to look at.
On the other hand, software engineers need to realize that past successes solved similar but not identical problems to those confronting computational scientists, and therefore certain solutions may need to be reevaluated and reformulated. I mean, that seems to be relative. The cost benefits are different in the computational world than they are in the business world and in the automation of supply chains, marketing, all the management of information that we use to actually make our economy more efficient. And that's a different problem than what's confronting the computational scientists or design engineers as they use their tools.
Okay. So I'm going to talk about motivation and the productivity problem -- this is really kind of
fascinating -- what is it, the communication gap between scientific programmers and software engineers.
I'm going to give a little methodological sidebar, and I actually stuck this in because Andy and I have been
talking over the last six weeks or so.
And one of the contributions, I think, of our group to the HPCS community has been to bring about a
credibility in our studies of how do we really study computational scientists and how do we extract what
they're doing in a way that allows us to be scientists in a qualitative and quantitative sense. How do you
actually manage that information and, in essence, reduce it down to some set of models that are predictive
and allow you to understand what you're doing in design and in the building of these systems.
I'm going to talk a little bit about the expertise gap -- I'll actually see if I can manage to burn out somebody's retina -- the expertise gap. And one of the most significant things -- this turns out to be the subject of a 2005
Physics Today article I wrote with Doug Post, who was one of the computational physicists that I met doing
this DARPA High Productivity Computing System program.
The problem is, is that it's exactly like what it was in 5ESS after I got through all the terminology and
everything else, which is every time you make a change in a computational code, you don't know if you
broke something. Put a different way, how do I trust the results?
And anybody who has ever worked on change management and configuration management systems -- and I know several people at Microsoft Research spent some time at Bell Labs and everything else, and we had tremendous configuration management systems and we could tell what was changed and so on and so forth.
The reality, though, is you don't know if you ever broke it once you've made a change, and that's actually a
significant problem.
So how about breaking the productivity gridlock. One of the characteristics of the research I like to do is I
like to have empiricism embedded in it. So not only do you do observational studies to see what is
happening, and experiments to identify mechanisms so you could build theories and therefore end up
predicting, I also like to do experimental studies to see if the ideas that I have or that my team has actually
work in general, in vitro initially -- in other words, in the laboratory does it really work -- and then in vivo, in
life, meaning put these back into a team of physicists doing the next generation of weather forecasting
models or whatever and seeing if it really does improve their productivity and stuff.
And, finally, some concluding remarks and then I'm going to mention my collaborators, because there are
many and they're fun. And then for those insomniacs in the audience, I have a bibliography, and they can
certainly go at it. I'm only going to go highlighting through some of these, and feel free to ask me questions
at any time either live or in etherspace there and we'll try to address them in realtime.
So I don't think I need in this audience to talk about motivation for why high-performance computing is a big
deal. I will recognize that some of the things that I mention to people in the general population, and they
don't even think about it until you talk about it, is if you looked at the number of people on our highways and
the number of traffic deaths, it's going down. And a lot of it's because of social policies around dealing with
people driving under the influence of alcohol. Others are better safety laws, and so on and so forth, and
methods to operate a car.
But another whole set of things is that, to be perfectly honest, crashes still happen and not as many people
die, and auto crash models are a big part of that. And we don't have to crash them into walls anymore.
Every automobile manufacturer now can sit there and model and run through 10,000 crash sequences of
different configurations and different intersections under different weather conditions and get some idea
about how to tune and move safety features around in a car to actually provide the highest probability of
survival.
Another one which I found very interesting is Procter & Gamble packaging. This is one that gets sold to
Congress a lot, which I find -- it used to be -- anybody know the pressed potato chips, the Flangles
(phonetic) that people make? Pringles! That's it. That's it. Pringles.
So anyway, their production, what was happening is that they were flying off the conveyor belt as they were
being baked and everything. So what they did was they changed the shape to reduce the lift so they could
run it faster and actually produce more Pringles per minute off the assembly line. These are neat things,
but it's a computational fluid dynamics program, and they just had too much lift in the Pringles.
But the reason I mention it is not all of these computational things have to be, my gosh, kind of, you know,
big whiz-bang things saving people's lives, but every one has a little bit of interest.
Weather modelling and prediction is obviously one. And one that people didn't really recognize, and
actually I know Motorola spent tens of millions of dollars renting supercomputers, was to actually do cell
phone layouts of the initial networks. And, of course, there were different techniques for doing that and all
kinds of things. But it's a science unto itself. Trust me, when you have scattering of electromagnetic
radiation off of water towers and hills and iron deposits in those hills, things get interesting real quick.
I don't know how many people are aware that there's a whole set of experiments continually being done on
you every time you walk into a Costco or WalMart. They put stuff in certain places to see if they can -- I
don't want to sound too cynical -- make it easy for you to buy it, I guess is the right thing. And so they
actually talk about yield per square foot in certain parts of the store and everything else. I find it very, very
fascinating.
But, literally, when I was doing some of this phase two computer stuff, obviously DARPA is always
concerned about looking for commercial applications. And one set that I wasn't even aware of was that just
the realtime capture of information at all the WalMart stores or Walgreen stores or Costco stores and
figuring out and redistributing their inventories, that that's one set of problems.
Another set of problems is just how do I -- where is the right place to put some of these things? What do
people buy when they only buy three things and are only stopping for five minutes versus stopping for half
an hour and buying 10 things? There's all these kind of things. Obviously Pixar and stuff like that, the
entertainment motion picture industry.
One that I ran into last night which was a lot of fun, and I had to put that in, was swimsuit design. Did
anybody see NBC Evening News last night? So I used to be a swimmer in my high school life, and one of
the things that's very, very interesting is this year before the Olympics everybody goes into the trials, and
swimmers are usually the last ones to decide who's going to be on their Olympic swimming team kind of thing.
And it turns out that if you look at the number of world records broken it's, like, dramatic. And it turns out
that NASA has been working with the elite swimmers in the United States and they've designed suits that
actually reduce the drag so significantly that more world records have been broken in the last trials of the
last three weeks than has ever been seen in the sport before, and people are now wondering what's going
on kind of thing. And it's not broken by two hundredths of a second kind of broken. These world records
are being broken by a second and a half. And at 45-second --
>>: (Inaudible).
>> Larry Votta: My guess is they probably -- but I mean --
>>: (Inaudible).
>> Larry Votta: Speedos. Yeah, Speedo is the one.
But I only point it out because this is kind of a, hey, if you've got to swim fast away from the shark, it might
be a really useful thing to have a Speedo suit on.
In any case, so what is the productivity problem? Well, you know, this is a slide that was put up by the
DARPA program managers and everything, and I kind of -- it's visually stunning. It sort of captures some of
the information that, gee, today we have a hundred to a thousand processors and the future is 10 to 100K,
and it really does feel like -- if you've ever done any of these computations, it really does feel like -- I'm not
sure if a dragon is sitting over the cliff. I usually think more of a snake coming up to bite me kind of thing
and tripping me up whenever I thought I could do one of these things.
But in any case, the point of this picture is, this is one visualization that I think plays well to a non-technical
audience, but it really doesn't help you at all understand what the real problem is. We don't have dragons
sitting on the other side of precipices and we, as naive software engineers, are walking right off the bridge
right into the dragon's mouth kind of thing. That's not what it really is for computational scientists.
So that's why I put it up, by the way, just to be a little bit humorous. But let's go back.
So here's kind of an interesting, another approach to the productivity problem: a first-principles kind of thing. Being a physicist, I'm a first-principles kind of person. You can't take the physicist out of the person,
you know? I'm sorry.
The productivity is equal to the utility over the cost. This is the standard economic definition. And it's
conceptually a great idea: as value per unit cost increases, per capita wealth increases.
The only problem is that things always get renormalized, okay? Inflation. You know, 1971 dollars are a lot
different than 2008 dollars.
And definition of value. I mean, there are technologies that have grown up just in the last 50 years that
completely invalidate any of the value propositions of organizations and the way things were done in the
'50s and '60s.
So the bottom line is that it's always a continuously renormalizable kind of relative ratio measure. And what
it would mean for computational science is kind of a good question.
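Just to write down the definition I'm using -- this is only a restatement of that standard economic ratio, with the renormalization caveat made explicit:

    Productivity(t) = U(t) / C(t)

where U(t) is the utility or value delivered and C(t) is the cost, both taken at the same time t, because inflation and shifting definitions of value continuously renormalize both the numerator and the denominator.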
And so from these first two slides what I'm trying to motivate is, the first thing that really we have to do is we
really have to get underneath the hood and figure out -- push the metaphor of an automobile -- but to figure
out really what the problem is. And so we're going to do that a little bit and I'm going to sketch out how we
went through some of this.
Okay. So if you come at it from exactly a computational scientist's viewpoint, this is what is -- sorry about
that for people online there. I think I was moving my beard around -- the first is, if you look at the problem,
you have a performance challenge. Designing and building high-performance computers is a hard thing to
do. You have a programming challenge. You want to do rapid code development. If you're doing
application development like, for instance, either investigating new science or building new application
programs for doing new kinds of computations, you have to optimize these codes for good performance.
And then you have the prediction challenge, developing predictive codes with complex scientific models,
develop codes that are reliable and have predictive capability.
Now, the problem is, is that -- I can tell you right now that in 5ESS, any major release of 5ESS software,
over 50 percent of the effort was spent in verification and validation, all right, just to give you an idea. In
other words, did it work all the time, was the system doing what it was designed to do.
The computational science, depending on the domain I'm in, has, you know, different types of cost benefits.
For instance, one of the biggest growth industries, I think, besides entertainment, which I think is one where
high-performance computing is really getting into and really making headway, is what I call computational
medicine where actually some genomic marker information of what types of protein is being expressed
from your DNA is actually input into some set of computational models that allow you, for instance, then to
actually adjust treatment.
And you're starting to see some of this in some of the big -- and this is really interesting for people who
want to read about this, I'm not sure there's a book out about it yet, but the war on cancer has been a very
interesting kind of thing because there people wanted to cure cancer, and it turns out to be sort of similar to
a lot of other problems. Well, it's a great conceptual goal. The problem is we don't even know what cancer
is, so curing it is kind of a hard thing.
So President Nixon, in '71 was it, almost 40 years ago now, declared the war
on cancer. And it was a great war, but unfortunately we didn't even have the science there yet to
understand what was going on.
But bottom line there, verification and validation is going to be kind of important. And actually you'll see, we
actually went in and studied a computational biochemist to see how they would rewrite an Amber code. An Amber code is an electrostatic potential code of complex proteins. The bioactivity of a protein is dependent
on how it folds in a three-dimensional space when it's in equilibrium with the radiation field. And actually it's
not one shape but it's many shapes.
And the reality is, is you need to have -- you need to do an optimization problem because the principle of
least action sort of indicates that the shapes that it will fold into are the minimum electrostatic potential
configurations. And we were actually working with one of the top people writing that particular code.
But that's sort of the basic science that has to be done before you can start writing an application that might
simulate, and then, of course, eventually get to this vision of a computational medicine kind of thing.
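To make the optimization idea concrete, here is a toy sketch in Python of the kind of minimization being described -- this is not the Amber code or its force field, just a hypothetical bead-chain model of my own, where a "conformation" is scored by Coulomb repulsion plus harmonic bonds and we search for a minimum-energy shape; the charges, spring constant and use of scipy are illustrative assumptions only:

    # Toy illustration only: find a low-energy "fold" of a small charged bead chain.
    # Not the Amber force field -- just pairwise Coulomb repulsion plus harmonic bonds.
    import numpy as np
    from scipy.optimize import minimize

    N = 6                      # number of beads in the toy "protein"
    charges = np.ones(N)       # assumed unit charges
    k_bond, r0 = 10.0, 1.0     # assumed spring constant and rest length for adjacent beads

    def energy(x):
        pos = x.reshape(N, 3)
        e = 0.0
        for i in range(N):
            for j in range(i + 1, N):
                r = np.linalg.norm(pos[i] - pos[j]) + 1e-9
                e += charges[i] * charges[j] / r          # Coulomb-like repulsion
                if j == i + 1:
                    e += 0.5 * k_bond * (r - r0) ** 2     # bond keeps the chain connected
        return e

    x0 = np.random.default_rng(0).normal(size=3 * N)      # random starting conformation
    result = minimize(energy, x0, method="BFGS")          # search for a minimum-energy shape
    print("minimum energy found:", result.fun)

The real codes do this with vastly more sophisticated potentials and sampling, which is exactly why the verification and validation question looms so large.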
It also highlights the kind of problem and evolution you're looking at at any given time when you go in and look at the high-performance computing community: I don't think there's any problem that doesn't have multiple scales involved in both time and space, where things have to be approximated.
One of my collaborators, Doug Post, tells a story about being around, and he worked at Los Alamos and
Lawrence Livermore, and he was talking about they would have a computational science meeting once a
week, and they would be simulating how nuclear weapons worked and everything. This is back in the '50s,
and Edward Teller was one of the physicists who was doing that and was for many, many years the director, and Edward Teller has a lot of history in this whole arena.
But one of the interesting things that happens and that you see even in the '50s, and they realized this, is
that you couldn't do the computation exactly. And so what you were always doing was finding good
approximations.
And the story that's told by one of my friends is that he was a young graduate student up there for the first
time trying to defend this set of approximations, and Edward Teller is in the back of the room and says,
"Approximation has no physics. Throw it out of the code."
But the problem is, is it seemed to work because what they had done is they had experimentally identified
it, which brings about a lot of interesting questions about verification and validation.
So I talked about it from the scientist's point of view, but what does it really mean? I mean, I put up those
three things, and yeah, we have this continuously changing set of platforms, we have the continually
evolving and maintaining code base and everything.
The reality is, some of the symptoms are that you're seeing long and troubled software developments, you're seeing dysfunctional markets for support tools, you're seeing scientific results based on software that are sometimes not correct. In fact, some of these have been tremendously tragic, and gosh knows how
much it's really going to end up costing the U.S. economy at some point.
Scientific programmers have some interesting views: Computer scientists don't address our needs, there
isn't enough money. And really, as we understood and started studying this community, we understood a
couple of things. So we actually -- and you'll see my sidebar in a minute, which is kind of interesting how
we study this kind of thing. But what we ended up observing was a communication gap. And so let me talk
about the communication gap.
Yeah, Tom?
>>: (Inaudible)?
>> Larry Votta: Okay. So Tom's question is the scientific results based on software.
I'm on slide nine, so let me go and I'll show you exactly that. Okay?
>>: (Inaudible)?
>> Larry Votta: Yeah. The scientific results based on software is the problem of the verification and
validation. Okay? In other words, what happens, you have a code that tells you this, this, and this and you
go out in nature and that's the way this, this and this turns out to be so the code's correct. It then makes
another prediction that isn't correct. Okay?
However, what happens in scientific results based on software is that it not only becomes a tool for an
engineer to use, it also is a tool for scientists to use to figure out what the laws of nature are. Okay? And
sometimes you're wrong.
And a lot of times, like in experimental methodologies, we have hypothesis tests and we say, okay, we
have a confidence level and so we set up basically a statistical game and then we can probabilistically
reason about whether we can reject the null hypothesis and so on and so forth. Okay?
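As a minimal sketch of that statistical game -- with made-up numbers and an assumed 95 percent confidence level, not data from any real validation study -- the reasoning looks something like this in Python:

    # Hypothetical validation check: do simulated values differ from measured ones?
    from scipy import stats

    simulated = [101.2, 99.8, 100.5, 100.9, 99.6]   # made-up code outputs
    measured  = [100.1, 100.4, 99.9, 100.2, 100.0]  # made-up experimental values

    res = stats.ttest_ind(simulated, measured)       # two-sample t-test
    alpha = 0.05                                     # 95 percent confidence level
    if res.pvalue < alpha:
        print("reject the null hypothesis: simulation and experiment disagree")
    else:
        print("cannot reject the null hypothesis at this confidence level")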
>>: (Inaudible)?
>> Larry Votta: Okay. The symptoms of the productivity crisis are -- basically the worst case in productivity
is you get the wrong answer because no work actually got done and zero value got added, right, because
the numerator --
>>: (Inaudible)?
>> Larry Votta: The wrong results, correct. Incorrect result, right? Got it? Yeah. Sorry. That's a good
point. Yeah.
Okay. By the way, at the end of this talk people have access to the PDF online and here. I picked out two
cases of exactly that. One is a plasma physics result that was widely believed that stopped the United
States from entering into the ITER program, which is the international collaboration to build a
demonstration fusion toroid power generation device, and the second one was the Columbia Shuttle, which had a program called Crater, which looked at the impact of micrometeorites on the heat shield of the Space Shuttle wing leading edges and made certain predictions.
And, of course, we know that one was very tragic. Crater indicated that there wasn't going to be a problem for the Columbia to come back down. And so, consequently, nobody ever thought, well, what if
the computation is wrong, because the computation had always been correct up until that point. So it was
tragic at many different levels.
So let me talk about the communication gap. Two cultures separated by a common language. It's kind of
interesting. And in fact, one of my collaborators is a cultural anthropologist, and she loves to -- when she
gives this talk she does it much better than I do.
And what she talks about is that you have these people that speak the same language and can't figure out
why they aren't communicating. And that's exactly -- you know, the software engineer thinks they fixed the
problem already, the scientist just rolls their eyes in exasperation and tells them exactly what's wrong.
And so it's a conceptual disconnect even though the language is well understood by both parties. And it's
very fascinating. And it really gets to the point of exactly what you see when you see a meeting of software
engineers and computational scientists.
So scientific programming really -- what the software engineer really has to start thinking about a little bit is,
it's all about the science. That's what they think about. Or the engineering if you're talking about the
running of a crash code or a tidal surge program to see what's going to happen as a Category 4 hurricane
comes up the Mississippi River valley and so on and so forth.
Scientific and engineering codes are expensive. Codes live a long time, partially because the verification and validation of any code takes a very long time.
Performance really matters. Almost always, to get their value, those codes are pushed to the limit of the computational science.
For instance, current weather prediction models would love to be able to hit the tornados and other violent
storm events that occur in the western part of the United States. It turns out there just isn't enough
computational ability to give enough resolution to look for the formations of certain micro-storms. There
just is not.
And, trust me, it's been years of research and adaptive grids and adaptive time models, so on and so forth.
It's a real issue and will continue to be a real issue. It's also for high-energy physics. There's another
whole set of problems in that as well. Hardware platforms change often.
Think about the case where the hardware changes two or three times in the development cycle of your
software rather than the other way around. Okay? It gives a whole different set of project management
problems and portability problems and so on and so forth. It's all Fortran 77 and C++. That's what it is.
And this is partially because of the legacy codes.
And, in fact, this is very true of 5ESS and very true of any large legacy software system that lives a long
time. The tools get frozen in, so the ability to do abstraction and automation and any productivity that's
associated with building or modifying or evolving that software gets frozen in at the time those tools are
originally decided.
Because I basically ended up trying to put C++ in 5ESS, and I actually did some experiments on 5ESS,
and it was the most difficult thing because it was like first you had to get the runtime environments right,
then you had to get the debugging tools right, then you had to get the automated libraries that brought in
the pieces that it changed. I mean, it was just one thing after another, and you could spend years working
out all the problems.
Hardware costs dominate and portability really matters. If you're going through several generations of
hardware in the course of the development of the software application, portability is a big issue. Okay.
Yeah?
>>: So while I agree that the codes can live 10 to 15 years and so the hardware platform that you're
actually running on will change over that period, the class of the hardware doesn't really change. I mean,
we've had vector machines about 15 years and now we've had cluster machines about 15 years. So the
general class of the machine and, therefore, the way you have to program or write the programs, previously
it was vector-type loops and now it's all about communication and data distribution. That doesn't really
change. So, you know, I quibble there about saying it's really the platform changing. Because we've had
message passing on distributed memory codes for about 15 years now.
>> Larry Votta: Yeah. Yeah.
>>: And then the hardware costs dominating, I mean, sure, a certain computer can cost about a hundred
million dollars and it's bought maybe once every three or four years, but the programming costs at the
national labs, I mean, each of the budgets there is about a billion dollars per year, and there are about
three of them. So, you know, compare a $3 billion personnel cost per year versus a hundred million dollar
hardware cost every four years --
>> Larry Votta: So let me abstract two of the things because we could end up spending the rest of the time
discussing the two points you brought up.
The first point was that the style of supercomputer machines hasn't really changed, and actually their
architecture is relatively similar, and there are a lot of things around that that have allowed some amount of
abstraction and automation to be developed and, therefore, some productivity. And I agree with that. It
gets a bit more complicated because different applications do different things, and we can talk about that
offline.
The second comment has to do -- help me out. What was it?
>>: The hardware costs.
>> Larry Votta: The hardware costs dominate. It's a perfect observation and one that is very social
science and political oriented and not technologically oriented in the sense that the national budgets of the
national labs are very, very diffusely connected with the acquisitions of these systems and everything. In
other words, they all have to be approved by Congress, and there's influences there that don't allow for
direct feedback loops of improving productivity with the machine design.
And a perfect example is the new technologies that really need to be done on machines that are petascale
where you have a hundred thousand processors, you have to start talking about virtualization, runtime
virtualization, you have to start talking about -- the machine has to be able to keep track of itself in the
sense of the switching system, which could always tell if it was partially broken or not.
The technology has not come into normal computing play in general except very recently in certain types of
networking things and stuff. But those are very, very good comments.
Let me give you the idea -- first off, 35 years. We went in and studied several of the DOD/DOE benchmark
codes that are used for all kinds of things, modelling the deterioration of materials holding radioactive
waste, so on and so forth. And the bottom line is, is basically 10 years is about the development cycle of
these things.
And, in fact, right now this is the current projected profile of the Falcon, which is a new code -- I won't say
what it does because it's a DOD code -- but, in essence, this is its model, and this model is based on
historical, what they do with other codes.
So initial development typically takes five years. Serious testing by customers is another five years. You
then get into this maintenance and evolution kind of cycle from 10 to 25 years, so 15 years of real useful
stuff, and 25 to 35 years you get into a retirement phase. But, in essence, for 25 years you're getting useful
results out of these codes.
And so part of the problem is software engineers haven't seen this. Online transaction processing with
checks, so on and so forth, didn't last this long. It's been completely replaced with Java and other runtime
environments and transaction processing systems.
That's the problem. And why is that? Partially because for the scientists, the verification and validation
problem is a tough one, making sure that the code is really doing the right thing and doing it correctly.
And by the way, this is much more reminiscent of what 5ESS looks like, which started about 1978 and is
still being sold by Alcatel-Lucent today, and 1982 was the first ship, so we're at 2008, so 25,
26 years. So we're just about probably at this part. And 5ESS was probably the most successful digital
switch.
Anyway, let me continue on. The productivity gridlock then is a programming -- so we're still peeling away
this onion. We sort of see these things. Some of the assumptions don't match what software engineers
are doing.
But, in essence, what we actually ended up doing is going in, watching and observing scientists doing their
work and engineers actually using crash codes and other things to see what they were doing.
And the programming workflow has some bottlenecks in it. Developing the scientific programs, the serial
optimization and tuning, the code parallelization and optimization, and the porting. And these really turn
out to have two fundamental things. The manual program effort -- in other words, we still haven't gotten
enough automation here -- and the expertise.
And let me say a little bit more. Now, I want to talk about -- I'm going to come back to that in a minute, but I
want to talk about how we investigated this for the five years we were working on the problem.
What we did is embraced the broadest possible view of productivity, including all the human tasks, skills,
motivations, organizations and cultures -- therefore, your question about the national labs having a
$1 billion software engineering budget and only buying $100 million machines is a pretty valid model. The
problem is that it fits an organization in a set of cultures that don't create the feedback loops that are
needed for evolution of the general system -- put the investigation in the soundest possible scientific base,
both physical and social science methods, and this results in a three-stage research framework which we
found very useful.
So the stages are explore and discover, test and define, and evaluate and validate. And the goals are
develop hypotheses, test and refine models, replicate and validate findings.
And we use a series of methods: Qualitative, qualitative and quantitative, and quantitative. And so we go
the whole spectrum, everywhere from grounded theory all the way to quasi-experimental designs where we
tried to go through more of a biological metaphor of in vitro/in vivo, like what happens in software
engineering now. If you're a tool builder, you have a hypothesis that the abstraction offered by the tool and the automation offered by the execution of that tool are going to improve productivity. You don't first go and try to sell, let's say, the Office team on using this tool and convince their management to switch over to it in one day. What you do is you actually work on the tool and you
develop in vitro -- in other words, outside of the main development lines and everything else -- the fact that
it works and everything, and then eventually you have an adoption strategy and you move it in in a rational
way.
And, in essence, that's sort of a biological metaphor. You start with an in vitro version and you move to in
vivo, and you try to keep it quantitative, and sometimes you use qualitative methods because your
experimental designs assume certain things, and sometimes those assumptions are wrong and then you
have to go back. So this isn't a straight-line process as much as it's often iterative in its nature. And I
already mentioned that.
So let me go through this a little bit. So what we -- there are two major workflows that you see in scientific
computing that are troublesome. The first one is the scientists or set of scientists building the original code
and doing the science programming or investigating the science. So they're in this continual very rapid
evolution kind of thing of their application.
The other phase is where you end up having a code, like a crash code, and now you're just going to crash
the automobile, but now you want to do different automobiles and now you want to do different crash
situations, and so consequently the basic finite element model and metal deformations and all that stuff is
pretty much done. What you really need to do is just change the databases and everything.
So one is like developing the codes and the other is really running the codes, let me just say that. And we
tried to capture this a little bit. But this is the general model, meaning it also includes developing the code
originally.
And what you discover is that not only do you need the main scientists, you also need people who really
know how to write software, and then you need people who really know how to manage and maintain and
evolve software, and then you need to have people who know how to take an application and actually tune
it. There's a lot of different very highly specialized skills here.
So there are four distinct skill sets. The main science, the scientific computing -- just mapping it onto a
cluster and understanding the reliability performance characteristics of a cluster and how you might want to
organize your application on them is kind of an interesting set of challenges -- the scaling is also another
piece of how do I get that, and then the management.
So the skills are useful when they are synchronized through communication, collaboration, or exist in one
person. And I love this. What you observe, of course, is the teams that have good communication
protocols and everything and work together as a team can do this a lot better than a group that are very
disjointed and so on and so forth. And it even works best when you put the skills all in one person because
then we only have to assume that the person is sane and -- well, we made an assumption there.
But the reality is, is that what you find out is two skill sets are rare. Think of somebody who is at the lead in
scientific computing and also in the domain science at the same time and writing in both literature bodies.
Just go and look at people who are there, okay? Three skill sets are very rare, and four is like a moon
rock. Okay?
So that, in essence, is the problem. And the reality is, is if you think about the social institutional evolution
of the labs, the DOD labs, the Department of Defense labs, the Department of Energy labs and so on and
so forth, it's not one that really lends itself to great teamwork kinds of things, although that's what's been
asked of a lot of them. Okay?
So if you had to say, then, in a nutshell, what is the productivity gridlock, the productivity gridlock really
comes about because of the need for very, very highly specialized sets of skills being very, very tightly
coordinated and orchestrated to maintain and evolve either the software or to apply it to different situations
and run it for different cases.
And almost always, because these machines, the clusters that we talked about, the message passing, so
much of that stuff depends on the nature of the geometries of problems, for instance, in finite element
modeling and so on and so forth, that it's not just a simple reapplication, it's almost sometimes the whole
retuning of the application has to happen.
If you're going to change, for instance, the way the car crashes, you might want to take -- instead of doing car
crashes, think about just crashing 24-foot motorboats. Same kind of poles and, you know, more water in
there, but you can start seeing it's a complete redo of everything because of a lot of different pieces of the
application, although a lot of the finite elements and the basic science is all the same.
So I'm going to talk about -- so one other thing that we discovered is really kind of tragic, and what's going on is the scientists have tool complaints. And the reality is that the productivity in being able to either evolve and maintain an application or to run it in another situation, those abstractions and automations, which are sort of the bread and butter of software engineering, are, in essence, encapsulated in tools. And what we have in the scientific community is tools that are hard to learn, that don't scale, that differ across platforms, that are poorly supported and too expensive.
And, in essence, what we're looking at is probably to a certain extent the scientists not understanding, nor the phenomenon being generally understood, that once you use those tools to develop your initial application, they become frozen in, and the productivity proposition of those tools actually is frozen into that application forever, basically. And then you have to develop a business model that maintains those tools for 35 years. That's how long these codes live. Or at least have a maintenance and evolution strategy for them.
And what you see -- where does this come and get you? So what the software engineer doesn't see,
perhaps, is that in some fundamental sense the 35 years is a big deal. Okay? And that general computing
IDEs have different assumptions. The lifetime of the code slows the evolution of tools. The field is small and
specialized, so from the current business models that's a really bad thing because there's no scale
involved, so consequently these things are very, very expensive to build, and investment in tools is
insufficient.
Another important thing is -- this is really true of when we were studying some clusters and people were
using different Intel processors and then they ran over and were using AMD processors, and there were a
set of compilers that were being made available by Intel Research that were doing some very, very nice,
elegant kinds of parallelization detection and so on and so forth in old Fortran 77 codes and C++ codes.
And they actually got removed from the market because they didn't -- they were actually being maintained
by Intel Research, and they decided they weren't going to go that way architecturally so they just wanted to
take them off.
So sometimes even these things for business reasons get removed. And since they weren't part of the
open group community, it caused a really great problem in how clusters were being used for certain types
of computational science computations.
So this is revisiting that early point I made about the science coming out wrong because the computation
really wasn't valid. And, really, it comes down to -- here's what the issue really is, and scientists have to
think about this a lot harder. Trust in the validity of the computational outcomes is a key productivity issue.
Because if it's wrong, you've got no productivity. Okay? The value is zero. Okay? And you could have
spent whatever you spent.
How do scientists build confidence in their codes? They look right. These are real comments when you go and ask the scientists about it. Four or five years to get some confidence in the code. In other words, they run it on a lot of different problems they can figure out and everything, and then, you know, you sort of -- what you find out, of course, is that the domain scientists, the really good domain scientists, have a set of
ideas about the verification and validation sequence because they understand the weaknesses of
computational strategies and so they even have some ideas about how to do this. But no way is that
codified or shared in the community. So not even common knowledge like that gets shared.
Oh, and this comment was really great. It's sort of an inverse flip. We were talking to some people about,
well, what happened when you abandoned this code and everything? One guy was -- it was actually a
woman scientist. She was very upset because she had just finally gotten to understand and have some
confidence in the code and they decided not to fund it anymore. And now she had to start out at the
beginning again, which was pretty -- it was very hard.
So the point here is, the problem is that you always have to be careful about this just like you do in any
experimental methodology. But it's an area where I think scientists have to innovate and manage threats to validity, meaning threats to the interpretation of a particular experiment. They just have to get there, and work's needed here.
So breaking the gridlock in software engineering. So the answer is, is it's not like software engineering can
say, here, take two of these and come back and see me in a week. Okay? The reality is, is that the two
communities have to work together, and they have to sort of close that gap in communications that has
grown up between the two of them.
And in some sense software engineering has a general paradigm of automation, abstraction and
measurement to introduce, and the scientific programming community has to start thinking about
investment and modernization. And there are some common things that they have to work on together, like
how best to use computers to help with the verification and validation program.
Let me give you a little anecdotal story about how 5ESS used to do it. We built computers to generate load
on a 5ESS machine. We had switches that basically did nothing but make telephone calls on a 5ESS
machine so we could test it.
Well, you know, that's one way of doing it. Not the only way of doing it. But those are the kinds of things
that you have to start thinking about because, after all, there is no productivity proposition if the end results
are incorrect.
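In the same spirit, here is a minimal sketch of what an automated verification check on a numerical kernel can look like -- a generic illustration of mine, not anything 5ESS or any particular lab actually ran -- where a known analytic answer guards against a code change silently breaking the kernel:

    # Hypothetical automated verification check: trapezoidal integration of sin(x)
    # on [0, pi] has the known analytic answer 2, so any code change that breaks
    # the kernel shows up as a failed assertion in the regression suite.
    import math

    def trapezoid(f, a, b, n):
        h = (b - a) / n
        total = 0.5 * (f(a) + f(b))
        for i in range(1, n):
            total += f(a + i * h)
        return total * h

    def test_integration_kernel():
        numeric = trapezoid(math.sin, 0.0, math.pi, 10_000)
        assert abs(numeric - 2.0) < 1e-6, f"kernel regression: got {numeric}"

    if __name__ == "__main__":
        test_integration_kernel()
        print("verification check passed")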
So let me talk about some empirical studies. I like doing empirical studies. I am an empiricist at heart. I
put up these two myths, and I wanted to be a little bit controversial, and I'll tell you what the experiments
were and so on and so forth.
So what we did was in the computational science arena we actually identified some domain scientists who were building computational codes for a plasma fusion machine, for toroids, and in fact we have -- myself being a high-energy physicist, and Eugene Loh, who is one of the other collaborators here. Eugene has his Ph.D. in Computational Physics from Caltech. And so Eugene actually went and became an apprentice to the person at the Princeton Plasma Physics Laboratory who wrote the code on -- and I'm not even sure. I'm going to say gyrokinetic whatever. But anyway, it models the energy loss mechanism in a tokamak.
And so what Eugene undertook to do after he did the NAS Parallel Benchmarks is he wanted to go and see if he could rewrite the code to be more expressive and much more productive. And so we actually went in and started doing some of this. We also had another scientist who was doing the electrostatic potential codes. The net-out was that we were able to improve productivity by about a factor of 10, and I'll explain
that experiment in a little bit.
Another computer science myth is the prescription of implementing a serial version first, then parallelizing for multicore and clusters. Experiments with programming teams indicate this is the wrong strategy. Suboptimal
solutions are achieved. Walter Tichy at the University of Karlsruhe has actually started doing experiments
on this. And at the International Conference on Software Engineering there was a workshop before the
conference on multicore systems, and Walter showed me some of the results of some of the initial
experiments, and it certainly indicates that the teams that achieved the best parallelization -- that got the best running code with the fewest defects, and so on and so forth -- were the ones who tried directly to go to the parallel implementation.
Apparently the serialized version creates certain barriers because now it works and it works good enough
so you don't actually go back and redo certain things. So you make these engineering trade-offs, so you
end up with a mixture of serial and parallel. And it's kind of fascinating, but it's not -- it's past the scope of
this, but it's an interesting kind of investigation that software engineering can do to get to the heart of this,
which is actually one that I think is kind of fun.
So what we did was look at the NAS Parallel Benchmarks and port the Fortran 77 version to a more modern Fortran 90 version. We did these four things. The result: We reduced the source code by about a factor of 10, and because we reduced the source code by a factor of 10 we inferred that we reduced the cost by about a factor of 10, since any of the COCOMO cost models or anything for software development are all basically linearly related to the size of the software at these scales.
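As a back-of-the-envelope check on that inference -- assuming the textbook basic COCOMO organic-mode coefficients, which are my assumption here rather than numbers from our study -- the effort scaling works out as follows:

    E = 2.4 (KLOC)^{1.05} person-months
    E_{F77} / E_{F90} = (KLOC_{F77} / KLOC_{F90})^{1.05} = 10^{1.05} ≈ 11

so a tenfold reduction in source size maps to roughly a tenfold reduction in estimated effort.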
And something else surprising happened. Here's one of the NAS Parallel Benchmarks. Now, the NAS
Parallel Benchmarks takes some of the more successful computational fluid dynamics codes or other
kernels of the application and uses them to benchmark different architectures, and so on and so forth, and
they're all written in Fortran 77. And then they also have a specification, and they talk about some type of
iterative methods to solve things using multicomputers, and they have MPI implementations and so on and
so forth. But what scientists have been trying to do is, obviously, even keep that at a low level and just give
you a cleaner representation.
Here's a Fortran 90 high productivity version of this. The reality is, is that -- the point being is that Fortran
77 -- and you saw earlier that if you go and look at the NAS parallel benchmarks for Fortran 77, you see a
lot of configuration testing, machine testing. There are a lot of non-portable elements in it that have been
sort of (inaudible) out by testing certain machine configurations.
But the bottom line is, is that this is much harder to understand and match to this specification than
something like this. And the reality is, is that what software engineering knew already was that
representation is a big deal. And not retranslating the specifications over and over again is another big
deal. In other words, matching the actual representation that gets compiled to the formal representation.
And this is the sort of work that Parnas has done over many years, and John Gannon I think was
the one who did the thing just about simple comments in the mid '70s.
But the bottom line is that you can see that not only is the code much more compact and clear but that at
some point I don't have to worry about the verification and validation problem. I have to prove that my
compilers do the right things, but I don't have
to show -- but just by general inspection I can reach the conclusion that I've implemented what the
specification asked.
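If it helps to see the flavor of that in a language this audience writes every day, here is a loose analogy in Python/NumPy rather than Fortran -- entirely my own illustrative example, not the NAS benchmark code -- where the explicit-loop version plays the role of the Fortran 77 style and the one-line array expression plays the role of the Fortran 90 style, both computing the same residual r = b - A x:

    # Illustrative analogy only: loop-heavy style versus array-expression style.
    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.normal(size=(200, 200))
    x = rng.normal(size=200)
    b = rng.normal(size=200)

    # "Fortran 77 flavored": explicit indexing, easy to get subtly wrong.
    r_loop = np.empty_like(b)
    for i in range(len(b)):
        s = 0.0
        for j in range(len(x)):
            s += A[i, j] * x[j]
        r_loop[i] = b[i] - s

    # "Fortran 90 flavored": the code reads like the specification.
    r_array = b - A @ x

    print("max difference:", np.max(np.abs(r_loop - r_array)))

The point is not the language; it's that the second form stays close enough to the mathematical specification that much of the inspection burden goes away.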
So concluding remarks. Well, of course, yes is the answer, and computational scientists need to realize
that their greatest strengths may be their greatest barriers to improvement, and software engineers need to
rethink some of their solutions, and the two communities have to work together. And that's really the
bottom line of the talk.
And hopefully -- I wanted to mention some of my collaborators. The particular coauthors on this paper, which was written for the IEEE Software magazine that was just out this month, I think -- and, of course, we did get it accepted -- were Stuart Faulk and Susan Squires and Michael Van De Vanter. Susan, Michael and I formed the productivity team at Sun, and then Tom Nash is a physicist, Doug Post is a plasma physicist, Walter is at the University of Karlsruhe, and Eugene Loh, Declan Murphy, Chris Vick and Allan Wood are at Sun Microsystems still, and all were involved in the productivity work and some of this.
You can actually go and look up a lot of these. A lot of individual results are written up, and I would prefer,
rather than deep-diving into one, I just wanted to sort of give an overview, because to me there are two important elements to this research: the composition of the studies to see the whole and answer a big, significant question, and the second is how you do little studies in that framework to actually help yourself and make sure
your understanding is correct. So the bibliography talks about some of those important studies and case
studies we did, the different communities that get involved and this kind of stuff. I think I gave a pointer to
Walter's paper on multicore programming if people are interested in looking at that. And that's actually a
very new result. That's a result this year. But it is something that occurred to us to ask the question a few
years ago: Is that the right prescription, do the serial version and then rewrite it for parallel. And it doesn't
look like it is the best solution to the problem.
Let me see. Was there anything else I have? No, that's it.
Questions?
(Applause)