>> Rob DeLine: It's my great pleasure today to introduce Steve Easterbrook from the
University of Toronto. He is quite the rarity in software engineering circles. He is a
software engineering professor that likes to study how software is actually made by
watching it being done.
So today he's going to show us how he's watched climate scientists produce software and
what it looks like from their perspective.
>> Steve Easterbrook: Okay. Thank you. I don't know how well you can hear me. I
assume the mic is picking up something.
So what I'm going to talk about today is a two-month observational study that I did last
year at the U.K. Met Office, Hadley Centre, looking at scientific software and how it's
built. In this case a large hunk of Fortran.
So what I'll attempt to do -- whoops, I'm confusing myself now -- is try and explain why
climate scientists build software, what this software is for, what is it supposed to do for
them; a little bit -- I'll have to give a little bit of the background of what this software is,
what's the domain and what are they simulating; and then the core of the study is how do
they build software, what are their software practices, and I want to draw out some of
what they do that I find interesting from a software engineering perspective; and then talk
about, well, can we as software engineers help these scientists.
And if the answer is no, and in many respects unexpectedly it is no, what else can we be
doing and how else can we engage with climate scientists, given the importance of the
field.
So my motivating question going into this study was I'm a software engineering
professor, these people build software largely without any software engineering training.
Surely we must be able to help. We know how to build software.
So that was my motivation going into the study, to look for places where software
engineering tools and techniques could help these people.
And going into it, we didn't know much about what the software engineering practices
are amongst computational scientists. There have been a few previous studies, not very
detailed studies, and there are only -- probably I can count them on one hand. And they
point to things like: the modelers -- modelers meaning the people building these large
simulation models -- don't have a software engineering background. Their code is very
long-lived; it lives for decades and they keep tinkering with it for decades. It's highly
optimized typically for high-performance machines.
And these people resist software engineering tools. Their experience in the past of
software engineers coming in and saying, oh, you should be using our IDEs or you should
be using these various different tools, they find that they're not well suited to their context
and they very quickly get fed up of software engineers coming in and trying to tell them
what to do.
So there's this immediate suspicion. So when I first made contact with this particular
group, there was this initial, yeah, what are you going to tell us that we don't already
know.
And when I started to explain that what I wanted to do was not tell them how to build
software but learn how they currently built it, then we got over that initial hurdle and they
said, yeah, we'd love you to come and see what we do.
The code, of course, is nearly all written in Fortran. And there's no way they're going to
change that in the foreseeable future. Fortran is highly tailored for what they need for two
reasons. One is they have the experience. The entire community knows Fortran and
doesn't know other languages. And the other is that Fortran is probably the best suited to
what they do, which is they take the scientific formulas, convert them into code and run
them on the high-performance machines. It's hard to imagine a language that's better
suited to that right now.
So they prefer to build their own tools rather than use anybody else's. And because they
know any tool that they adopt they're going to be using for decades, they have this worry
that if they buy it off a vendor the vendor will disappear after a few years and it will be
unsupported. And that leaves them then with a problem.
So going into the study, I had this naive idea that their process is the climate scientist has
some model in their head, some scientific theory in their head that they need to get into
code and run a simulation to produce some results, and that's their scientific process.
And that there are two key things along this path that could get in the way of their
productivity. One of them is the high-performance computing question: How do we get
the most juice out of our hardware, how do we get the simulations to run fast. And the
other is the software engineering question: How do we build code that we can trust as
quickly as possible.
And one observation that's been made of computational scientists is that they're spending
ever more time on the software engineering question, getting to working code.
And because of that they're not able to take advantage of Moore's law, so they're not
getting the improvement in scientific productivity that you'd expect from faster machines
because they're being slowed down by how long it takes them to build the code.
So which means there's this gradual shift in -- I mean, there's been tons and tons of work
in high-performance computing, and there's this gradual shift to say actually that's not
now the bottleneck; the bottleneck is the software engineering part. And we haven't spent
anywhere near enough time doing that.
So that's some of the background to the study.
If I'm going to ask eventually about software quality, then I have to worry about a frame
of reference for that question, what does quality mean to this community.
And if you ask climate scientists about software quality, they immediately do a mental
translation in their head to the term that they usually use, which is model skill.
So how well is the model simulating something about the real world that they're
interested in. They don't talk about code quality. They don't talk about the number of
bugs, defects, anything like that. They talk about skill. And to them that's what quality
means.
And so that means they don't ask questions about scientific productivity. You know, are
our code practices slowing us down. They don't ask questions about understandability of
code: can other scientists understand what we built and modify it?
They don't ask questions about reliability and -- or they do ask questions about
portability, because portability nearly always trips them up. Every time they upgrade to a
new supercomputer, which happens every five or six years, they spend -- everything in
the lab stops for six months while they port the code to the new architecture.
They worry a little bit about usability, but they don't do anything about it. These models
are hell to configure and run. I tried it. And I couldn't do it. If we've got time later on, I'll
show you the user interface for model configuration there.
So I want to define quality, then, as fitness for purpose, which is my favorite definition of
software quality. Fitness for purpose for this community means how good is it as a
scientific instrument. And the interesting thing here is quite often the utility of a model
does not depend on how faithfully it captures something about the real world. So climate
scientists are building simulations of the climate.
But quite often what they want out of a model is not an accurate simulation of a real
climate. They want the ability to ask an interesting question.
So I was talking just before we started with a couple of you, one of the things they'll do is
they'll take a climate model and they'll remove all the continents so we have an ocean
world and they'll play with that. And quite clearly that does not represent a real-world
scenario. But it's an interesting and useful scientific instrument. It allows them to ask
questions that they wouldn't otherwise be able to ask.
So quality doesn't always mean is it a good simulation of the world; it means is it a good
tool for checking my understanding of how the physical processes work in the world.
So that assumption of course is just built deeply in their culture. They know their models
are wrong. They know these models are inaccurate simulations of a very, very complex
physical process. And everything they do is based on that knowledge.
So the other thing I should say is for the study that I did how I approached the idea of
software quality -- and there are essentially four different ways of measuring quality. I
think this is set out very well in Andreas's [phonetic] book. I think this is where this
originally came from, although he didn't have the pictures.
Most of what I focused on is process quality, because that was what was observable of
these people. I went in and looked at their processes and said how well do they match
what we think we know about good software processes and where do they differ, where
are they doing things different from what the literature says should be a good software
process.
I also spent a little bit of time looking at quality in use. When they try and use the
models, what happens. And I didn't spend much time on the other two.
One of my grad students is doing a follow-up study running static analyzers over some of
these climate codes and trying to correlate observations of software defects with what
we've seen in the process quality and the quality in use. And that's a fascinating
follow-up study. I'll mention that briefly at the end.
So I started my study with five initial questions. What do they understand by correctness,
which then boils down to how do they know they can trust their code. How do they
ensure reproducibility, because if they're running scientific experiments on this code, how
do they repeat an experiment. How does a large community of scientists engaged in
building these codes develop a shared understanding such that they can coordinate their
activities. How do they prioritize the work, how do they figure out what to do next. And
how do they debug the code.
So those are my guiding questions. But I should also point out I approach the study as an
ethnography, as me going in as a stranger, as someone that's not familiar with this
domain, looking for things that were surprising to me. And many of the things I looked
at were not surprising to them at all. They said, well, of course that's what we do.
So one of my guiding principles was surprise to me as an outsider. And so many of the
things that I'll remark upon were things that I found strange. And other people might not
find them strange. So there's a lot of my bias in there, as you'd expect in an ethnographic
study.
So before I show you what I saw happening, I should just explain what this software is
that these people build. So this is a little detour, which could be arbitrarily long,
depending on how interested you are in it, of what this code is.
Sorry, I shouldn't go there quite yet.
So, first of all, it's very important to understand that our knowledge of climate change, the
basic knowledge of climate change, does not derive from these models they build. It
derives from the basic physical properties of greenhouse gases which were all known in
the 1800s, so all the basic properties of greenhouse gases were worked out
experimentally in the 1800s.
The first calculation of climate sensitivity, which in physical terms it boils down to the
question of if you double the concentration of greenhouse gases in the atmosphere how
much temperature rise do you get. That was all worked out in the 1890s using pencil and
paper from the basic physical equations.
Okay. So it was all worked out from first principles. So we knew -- and the number that
they got back then, which was about 3 degrees centigrade, is consistent with what the
very latest IPCC forecasts say, it's within the range of the error. So that's been known for
well over a hundred years.
So our basic understanding of what the burning of fossil fuels and emission of
greenhouse gases does to the climate does not depend upon these simulations. Okay.
So why do we need models? Well, why we need models, then, is to pin down our
understanding to some extent of the consequences and to make sure that we do
understand the physical processes, in particular we understand what happens at different
time scales.
So although I can calculate what the temperature rise is for, say, a doubling of CO2, I
don't know what time frame that will occur, because there's a lag. So how long is the lag.
How long will it take.
Looking at long-term trends and looking at the regional impacts, because that 3 degrees
temperature rise does not happen uniformly across the globe. So where does it happen?
The poles experience about twice as much temperature differential as equatorial zones.
So the temperature changes happen differentially across the globe.
So that's what we can use the models to understand.
And to separate out different causes, to separate out a number of different forcings that
are changing the climate and start to figure out which ones of them are having the biggest
effects.
And one of the things you can do in the model is turn off things. In the worst case you
turn off the sun and see what happens. But you can turn off other things. You can turn
off humans. You can turn off human emissions. You can turn off all sorts of things in
the model and look at the differences and just play with them.
Of course what the policymakers want are these last two things. They want to understand
strategies for mitigation, for reducing carbon emissions, and adaptation to changing
climates. You know, where are we going to get flooding, where are we going to get
sea-level rise, what infrastructure changes are needed, and what policies should we put in
place.
And now we've got a huge tension because the climate scientists are comfortable with the
stuff at the top; the policymakers are demanding the stuff at the bottom. And the
scientists would rather not run their models in a predictive mode. They would prefer
never to have to project forward and say here's what's going to happen over the next
century. They would prefer to play around with poking and prodding their models with
observational datasets that we have from the past and checking their understanding.
So to them the model is to check our understanding of some physical process. It's not to
make predictions for the future. Because making a prediction for the future, even 20
years out, to a scientist is useless. You'd have to wait 20 years to find out if you were
right, and by then it's not a publishable result anymore.
So those kinds of predictions aren't what they do. And hanging out at the lab and
listening to them talking, they were just getting ready for the next round of IPCC
reporting where IPCC sets a whole bunch of simulation runs that they would like to have
to put together the report.
And the scientists are sitting there saying how can we get these done as quickly as
possible so we can get back to doing science. And that's their philosophy: We want to
get back to doing the science. We don't want to hand the policymakers the stuff they
want. And if we do, let's make that as quick and simple as possible.
So here's the core idea of what a climate model is. This is a very, very simple model.
You just add up energy as it moves around the planet. So what's the total energy
incoming from the sun. What gets reflected by various surfaces, reflected by the surface,
reflected by the clouds. What gets absorbed by gases, when does that get released again,
where does it get released. So it's how does energy move around in the system.
And when I said there the initial numbers for climate sensitivity were worked out in the
1890s, this was basically the equations they were playing with, what's the energy balance
of the earth and what's the new temperature that you have to get to to make sure all the
energies are in balance if you change the composition of the atmosphere.
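To make that energy-balance arithmetic concrete, here is a minimal Python sketch of a zero-dimensional calculation of that kind. The constants are standard textbook values and the 3.7 W/m^2 forcing for doubled CO2 is the commonly quoted figure; none of this is Met Office code, and because it has no feedbacks it gives only the bare no-feedback response of roughly 1 degree rather than the roughly 3 degrees quoted above.

    # Toy zero-dimensional energy balance, in the spirit of the 1890s
    # pencil-and-paper calculations described above. Textbook constants only;
    # nothing here comes from the Met Office model.
    SIGMA = 5.670e-8   # Stefan-Boltzmann constant, W m^-2 K^-4
    S0 = 1361.0        # solar constant, W m^-2
    ALBEDO = 0.3       # fraction of sunlight reflected straight back to space

    def equilibrium_temperature(extra_forcing=0.0):
        """Temperature at which outgoing longwave balances absorbed sunlight.

        extra_forcing is an additional downward flux in W m^-2, a crude
        stand-in for changing the composition of the atmosphere.
        """
        absorbed = (1.0 - ALBEDO) * S0 / 4.0 + extra_forcing
        return (absorbed / SIGMA) ** 0.25

    t0 = equilibrium_temperature()
    t1 = equilibrium_temperature(extra_forcing=3.7)   # ~doubled-CO2 forcing
    print(f"baseline {t0:.1f} K, doubled CO2 {t1:.1f} K, delta {t1 - t0:.2f} K")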
Now, of course that's a very, very simplistic model. And what you really want is a whole
bunch of other feedbacks in there. For example, feedbacks where if you melt the ice at
the poles that changes the albedo, that makes -- that replaces white ice, which reflects
sunlight, with dark seawater, which absorbs sunlight. So, well, how does that play into
the system.
If you melt the permafrost, that releases methane. How does that -- and methane is a very
potent greenhouse gas. How does that play into the system.
If we change land uses, if we start planting new trees everywhere, how does that play
into the system. If we cut down trees and so on.
How do volcanos, volcanic eruptions perturb the system.
So these are the stuff that the scientists want to play with. They want to play with all
these interesting feedbacks, put those into the simulations and see what happens.
So the core of the model is the earth divided up into cubes. So they take the
atmosphere and divide it into a huge number of cubes and take the ocean and divide it
into a number of cubes. And for each cube at each time step in the simulation you solve
the equations of fluid motion. And that's it. That's basically what the climate model is.
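As a structural illustration only, here is a toy Python sketch of that "divide it into cubes and step the equations forward" idea. It diffuses a temperature field on a coarse latitude-longitude grid rather than solving the real equations of fluid motion, and the grid, time step and diffusivity are all invented; the point is just the shape of the computation -- a value per grid cell, updated from its neighbours at every time step.

    import numpy as np

    # Toy version of "divide the atmosphere into cells and step them forward".
    # This is simple heat diffusion, not the real equations of fluid motion,
    # and the toy grid wraps around at the poles, which a real model would not.
    n_lat, n_lon = 36, 72                      # roughly 5-degree cells
    lats = np.linspace(-87.5, 87.5, n_lat)
    temp = 288.0 + 30.0 * np.cos(np.deg2rad(lats))[:, None]
    temp = np.repeat(temp, n_lon, axis=1)      # warm equator, cold poles

    dt, kappa = 1.0, 0.1                       # arbitrary toy constants

    def step(field):
        # each cell is nudged toward the average of its four neighbours
        laplacian = (np.roll(field, 1, 0) + np.roll(field, -1, 0) +
                     np.roll(field, 1, 1) + np.roll(field, -1, 1) - 4.0 * field)
        return field + dt * kappa * laplacian

    for _ in range(1000):                      # one "run" = many small steps
        temp = step(temp)
    print(f"global mean after the toy run: {temp.mean():.2f} K")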
Now, of course, then you have to put all sorts of interesting boundary effects in there.
You have to tell this model where the land masses are, where the mountains are, where
there are different kinds of vegetation, where there are sources of greenhouse gases. And
so there are all sorts of parameters that you have to add to the model to make a more and
more realistic simulation of the earth.
But the core essentially is this huge, big computation of fluid flow.
And here's an interesting observation. In the last 20 or so years, there has been
essentially no change in how long it takes to do a run of a climate simulation. And yet --
well, wait a minute, what about Moore's law -- don't we get a doubling of processing
power every, whatever it is, 18 months, which is a dramatic improvement in processing
power over 20 years? And yet it still takes the same amount of time to do a climate run.
What happens is every time they get a faster machine they just increase the resolution.
They increase the resolution of the grid.
So the driving constraint over how long a climate simulation takes -- there's only two
constraints, really. It's how patient a scientist is -- how long they're willing to wait for a result -- and how
much time can they get on the local supercomputer to run their experiment. And that's it.
And the more time that they have available, well, they'll just up the resolution of the
model because resolution is everything to this community.
So this gives you some sense for some recent models of the size of the grid squares. So
HadCM3, which was the Met Office's main model about seven or eight years ago and the
one that went into the last round of IPCC reports, had grid squares of -- I've forgotten
what those are -- 270 kilometers on a side and 19 levels in the atmosphere.
HadGEM1, which is its replacement, has -- where are we -- 38 levels in the atmosphere
and grid squares 135 kilometers. There's a newer generation of models further than that
that goes up to 78 levels in the atmosphere and the grid squares are getting smaller and
smaller. But that's the kind of resolutions they're playing with.
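A rough back-of-the-envelope, using the grid spacings and level counts just quoted, shows why a faster machine simply disappears into resolution: halving the horizontal grid spacing roughly quadruples the number of columns and, via the usual stability constraint on explicit schemes, roughly halves the time step as well. This is only an illustrative sketch, not how the Met Office actually costs their runs.

    def relative_cost(dx_km, levels, ref_dx_km, ref_levels):
        # columns scale with 1/dx^2, cost per step with the number of levels,
        # and the time step shrinks roughly in proportion to dx
        columns_ratio = (ref_dx_km / dx_km) ** 2
        levels_ratio = levels / ref_levels
        timestep_ratio = ref_dx_km / dx_km
        return columns_ratio * levels_ratio * timestep_ratio

    # HadGEM1 (135 km, 38 levels) relative to HadCM3 (270 km, 19 levels)
    print(f"~{relative_cost(135, 38, 270, 19):.0f}x the compute per simulated year")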
So you take your climate model, you stick it on the supercomputers, and you set them
running. And it will take typically -- I've got some numbers here -- about 30 minutes of
CPU time to simulate one day of climate.
Which means if you multiply that up a century, so say you're interested in simulating the
entire 20th century, that's about 50 days that you have to run that simulation for on -- you
know -- some of the fastest machines in the world.
And here's what you get. So this is what I was showing at the beginning, and I'm just
going to quickly -- for those that weren't here at the beginning, just show this run.
Because I find it beautiful and fascinating.
This is a visualization of one month in August of climate showing basically precipitation.
So this just shows you where it was raining. So anywhere that's white is light rain and
where it's orange is heavy rain.
And the importance of sharing this is to understand that what falls out of these models -- I
said they're basically solving fluid flow -- you see real weather patterns emerging. You
see tropical cyclones in Japan. You see the North Atlantic current blowing the rain onto
the U.K. You see the Indian monsoon occurring.
And none of that is programmed into the model. This is all emergent properties of those
basic physical equations for how heat and energy and water and mass are transported
around the planet according to where the land masses are, what gravity does, and so on.
As soon as you see this, you say -- this is what convinces me that these climate models
are incredible scientific instruments, because they can simulate -- this isn't real
observational data, but it behaves in the same way that the real climate system behaves
such that when we play with the model we can believe that this is as good as it gets to
doing real experiments on a planet-wide scale, which, of course, is what we can't do.
So let me stop that and carry on.
So climate is a very complex system. There's all sorts of sources of uncertainty. There's
measurement error. The observational data that we're testing the models against has
errors in it. There's variability in the physical processes. So although we're studying, for
example, global warming, superimposed on that warming are all sorts of other cycles that
completely swamp the warming.
So the short-term cycles, the annual and decadal cycles are much stronger than the
warming signal. So you've got to kind of pick out the signal from the noise.
And then there are of course model imperfections. The models are imperfect. You
cannot simulate everything that you'd want to simulate and have it run in a reasonable
amount of time. So there's all sorts of imperfections in the models.
I'm just giving you an example of some of the tradeoffs they have to make. There are a
huge number of physical processes that we might want to put into a simulation. And this
graph shows you some of them, things that of course happen on different spatial scales.
So this is a logarithmic scale from millimeters up to -- where are we -- hundreds of
thousands of kilometers, and from microseconds up to tens of thousands of years.
So, for example, things like surface gravity waves, turbulent mixing, cloud formation and
so on -- we're interested in those at fine-grained scales.
Climate changes on a completely different scale. El Nino, seasonal cycles and so on.
So if you're going to do a particular run for a particular scientific experiment, you've got
to decide which of these things matter and therefore which physical processes you want
to put into the model and which you want to leave out. Because you can't ever put them
all in.
So they're continually making these engineering tradeoffs for each experiment. Which
things do I want in the model for this experiment and which don't I want. Knowing that
everything that you leave out effectively means the model is less perfect.
Okay. So that's the context. Let's look at how they build the software.
So the U.K. Met Office is one of the world's leading centers for climate modeling. There
are about 25 labs around the world that build these simulations. Each one has their own
model.
And, by the way, all the models are basically built at government labs. They're not built
in universities. They're built at government labs by government scientists. Because
universities just don't have the resources to do this.
So at the Met Office they have a shared code base with lots of different models. They
call it the unified model. It's a huge hunk of Fortran. It peaked last year at about a
million lines, a million lines of code. I've got some pictures later on to show you how it's
grown. And that unified code base is used to build both weather forecasting models --
NWP is numerical weather prediction.
So the Met Office in the U.K. is an operational weather forecasting center. In fact, they
provide weather forecasting services for about half the planet, for civil aviation, for
military operations, for the media, for all sorts of commercial outlets. They are an
operational weather forecasting center.
And out of the same code base they build the climate models. So a lot of the core
routines, a lot of the numerical routines are the same in a weather forecasting model and
the climate model. What's different is the scales at which they're looking at things.
They have this very -- I call it a hybrid development process. It's very much bottom up.
The scientists themselves decide what's important to work on and what they want to add
to the model. But they also have some top-down management priorities to say there are
some scientific priorities, there are from the operational forecasting side of the house
certain business goals they're trying to satisfy, and those are superimposed upon this
community of scientists who are sitting there saying here's what I want to do with the
code.
And nearly everything they build is in house except they're starting to engage a little bit
in the idea of a community model. So this is a community of scientists across the world,
or in some cases across a particular country, who are working together to build a
particular module.
So one example of that, for example, is the U.K. Atmospheric Chemistry Group who
have built a model of atmospheric chemistry, and they're trying to incorporate that into
the Met Office's model. So that's code from outside the lab being incorporated.
So the Met Office has about 50 different software development projects in house at the
moment. Of those, the unified model, which is their core simulation model, that's the
biggest project they have going. It's currently -- I said it peaked at about a million. It's
down to about 850,000 source lines of code.
Of that, nearly a third has changed in the last two years. So the amount of code churn
here is phenomenal. And on the climate side of the house, there's about 170 scientists; on
the weather prediction side of the house there's about 300 scientists, scientists with
background in meteorology, numerical analysis, climatology, and atmospheric chemistry,
oceanography, a few other things. And supported by small teams of IT specialists. And
I'll talk about their role in a few minutes.
All of the code is built by the scientists. So virtually every line of code originated from
one of those scientists.
With each release -- they do a new release of the model about every four months, each
release has about a hundred people who have contributed code to that particular release.
So here's my graph of the code growth. The green line is lines of code -- lines of Fortran
on this scale here. So you can see a million at the top there, and that's where it peaked
with version 6.6.
The blue line is the number of files on the right-hand scale, so about 3,000 modules
currently. And what's interesting about this curve here to me is -- I said I was looking for
things that surprise me. It's approximately linear over a 15-year period. I can't get
records back older than 15 years. I trawled through their archives. And so over a 15-year
period they've seen an approximately linear steady growth of that code base.
With -- there's two little perturbations here, and those are easily explained. You can ask
them what happened. Here they stripped out the dynamical core of the model -- the basic
core numerical methods were considered old fashioned -- and replaced it with a new core.
And of course that basically stopped all other work in the lab for nearly a year while they
got that working.
And in hindsight they say we tried to change too many things at once. And so when
we're up again for replacing the core, we don't do it all at once again. So that was that
glitch there. It took them till version 5.1 to get everything working again.
And this little glitch here, they stripped out the old ocean model. They were using an
ocean model from GFDL in Princeton. And they threw that away and used a new ocean
model from a group in Paris. And the new ocean model is a lot more compact. So you
get a drop in code there.
So apart from those two glitches, it's a linear growth over a long period of time.
What's driving that change -- I said it was very bottom up. Here's what happens. There
is -- the three colored blobs are basically the three drivers of change to this code base. So
one is physics research. So new insights, new data, new papers published about the
physical properties that they're trying to simulate in the model. So they're incorporating
new research into the model.
And the other two blobs are day after day after day after day running the models,
comparing them with observational datasets, comparing them with other people's models,
and feeding back improvements into the models. So both the weather prediction side of
the house and the climate modeling side of the house are doing this every day.
And, in fact, that's where most of the changes come from. They come from this
continually running the model, playing with it, and trying to improve it.
I've drawn these smaller. There's an occasional attempt to clean up the code. Actually,
it's not to scale. It should be a tiny little cloud. You probably wouldn't even see it on my
slide if I drew this to scale.
There is very little code cleanup that goes on. And there's some good reasons for that.
One is -- I talked about reproducibility, the ability to reproduce an experiment. If you
clean up the code and you mess around with exactly what phenomena emerged from the
model when you run it over a long period of time, you've broken the ability to run an old
experiment again.
They have this constraint -- and I'll talk about it a little bit more later -- known as bit
reproducibility. So every change that you make to the model should preserve bit
reproducibility. What that means is you run the old model, you run the new model, and
the outputs should be identical down to the least significant bit.
And of course these are all real-numbered variables we're playing with. So down to the
least significant bit in a double-precision real number. The simulations have to be
identical.
If you break that, that's a big thing. There's this list on their Web site of changes that
break bit reproducibility. And they'll spend ages tracking them down to say can we get
rid of that.
Now, there's two reasons why bit reproducibility matters. One is because you want to
rerun an experiment and get exactly the same results again. But the other is that bit
reproducibility gives them a free ride when it comes to testing.
So if you think -- if it takes a month to run a century's worth of climate simulation, that
means testing this thing is hell. You have to wait a month for your test.
So what they do instead is they'll run it for let's say one day of simulation or even one
hour of simulation. And if down at the very least significant bit it's identical, that's a
good indicator that if they let it run for the whole century, they'd get the same result. It's
not guaranteed, but it's a damn good proxy for letting it run for the whole month.
And so they can run a very large number of experiments -- overnight unit tests, for
example -- test bit reproducibility, and if none of the bit reproducibility tests are broken,
that's a good indication that nothing did get broken.
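Here is a minimal sketch of what that bit-reproducibility check amounts to, written in Python over NumPy arrays purely for illustration. The field shapes and stand-in data are assumptions; the real harness works on the model's own output files, but the idea is the same: compare the raw 64-bit patterns of a short control run and a short run with the change, and require them to be identical in every bit.

    import numpy as np

    def bit_identical(old_output: np.ndarray, new_output: np.ndarray) -> bool:
        # Compare the raw 64-bit patterns of double-precision values,
        # not a "close enough" floating-point tolerance.
        return np.array_equal(old_output.view(np.uint64),
                              new_output.view(np.uint64))

    # Stand-ins for a short control run and short runs with two kinds of change;
    # in reality these would be the model's output for, say, one simulated day.
    rng = np.random.default_rng(42)
    control = rng.normal(size=(38, 96, 72))    # invented field shape
    refactored = control.copy()                # e.g. a pure code restructuring
    retuned = control + 1e-15                  # e.g. a change that alters results

    print("refactoring bit-reproducible?", bit_identical(control, refactored))  # True
    print("retuning bit-reproducible?   ", bit_identical(control, retuned))     # False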
You have a question?
>>: Yes, if you don't mind.
>> Steve Easterbrook: Yeah.
>>: So when it comes to something like testing, when you've been saying "the model" --
but there isn't one model, right? There's a huge family of related models depending on
what parameters you use or whatever. So how do they know that even if they preserved
one they haven't broken others?
>> Steve Easterbrook: Okay. I'll answer that I think in the next few slides. And if I
don't, ask the question again. Because it's important.
So some of the requirements for this model conflict. For example, from the weather
forecasting side of the house, the simulations must be very fast and give accurate
forecasts, and in fact they have a very precise set of metrics for forecast accuracy, and a
business goal of improving accuracy by a certain percentage every year. So accurate
forecasts are the most important thing for those people.
For the climate scientists, scientific validity is more important. They will accept a run
that's less faithful to the observational dataset as long as it's better physics.
So of course what you get is the weather forecasting people are always tweaking the
model. They're always trying to tune it to give very, very good simulations of current
weather situations.
The climate modelers would prefer it didn't get tuned quite so much. They don't care so
much about faithfulness; they care about is the physics in the model defensible, is this our
understanding of what's happening.
There's also tensions between components in the model that have different origins. So
the stuff that's developed in house, which is most of the code, is very tightly controlled.
The stuff that's contributed from outside is a real problem for them. It can take a good six
months' worth of work by several people to take an external chunk of code and integrate
it with this model, get it to work on their in-house platform.
So naturalizing external code is a big problem for them. And they would prefer that
didn't happen, but that's the way it is.
So the basic toolset is everything is -- all the code is controlled through Subversion.
Everybody works on a branch. They use Trac as their bug tracker. They've got graphical
code diff that they use quite a lot for playing around with differences.
They've built a custom user interface to Subversion and Trac to greatly simplify things. They
don't have to know really much about how Subversion works. There's only two or three
Subversion commands they ever use, and they've simplified the interface to those. So the
scientists don't have to understand Subversion that well.
They've got a custom build and a custom code extract system so that when they want to build a
particular operational model it goes and gets all the physical schemes that have been
chosen for that model, does the build automatically, sends it to the supercomputer and
starts the run. So all of that is at a push of a button.
So coordinating these big teams is a big challenge. So they're all working on branches of
Subversion. So here's how it works essentially. Each of the operational models, and
there's about eight or nine operational models in house built from this code base, each is a
separate branch in the Subversion repository. Each has its own release -- no, sorry, I take
that back.
Yeah. Sorry. Each is on a branch in the Subversion repository and has its own release
schedule. And the core code base itself, the unified model, has its own release schedule.
You saw on my graph of code growth each of the dotted lines was an official release.
One of the big problems they have is how often should they update their branch from
other changes, how often should they incorporate changes that other people are doing into
the branch I'm working on. If you do it too often, then, you know, you have to stop what
you're doing and spend time getting everything working again. If you do it too rarely,
you get too big an increment and everything breaks.
And just being aware of what changes are happening elsewhere in the lab. Across a team
of several hundred scientists, it's impossible to know what other people are working on,
and yet they desperately need to know.
So they have a very heavy reliance on informal communications. So they solve most of
their problems just by knowing who to go and talk to. And the interesting thing is
these -- so there's several hundred scientists. They're essentially in one big, open-plan
office. They occupy the entire floor of the building. And you can get pretty much from
any desk to any other without passing through any door or staircase. It is a huge, big,
open-plan office.
And they all say that's important to them. About five years ago, six years ago, they
moved to a new site in a different town. And previously there were -- the numerical
weather prediction and the climate research were in different buildings with a car park in
between. And they always say, oh, it's much better now; we don't have to walk across the
car park. So that's clearly important to them.
Oh, yeah. So I mentioned that. They make extensive use of wikis and newsgroups within
the lab to keep informed of what's going on. Lots of e-mail.
The other thing they do a lot is they form temporary cross-functional teams. So they've
got a particular phenomenon going on in the model, and the one that I saw when I was
there was a big problem where the Indian monsoon was too dry.
So they put together a team made up of different scientists from different disciplines
across the lab who will get together regularly over a six-month period and figure out what
they need to do to fix this.
And this is one of the ways that they learn who else is in the lab and what they know. By
participating in these teams on a regular basis, they all get to know who else is there and
what other people know. So that's how they maintain the social network.
They also are organized a little bit like a typical open source project with -- you've seen
studies of open source, they'll talk about the onion model where you have a project leader
and then you have a core set of members who decide what goes into a release, and then
larger groups of code contributors who are less heavily engaged in the team, and then the
people who are just users, people who report bugs and so on. So there's the passive users.
So there's a core set of people here who control the project.
And you don't get to be in the core until you've proven yourself. It's a meritocracy. So
the core changes relatively slowly and people are only accepted into the core of an open
source community when they've proven that they can -- they've got what it takes.
Well, that's how these guys organize themselves. They've got -- at the core they have
those systems teams. And the systems teams look after the releases.
So they'll plan a release on about a four-month schedule. And they'll take code
contributions from scientists anywhere across the lab to say you can submit your change
to this upcoming release. But it has to be ready by a certain date and there's a cutoff
about a month before the expected release where they freeze the code.
Actually they talk about a frosting rather than freezing, because they still allow some
changes.
And from that point on the systems team here work like crazy taking no more than three
changes each day, incorporating them into the trunk, running the overnight test harness,
and if everything worked, then they move onto the next set of changes for the next day.
And then outside of them are a set of code owners. So there's about 20 senior scientists
across the lab, each one assigned a particular chunk of code that they are the expert on,
and they have to approve all changes to that chunk of code.
And they have two stages of review. Before the change is actually made, particularly for
the large change, they have to prove this is even a reasonable thing to be working on.
And then once it's ready, they have to approve that it's now okay to accept into an
upcoming release.
They also have this -- I've drawn it kind of slim. They have this set of configuration
managers, one per operational model. So this is typically a scientist who ends up
spending about 50 percent of their time not doing science but worrying about how
changes are affecting one of those operational models.
So this is the answer to your question. They have these designated people whose job it is
to look after a particular configuration of the model.
And then they have, as I said, in any particular release about a hundred people who are
contributing code to that release. A large number of scientists, end users. And I didn't
draw it. There's another ring here. Another group of people who are preparing changes
for future releases. So just playing around with stuff.
If you look at the repository, you can see exactly what I described here. These are the
top -- where are we -- however many changes to the trunk of the UM over some period of
time. I can't remember what the time period is.
So there's only 14 people that contributed code to the trunk. And here, notice this. If you
know what the usernames are, the top one, two, three, four, five, six, seven, all their
usernames start with FR. That's the systems team for the weather prediction side of the
house.
The four usernames starting here with the "had" are the systems team for the climate
modeling side of the house.
Basically the only people putting code into the trunk are the systems team. But they
didn't write that code. That code all came from scientists who were contributing code for
upcoming releases.
And notice that the weather prediction people completely dominate the climate change
people in terms of number of changes and amount of code that's being changed.
So how do they do V and V in this environment? Lots of informal desk checking. They
don't have any formal unit testing process. There's no requirement to do unit testing, and
most of them just don't bother.
There are these two stages of review that I mentioned. But here's where nearly all the V
and V comes. It's continuous testing, set up as science experiments. I talked about bit
reproducibility, and I talked about the automated overnight test harness on the main
trunk.
So this one here is the one I haven't talked about yet. And this is the one that I find
deeply fascinating and unusual.
So here's what you do if you're making a change to your branch. You set your change up
as a science experiment. You've got a hypothesis that you're testing. If I change this
little bit of code here, I think I can make an improvement to the model that will have the
following effect.
And then you test that hypothesis by running the new code, using an old run of the model
just before you changed it as your control. So now I've got a scientific experiment with
two treatments, the control and the new code, and your observational data as your
measurement of how well you did.
And you push a button. When you submit the code, you push a button and it generates all
these visualizations of the results.
So what you see here is for a different model parameter. And, by the way, this is just the
top portion of a wiki page that goes on. There's about 30 different parameters that they've
picked out here that they want to see visualizations of. And each one is a four-up display.
And I should explain what the four-up display is.
So you see four different visualizations of this one model variable. The first is -- so in
this case what they're doing is they're experimenting with a new polar filter over the
Antarctic. So something that's supposed to improve sea surface temperatures over the
Antarctic.
So this is PMSL, which is pressure at mean sea level, for DJF, December, January, and
February. So the winter in the northern hemisphere, winter quarter of the year.
That's the raw result from the new model, from the changed code. This is the difference.
This is the delta between the control, the old version of the code and the new version of
the code.
So this is now where in the world did we make a difference. And the differences are
where they're expected to be. They're around the Antarctic. The biggest difference is
there. But there's this whole band. The differences are where they're supposed to be.
That's good.
This one here is the control minus the observational dataset. So this is how well was the
old model simulating on observed data for this period. So these are anomalies. This is
where -- anywhere that's darkly shaded here, this is where the model was doing badly
before.
This is the new code minus the observational dataset. So this is how badly the new
model is doing. And there's less dark blue. And there's less dark blue where there's
supposed to be less dark blue. So this was a successful experiment.
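The four-up display itself is easy to describe in code. Here is a sketch of the four fields being compared, assuming the control run, the changed run and the observations have already been put on the same grid; the grid size and the random numbers standing in for real model output are invented for illustration.

    import numpy as np

    def four_up(control, changed, obs):
        return {
            "new model":         changed,            # raw result of the change
            "new minus control": changed - control,  # where did the change act?
            "control minus obs": control - obs,      # the old model's anomalies
            "new minus obs":     changed - obs,      # did the anomalies shrink?
        }

    # stand-in fields, e.g. DJF mean sea-level pressure on a 2-degree grid
    rng = np.random.default_rng(0)
    control = 1013.0 + rng.normal(0.0, 5.0, (90, 180))
    changed = control + rng.normal(0.0, 1.0, (90, 180))
    obs = 1013.0 + rng.normal(0.0, 5.0, (90, 180))

    for name, field in four_up(control, changed, obs).items():
        print(f"{name:18s} mean {field.mean():8.2f}   max|.| {np.abs(field).max():8.2f}")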
By the way, success here is judged by eyeballing these graphs. That's the criterion. So they
look at them and they stare at them and they pass them around -- I sat in these meetings where
they've walked in with their latest pictures like this and they sit down and there's not a
word of explanation; four or five people in the room are handed these graphs. This is the
first time they're seeing them. And they say, oh, yeah, you did it, you got what you were
looking for. And they just go straight in. Nobody has to explain what they're seeing.
Okay. Because these visualizations are their common currency. This is what they spend
all their time staring at.
They also -- some of them. This is not a universal practice across the lab. But some of
them use their wikis as electronic lab notebooks. So this is one of those configuration
managers who's keeping track of every experiment. So every five-letter acronym here is
an experiment. It was a change to the code where one of those experiments was run
comparing the old version to the new.
And in the middle here he's written down a brief, few-word summary of what the change
was. And notice that he's got several different models he's trying it in.
So N96L38 is a particular resolution of the model with 38 levels in the atmosphere, and I
forget what 96 means in terms of horizontal resolution. But a particular horizontal
resolution.
N96L70 is a higher resolution model. And N144L38 is another different resolution. So
some of his experiments are in one resolution model, some are in another. And one of
the things he'll do is he'll try it out in one model; if that works, he'll try it out in a different
model.
So, again, in answer to your question, he's going around systematically looking does the
change work, first of all, in one model, and then how does it affect models at different
resolutions with different physical properties.
I think everything in red here was an experiment that failed. I better go back and check
this. I think everything that's in black was a successful experiment; everything that was
in red it failed, it didn't do what it was supposed to do. And the one that's in green I
haven't figured out. I don't know what that is. Anyway...
And then every single one of those experiments -- of course the reason they're blue is that
they're hyperlinks in the wiki -- leads to a lab notebook for that experiment: who did it,
what it was, what it was supposed to show, links to the visualizations of the output, brief
description of the results and so on. So he's got an electronic record of every experiment.
Here's another visualization that they commonly use for getting a bigger picture of how
well they're doing as they're improving the model.
So this graph here summarizes -- there's about 30-odd -- more than 30 core indicators of
model skill, each one typically expressed as a root mean squared error over an
observational dataset. So when you get down to zero, you match the observational data
perfectly.
And what they've done is they've taken all of those variables and normalized them so that
for each variable one is where the old model was. So that line there is -- this is what the
old model did. The colored dots are what the new model did.
So you've got a one-shot visualization of where are we getting worse and where are we
getting better. If you're above the line, you're doing worse; if you're below the line,
you're doing better.
And the whiskers are the error range in the observational data. So if you're within the
whisker, you're now within your target of where you want to be.
So everything in the new model that's within the observational whisker is green. That's
good. We got where we want to be for those variables. Everything that's red is a variable
we're doing worse on than the old model. Everything that's amber is where we're doing
better than the old model but we still haven't met our target range against the
observational data.
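In code, that normalized skill summary is a very simple scoring rule. The sketch below assumes each indicator is an RMS error against observations, rescales it so the old model sits at 1.0, and classifies it green, amber or red against the observational error bar; the indicator names and numbers are invented.

    def classify(new_rmse, old_rmse, obs_error):
        score = new_rmse / old_rmse        # 1.0 means "same as the old model"
        whisker = obs_error / old_rmse     # the target band, on the same scale
        if score <= whisker:
            return score, "green: within observational error"
        if score < 1.0:
            return score, "amber: better than the old model, not yet in the target band"
        return score, "red: worse than the old model"

    indicators = {  # (new RMSE, old RMSE, observational error) -- invented numbers
        "sea level pressure DJF": (1.8, 2.0, 1.9),
        "precipitation JJA":      (1.1, 1.0, 0.6),
        "surface temperature":    (0.7, 0.9, 0.8),
    }
    for name, args in indicators.items():
        score, verdict = classify(*args)
        print(f"{name:24s} {score:4.2f}  {verdict}")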
They also spend a lot of time doing model intercomparisons. So comparing their models
on standard scenarios with other people around the world. And there's a huge amount of
effort that goes into this.
They'll also do model ensembles. They'll take large numbers of different models and run
them over and over and over again to do probabilistic forecasts. And that's how a lot of
the forecasting, when they do forecasting, that's how it's done.
So let me try and -- that's what goes on in this lab. Let me try and summarize some of
what's happening.
First of all, an observation that actually I found surprising after trawling through the
literature, this approximately linear growth in their code is unusual over such a sustained
period of time.
If you go and look in the literature of people who have done these long-term studies of
code growth, and the classic ones were all done by Lehman on commercial systems, he
points out that all the commercial systems he studied have this approximately inverse
square curve.
As the code grows -- as it grows, as it gets bigger, growth tails off because of the growth
of complexity. It just gets harder and harder to change the model as it gets bigger or
change the code as it gets bigger. So somehow these scientists have escaped from that
trap.
Studies of open source, this is from Mike Godfrey's study of the Linux kernel, do tend to
have this linear growth. In fact, he showed that the kernel itself is approximately linear
over some long period of time. And it's super linear if you take into account all the
device drivers that are being added. Well, let's leave out device drivers. That just
complicates the picture. It's approximately linear if you just look at the kernel itself.
Yeah.
>>: [inaudible] how many projects in the world would you say there are with more than
a million lines of code? Is this common now or are there like ten in the whole world that
actually are that big?
>> Steve Easterbrook: That's a great question. And I don't know. Anybody else know?
>>: I'm just trying to say, though, that this project you're talking about in the context of
[inaudible].
>> Steve Easterbrook: Yeah, okay. I know it's big, but I don't know how many other
projects it compares to. That's a very good question.
Right. So one observation here is that it appears to have escaped the trap that lots of
commercial code falls into, that growth tails off after a while because of the complexity.
And so now I have to explain that. If it's more like open source [inaudible] what's
common about open source projects -- some open source projects and this scientific code
that's different from commercial systems. And my best hypothesis for that right now is
the domain experts are writing the code. And it might just be as simple as that.
In most commercial systems, the domain [inaudible] building financial software. The
domain experts don't write code. They have to explain to the programmers what's
needed. And then there's this big communication gap.
So one hypothesis is that in a lot of open source and in a lot of the scientific code you
don't get that communication gap, so you escape the complexity trap.
I don't know. It's a hypothesis that needs testing. We have nowhere near enough data on
code growth in different types of projects. So there just aren't enough data points to be
sure even the phenomenon that Lehman pointed out is really true.
What about defect rates? I attempted to do a back-of-the-envelope calculation of the
defect density comparing to some stuff that's been described in the literature.
So NASA space shuttle is usually held up in the literature as the best ever and the most
expensive per line of code ever built in the world. And they report about 0.1 failures per
thousand lines of code post release.
Well, in the unified model, let's say I take the last six releases. Over the last six releases,
the average is about 24, and it actually is a very small variability. It's somewhere
between 20 and 30 bugs per release with an average of 24. And an average of 50,000
lines edited per release.
So what does that mean? That means about two defects per thousand lines of code are
making it through their release process undetected.
Or if I expand that to an expected defect density of the entire code base, I get a number
0.03 faults per thousand lines of code in that code base right now.
Now, of course, this depends on what you count as a bug. You know, how did I count
defects. Well, I counted what was reported as errors in their bug tracking system. Do
they record all errors in bug tracking systems the same way that NASA would or military
systems would or Microsoft would.
Okay. I've got my grad student doing a follow-up study on this to try and get better
numbers, because, first of all, we're not sure if this is believable. If it is believable, it's
remarkable.
So, first of all, we better check these numbers actually are even within orders of
magnitude anywhere near accurate. And they appear to be.
And so the next question is, well, how do I explain that, how do I explain this relatively
low defect density compared to other types of software.
And, you know, we played around with, well, why don't they seem to have many errors.
Well, in large-scale numerical simulations, most of the coding errors that you could make
will be instantly obvious for a certain number of reasons. First of all, the model just
won't run. Or, secondly, it does run but it crashes pretty soon because some variable's
just gone out of range and the simulation's just gone haywire. And you only have to run
it once to spot that. And of course they're running every change lots and lots of times.
And then they've got these bit comparison tests to make sure that you didn't break
anything.
So it's a very, very conservative change process that doesn't let many bugs into the
released code.
And then we spend a little bit of time probing some of the bugs that did make it into the
release code to find out what happened. And let me tell you one story which I found
fascinating.
There was an error that had been in the code for a couple of years in the released versions
of the UM before it was fixed. And it was an error in the soil hydrology module, so how
much moisture is there in the soil, which of course affects how much moisture is passed
into plants which affects the evaporation of moisture into the atmosphere which affects
where the water is in the system.
The reason the error was in the code was because the routine in this soil hydrology
module was taken from a published paper that had done a study of soil hydrology, and
they'd got some parameters from this published paper and a couple of equations from the
published paper in this code.
And they'd mistakenly taken a logarithm in the paper and assumed it was a natural
logarithm when it was actually a log base 10. So they were off by -- what's
the conversion factor -- 2 point something in this equation. And so the soil was -- I don't
even remember which way it is now. It's too dry.
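The arithmetic of that bug is worth seeing, because it is such an easy mistake to make. A tiny sketch, with an invented coefficient standing in for whatever the paper actually used: coding a published log-base-10 relationship as a natural logarithm inflates that term by a factor of ln(10), about 2.3, which matches the "2 point something" conversion factor.

    import math

    b = 7.12                            # hypothetical coefficient from a paper
    as_published = b * math.log10(2.0)  # what the paper meant: log base 10
    as_coded = b * math.log(2.0)        # what went into the Fortran: natural log
    print(f"inflation factor: {as_coded / as_published:.3f}")   # ln(10) ~ 2.303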
And they'd been aware that they were having problems with moisture in the model
around soil and plant evaporation for a period of time. So they kind of knew that
something wasn't right, and they just didn't have time to fix it. It wasn't causing a big
enough change in the moisture elsewhere in the model that it became urgent to fix.
So for two years they kind of knew something was wrong, but they didn't know what. And
they just tuned that out. They said, well, we'll just add some extra tuning into the model
to compensate for that problem. We know it's there. One day we'll get around to fixing
it. But we can run over all our science experiments without fixing it because we've got
all sorts of other inaccuracies in the model and so it doesn't matter.
And then one day a scientist in the lab had a bit of spare time. He said I'm going to track
that bug down, we're going to find out what happened.
And the way he tracked it down was he went and got five or six other models from other
labs, looked at their soil hydrology modules, ran their simulations and compared those to
the Met Office's and discovered that in five out of the six models that he found they all
agreed with the Met Office model, but one model was noticeably different.
And so he then tried to track down where the difference was, and he pinned it down to
this particular equation in the model. And lo and behold the one that was different had
got the right logarithm in the model and the other five had all got the wrong logarithm in
the model. And, you know, they'd all shared code and shared routines, and so the error
had propagated across all these different models.
So they found it by comparing. So the fact that there are a whole bunch of other people
around the world building the same software for essentially the same purpose gives them
a huge advantage.
It's not clear -- I chatted to this guy at length about how he found the bug. And it's not
clear he would've ever found it if he didn't have the ability to go and grab other people's
models and compare them.
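As a toy illustration of that compare-across-models strategy -- entirely a sketch, with made-up numbers and a crude outlier test, not the analysis the scientist actually ran -- the idea is to flag whichever model disagrees with the rest and start reading its equations there, remembering that, as in this story, the outlier can turn out to be the one that is right:

```python
# Toy sketch of finding the odd-one-out among several models' soil
# hydrology outputs. All numbers are invented for illustration.
import numpy as np

# Mean soil moisture from comparable runs of six models (made-up values).
soil_moisture = {
    "model_a": 0.212, "model_b": 0.208, "model_c": 0.215,
    "model_d": 0.210, "model_e": 0.209, "model_f": 0.287,
}

values = np.array(list(soil_moisture.values()))
median = np.median(values)
spread = np.median(np.abs(values - median))  # robust spread (MAD)

for name, value in soil_moisture.items():
    if abs(value - median) > 5 * spread:
        print(f"{name} disagrees with the others -- a place to start comparing equations")
```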
Okay. So that allows me to -- let me start to wrap up and draw some conclusions.
I -- having spent this time at this lab and looked at what they do, I characterize it as an
extremely successful software development outfit. They build very high-quality code.
They are very careful about which tools they use, but they know that the tools that they
use are essential and they're resistant to trying out other stuff just because it's fashionable.
And they've managed to, you know, have this linear growth in code. There's no tail-off in
the growth of the functionality of their code over a very long period of time.
So what matters. It matters that they have a highly tailored software development
process. They don't do what the textbooks say you should do; they do what works for
them.
And over a very large number of years, a large group of very smart -- they're all physics
Ph.D.s. They're smart people. And they spend time thinking about how to make
improvements to their process.
So over many, many years they've tinkered with the process and they've evolved
something that's highly adapted to their environment, their context; not by listening to the
literature, but by playing around with stuff. And if it works, they adopt it; and if it
doesn't work, they don't use it.
As I said, the code developers are domain experts. And that appears to be crucial.
They have this very strong sense of shared ownership. So they're like a small software
startup in many respects, like an agile software development team. They all own this
code. And they all feel a very strong sense of ownership and a very strong sense that
we're all responsible for making sure it works.
So they have also that idea from open source of many eyes validation. There's many
people looking at this code and worrying about it. So the chances are people will look at
the code and find the errors. So they also have this openness. You know, the code is -- it's not officially open source, it's not given away freely. You have to sign a license with
them if you want it. They'll give it to any other research outfit for free if they sign the
license.
But pretty much the only people that use it are in house. There are a small number of
communities outside the Met Office that run this model, but the biggest group of users
are in house.
They do a lot of benchmarking work, these model intercomparison projects, where they
spend a lot of time running simulations over everybody's different models and
comparing them.
And -- oh, one thing I didn't mention is they don't have a fixed release schedule. So
although I said they plan a four-month release cycle, they don't decide a release date until
they're ready.
So they have a target in mind. And the systems team are running their overnight tests
every day until they've got every change folded in. But they don't announce a release
date pretty much till the day of the release. They say today we're done. We've got all the
changes folded in, nothing broke, we can announce a release.
And they won't release until it is ready. So -- because there's no external customers using
this, they can do that. It doesn't matter when the release happens. The release matters for
stability, but it doesn't matter because there's no one waiting there to buy it or to upgrade
or anything. And people are relatively slow to upgrade to newer versions of the model.
So highly adapted processes. They use all sorts of bits of agile practice, or things that I
would call agile practice. They don't use this terminology at all. So the ones that are
checked in green here are things that I think they use. The ones that are in red are the
ones that they don't appear to do anything like this. And the ones in yellow are ones
where they couldn't decide. They kind of sort of do it but, you know, it's not clear.
So it's interesting. They've picked and chosen bits of agile practice that work for them.
But they certainly don't adhere to any standard agile model.
They have a very strong sense of a shared conceptual architecture. They all know their
way around the code. They all know which bit of the code corresponds to which physical
process. And they've all got pictures like this in their heads of what the main units are
and how they interact.
And then there are also three interesting comparisons with open source. This release
schedule that's not constrained by commercial pressures; developers being domain
experts, which is, again, typical of the best open source projects; a core group of code owners
who very tightly control what gets accepted into the trunk.
A community that operates as a meritocracy. So the people that are best able to code do
the coding; the people that are best able to look at the scientific direction of the model
look after the scientific direction of the model and so on. And those groups are relatively
stable and change very rarely.
Oh, and here's my favorite observation. None of the people that write the code for this
model think of themselves as programmers. They're not programmers. They all are
scientists doing scientific research. The only reason they ever build code is because they
need it for the research. So it's like an open source developer who has a day job but
happens to tinker occasionally with tools because they want the tools to do whatever they
need in their day job.
So they're not programmers, they're scientists who just occasionally have to write code.
And then the verification and validation is based on extensive use by the developers
themselves.
Challenges. Let me pick out a few things where they are having serious problems. I
mentioned coordination. They do have a big problem with coordinating changes across
all the different branches in the lab and just knowing what else is going on and knowing
who else is making a change that somebody ought to fold in with something that they're
doing.
They really want to get into multisite development. The multisite development thing is
important because these models are now getting so complex with so many different
scientific modules from different disciplines. As soon as you put in plant biology
and soil hydrology and oceanography, you can't have all that expertise in one lab
anymore. You used to be able to, but they're getting to the point where they just can't.
So they want to have multisite development where some of the modules are built
elsewhere where there's expertise and imported into this model. And every time they've
done that, they've got into huge difficulty.
There was a big fight going on in the lab when I was there last summer over the new
ocean model. So the new ocean model is built by a group in Paris. And one of the things
they did with the old ocean model, which came from Princeton about 15 years ago, was
basically when they took that old ocean model and put it into the UM, they did a code
fork. They basically had to start making their own changes to make that ocean model
work in their code.
And of course as soon as they did that, they forked from the old ocean model. Which
means they got a state-of-the-art ocean model, but they lost the connection with the
original team, so they couldn't get all the updates to it as it got steadily better. They
ended up with an ocean model that just wasn't keeping up with the science.
So when they went to NEMO, which is this new model from Paris, they said we're not
going to do that, we're not going to fork. We're going to write an agreement with the
Paris folks that all the changes that we need to make the ocean model work in our model
will be folded into the baseline in Paris and looked after by the team in Paris.
And so now they're going to avoid forking, but they're buying themselves a whole bunch
of other coordination problems, because the team in Paris is basically four people. And
they have very strong -- the folks in Paris have very strong notions about where they're
taking their model in the future and which changes they should accept and which they
shouldn't.
And one of their overriding design principles is portability. They want their ocean model
to work everywhere, on everyone's different hardware, with all sorts of different models.
The Hadley folks want it to work very well in the Hadley model. So they're pushing
changes on the group in Paris saying you've got to change this in your model to make it
work in our architecture. And the folks in Paris are saying we can't make that change
because that breaks portability with everybody else's models.
So now they're getting into all these fist -- not fist fights, but they're getting into these
fights over which changes need to be made to this ocean model and who's going to be
responsible for those changes without forking.
And that's a big problem for them.
I think I might wrap up and -- let me say one more thing about future challenges. And
this crops up in the literature in climate science quite often. There's this long-standing
desire in the community to build plug-and-play models, to have each of the different
physical modules be plug and play.
You know, I take the ocean model from Paris and I just plug it in and have it work. And
that depends upon, you know, having an appropriate shared architecture, having
well-defined interfaces, having couplers that couple the different physical routines and do
all the scaling across resolutions and all the boundaries, and all sorts of crazy things have
to happen to make these modules work together.
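To give a flavour of what a coupler has to do, here is a heavily simplified sketch under invented names -- real couplers also have to handle things like conservation constraints, parallel decomposition and time interpolation, none of which appears here:

```python
# Minimal sketch of the coupling idea: the atmosphere and ocean run on
# different grids, so every exchanged field has to be regridded at the
# interface. The names and the nearest-neighbour regridding are
# illustrative assumptions, not any real coupler's API.
import numpy as np

def regrid(field: np.ndarray, target_shape: tuple[int, int]) -> np.ndarray:
    """Nearest-neighbour regridding from one lat-lon grid to another."""
    src_rows, src_cols = field.shape
    rows = (np.arange(target_shape[0]) * src_rows) // target_shape[0]
    cols = (np.arange(target_shape[1]) * src_cols) // target_shape[1]
    return field[np.ix_(rows, cols)]

def couple_step(surface_fluxes: np.ndarray, sea_surface_temp: np.ndarray,
                atmos_shape: tuple[int, int], ocean_shape: tuple[int, int]):
    """One exchange: fluxes go from the atmosphere grid to the ocean grid,
    sea-surface temperature comes back the other way."""
    fluxes_on_ocean_grid = regrid(surface_fluxes, ocean_shape)
    sst_on_atmos_grid = regrid(sea_surface_temp, atmos_shape)
    return fluxes_on_ocean_grid, sst_on_atmos_grid
```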
And for 20 years they've been talking about this in the literature: We're working towards
plug and play, we're making progress on the whole plug-and-play thing.
And I've sat down with several lead climate scientists over the last year and said, you
know, this isn't working out. You're really not making any progress on this, are you.
What's going on.
And they've all said quite frankly it isn't going to happen and it isn't going to happen
because of the core complexity of the physics. The domain is inherently tightly coupled.
And it's tightly coupled in a way that it is impossible to separate out the different
modules, have different people build them and have them work together without a hell of
a lot of work.
So integrating somebody else's ocean model into your atmospheric model, for example,
requires lots of deep changes in both models to make them work. And they're now
saying you know what, it's finally time that we accepted that and said this notion of a nice
modular architecture just will not work in this domain. The physics of it just won't allow
it.
Now, I don't know if that's true or not, and I don't have a way of validating that as an observation.
Is it impossible for them to have the kind of modularity that we'd expect to see in good
software. Do the physics prevent this. We don't know, and I think that's a great research
project to undertake to find out if this is even possible.
Let me talk about where next. There are a couple of things that we're doing as immediate
follow-ups to this study. One, as I said, one of my grad students is doing a much more
detailed study of defect density. And I'll show you some of his preliminary results in a
second. We want to replicate this study at other modeling centers to see are these guys
unique or do other modeling centers do it differently.
There's one, CCSM, which is built at NCAR in Colorado, which is actually a community
effort. It's the only one in the world that isn't an in-house, one-site development team.
It's a community model.
And so I want to go there and find out, first of all, is it really a community model. And
what we suspect from our initial conversations with them is that it isn't at all. It's an
in-house model in which there's a core team spending a lot of time taking code
contributions from this open source community around North America and spending ages
reimplementing them to get them into this model. We think that's what's happening. But
I need to go and validate that.
And I want to compare the V and V that goes on with these models with other kinds of
simulation models. For example, economics models used in climate policy or other
environmental science policies. Because we also think that the core approach to V and V
that goes on here is kind of unique amongst scientific code.
And one of the reasons I think it's unique is they're leveraging off a huge effort by tens of
thousands of people around the world collecting meteorological data and validating that
data. And it's hard to think of any other scientific discipline that has the level of activity
going on in collecting and validating observational data against which to test the models.
It just doesn't happen in many scientific disciplines.
So that matters a lot to them. It matters that they've got huge volumes of observational
data and it matters that a very large number of people are using that data for all sorts of
things to validate it.
And it also matters that there's lots of teams around the world simultaneously trying to
solve the same problem that they can all compare against one another. And that, again, I
think is unusual in scientific disciplines.
So my grad student, John Pipitone, you can actually see his work. He runs a lovely blog
in which he describes his progress on this. He's taken three different models. One of
these is the Hadley model. I didn't check which one it is. I think it's the bottom one.
But, anyway, it's either B or C. Anyway, he's measured the defect density of these
models by trawling through their bug databases.
And the scale along the bottom here is defect density in defects per thousand lines of
code. The bands are from Norman Fenton's book about code defects, where he classifies anything
less than two as good, two to six as average, and anything above that as poor.
So by those standard code metrics in the literature, all three of these models are good
quality code from a software defect point of view.
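As a worked example of the metric (with made-up numbers, not figures from the study):

```python
# Worked example of the defect-density metric with hypothetical inputs.
def defect_density(defects: int, lines_of_code: int) -> float:
    """Defects per thousand lines of code (KLOC)."""
    return defects / (lines_of_code / 1000)

def fenton_band(density: float) -> str:
    """Fenton's rough bands: < 2 good, 2-6 average, above that poor."""
    if density < 2:
        return "good"
    if density <= 6:
        return "average"
    return "poor"

d = defect_density(defects=450, lines_of_code=830_000)   # invented model size
print(f"{d:.2f} defects/KLOC -> {fenton_band(d)}")       # 0.54 defects/KLOC -> good
```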
But he showed this to several software engineers to say, okay, is the code good quality,
and they'll say, well, you know what, these numbers don't actually tell me anything
because it still boils down to what you call a bug.
And because of this attitude these scientists have, which is appropriate for their domain --
that many things that might in other domains be thought of as defects in the code are just
acceptable imperfections -- maybe we're simply not counting enough things as defects.
He's doing another study with static analyzers, trawling through their Fortran to find
static analysis problems and see how that compares.
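As a trivial illustration of the kind of check a static pass can make over Fortran -- a toy example, not the analyzer used in his study -- one can flag source files that never declare IMPLICIT NONE, a classic source of silent typing mistakes:

```python
# Toy static check over Fortran source: flag files with no IMPLICIT NONE.
# Purely illustrative; the study's actual tooling is not described here.
import re
import sys
from pathlib import Path

def missing_implicit_none(source: str) -> bool:
    """True if the source text contains no IMPLICIT NONE statement."""
    return re.search(r"^\s*implicit\s+none\b", source,
                     flags=re.IGNORECASE | re.MULTILINE) is None

for path in Path(sys.argv[1] if len(sys.argv) > 1 else ".").rglob("*.f90"):
    if missing_implicit_none(path.read_text(errors="ignore")):
        print(f"{path}: no IMPLICIT NONE found")
```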
And he's also gone around and interviewed a whole bunch of climate scientists to talk
about this issue of bug versus feature, when does a bug become a problem or not. And
he's teased out lots of factors, factors to do with momentum. Stopping to fix this bug will
slow down the science, it will break repeatability, it will break a whole bunch of things.
We're going to have to go back and do a whole bunch of experiments again, so we just
accept the imperfection and call it a feature.
The design of the model and funding issues matter. And then operation. How the bug
affects various operational factors matters.
And these different factors come into play at different points in the process. Whether
we're in development stage or scientific experimentation stage matters. It matters when
the bug was found.
What time should I wrap up?
>> Rob DeLine: Anytime.
>> Steve Easterbrook: Let me say one more thing, and that is just to put this study into a
broader context. So where I started on this was this question of can we help climate
scientists. And the reason I was asking that question was because I wanted to ask a
broader question of what should computer scientists in general be doing in response to
this whole issue of climate change. What do we bring to the table that might help given
that this is a societal grand challenge.
And having studied this group with the follow-up studies that we're planning to do, we've
started to come to the conclusion that that's not where computer science as a
discipline can have the biggest impact.
And let me, then, just in two or three slides try to explain that, and then I'll stop and we'll
do questions.
So here's the system. Okay. We have emissions of greenhouse gases leading to changes
in concentration in the atmosphere which affect the climatology of the planet which have
downstream effects on a whole bunch of other things -- marine biology, agronomy,
ecology and so on -- which cause a number of impacts from climate change. There's all
sorts of change in the geochemistry which lead to other changes, and there's feedback
effects, and there's all sorts of physical things going on in a very, very complex coupled
system.
And all the scientists that I've talked to are -- you know, they're working in some of these
squares, they're working in climatology or one of the closely related disciplines.
This is very, very complex science. What they have to understand are very, very
complex physical processes. But nearly all of them say you know what, this science is
done. We understand the climate system well enough that this isn't the important
question that society has to face anymore.
The important question is the other part of the system, the part of the system where
impacts lead to changes in public opinion, to discussions in the media which feed into
policy, which is affected by industrial lobbyists, where policy affects what goes on in the
economy and what's happening in the economy drives the emissions and so on.
And all of these purple blobs are very poorly understood. It's like this part of the system
we know really well, this part of the system we really haven't a clue about, and this part
of the system down here is the stuff that matters. This is what we have to fix.
Because the argument is that humanity has stepped in and taken control of the planet
where for millions of years the planet had a whole bunch of natural control systems on
the climate and natural feedbacks that just kept the climate stable. Every so often it'd
flip, like from a glacial state to an interglacial state.
But pretty much in any of those periods it was stable because of a number of feedbacks
that kept this a stable system.
We've now perturbed this system so much that whether we like it or not humans are now
managing the planet and we don't know how to do it. And the reason we don't know how
to do it is because we're getting all this stuff down here wrong.
The discussions in the media are just ill informed. There are too many people injecting
misinformation in the media. There are too many people that just don't understand the
physics of what's going on, so they don't understand the urgency of the problem, the time
scales in which we have to do things and so on.
So our observation is actually computer science has a huge role to play down here in just
understanding these systems and building tools and visualizations to make people just
aware of what this system is and how it operates and what we know and how we know it
from the various different physical sciences.
So our observation is, you know, computer scientists have to step up to the plate, and we
as a discipline ought to be able to respond in a systematic way in saying here's what
computer science should do.
So there's these beautiful reports. I mean, this is an IPCC report on the physical science.
It's a huge, long report. And it summarizes everything we know about the physics of
climate change.
There's the Stern report which is a beautiful analysis from an economic point of view.
There is an APA report on the psychology of climate change, how is this affecting
people's ability to understand the future, to understand how they're going to adapt their
lives to behavioral changes that have to happen if we're going to change the way we live
our lives.
So there's a huge amount of psychology research that has to go on.
Sociologists have put together a report saying this is a sociological problem of how
people come to understand the social epistemology of climate change, how do people
come to understand what they understand about climate. And how can we fix that.
And so I said, well, where's the computer science version of this, where's computing as a
discipline stepping up to the plate and saying what should we be doing. And the closest
that I could find was this, which I thought was a very disappointing response.
So we're starting to map out what we think is an appropriate disciplinary response by
computer science as a whole to the challenge of climate change. And we've held a series
of workshops. There was an initial workshop at ICSE last year. We did a workshop at
OOPSLA in Orlando last month just trying to map out the agenda.
And this is my latest map of where computer science can make a difference. So there's green
IT, green software, energy-saving devices: making everything that's controlled by
software as energy efficient as possible and building that into the software.
The study I described today fits into here. I'm kind of calling it computer-supported
collaborative science. How can we help the scientists do what they do better using good
e-science tools. So e-science is the other label for that box.
And then this is the box that interests me the most. Maybe I should put some detail in
here to show you what I have in mind. Software for global collective decision-making.
This is the thing that humanity currently does disastrously badly right now. Global
collective decision-making. And that involves getting the information to the
decision-makers at the point that they need it in a form that's useful. And maybe I should
just leave that as that's it. That's the problem.
And that involves lots of interesting tools, visualizations, information systems, access to
datasets, open collaborative approaches to building decision support tools.
Anyway, you've got the idea. So I should stop there and take questions.
>>: How did [inaudible] Fortran [inaudible] 77 [inaudible] same level [inaudible].
>> Steve Easterbrook: Yep, yep. And there's still a lot of Fortran 77 code in their code
base. There's been a systematic attempt to bring it up to Fortran 95 which had modules.
And, yeah, this is a problem. They solved this the same way they solved all the other
problems, lots and lots of informal communication across the lab.
Yeah. I probably should have a more detailed answer for you. There's been a systematic
attempt to use code modules when they became available in -- I think it was -- what was
it? Fortran 89? Or was it 90? I don't remember which one it was when they brought in
code modules. So they're doing some of that now. And they want to be doing more.
>> Rob DeLine: Any more questions? All right. Let's thank Steve.
[applause]