>>: It's my pleasure to introduce Murali Haran, who is a faculty member in the
Department of Statistics at Penn State University. He's visiting the University of
Washington on sabbatical. And today he's going to be talking to us about Gaussian
processes and implicit likelihoods.
>> Murali Haran: Thanks very much, Chris. Thanks for the invitation. So thank you all
for showing up. I'm not sure how off the wall this topic is here but there's quite a variety
of interesting work going on in this place, so hopefully this will be interesting to several
of you.
All right. So I'm just going to jump right in, rather than start off by rereading the title. So
the topic is dealing with complicated scientific models. And just brief motivation for this
work.
So the scientists -- this all started because a few different scientists came to talk to me
at Penn State because they needed help with their Monte Carlo methods, and I've
worked quite a bit on Monte Carlo methods, so they came to talk to me about Monte
Carlo methods. But then it turned out that in solving their problems, Monte Carlo was a
small part of it. So I had to learn all this new stuff. So this is what this talk is about, the
new stuff that I had to learn because these scientists came to me with their difficult
problems.
So that's -- the problems that they come with, they are often interested in sort of the
mechanisms underlying physical phenomena. So these are -- this is to draw a
distinction between their models and purely statistical models where you're trying to just
fit data.
So but it turns out that they're also useful for predictions and projections, and the reason
why I make this a separate bullet of course you think models are useful for predictions
and projections, but it's worth pointing this out because these models have mechanisms
and physics built in, and so you could think about doing things that you would ordinarily
not do in statistical models, which is you don't want to extrapolate statistical models very
much.
But with these models, maybe you're okay with extrapolating. And so it's really critical
to work with a model provided by the scientists. And I have this up here more for
statisticians than anybody else, because they often complain saying this is unrealistic,
why don't you just do blah, you know, do this cool statistical model. And for these two
reasons, because you want the -- you want the actual physics to help you do
extrapolation and you're actually interested in the underlying physical phenomena, you
need to work with the models they provide, and then maybe you will build statistical
layers on top of that. Okay.
So these scientific models are numerical solutions of mathematical or deterministic
models or stochastic models that reflect scientific processes. And often these
models are translated into computer code where you study the simulations of the
physical process at many different parameters or initial conditions.
Okay. So what are some of the challenges? Often these models and simulators are
very computationally expensive. So for the climate scientists I work with, it may take two
weeks, maybe a month, to run some of the more expensive models. And it's also not
automatic. Every new run there's a PhD student in geosciences who is tinkering with
things to make sure things work right.
And it's not possible to write closed-form expressions for many of these models, so I
cannot write down any mathematics that describes it. It might be a system of a few
hundred or a few thousand differential equations. And you cannot translate it into math.
The likelihood function may be expensive to evaluate. So if I'm talking about the
stochastic model, maybe I can write down the model, but -- and it looks nice, but it may
be expensive to evaluate. So if you want to do maximum likelihood or Bayesian
inference, this is problematic.
And then this is sort of an obvious statement, but it's worth reminding people that for
any initial setting or calibrated values of the parameters, the model is not going to
reproduce reality, and you need to make an adjustment for that. So that's model
discrepancy.
So this is why the word implicit is being used here. So the likelihood is often implicit,
which means you don't actually -- literally you don't have a likelihood function because
it's a deterministic model and there's no probability model, so there's no likelihood. Or
there is a probability model and there is a likelihood, but it's computationally intensive. Or there's other
problems with it, and so you treat it as implicit. It's there, but you're not going to actually
deal with the likelihood directly.
So I'm going to talk about two examples. Depending on how things go, mostly talk
about this, and then briefly talk about the second project. So the first one is -- was
brought to me by climate scientists who have these models that they use for projecting
the behavior of global ocean circulation systems. And the other one was brought to me
by these disease dynamics researchers at Penn State who are interested in seeing how
infectious diseases spread.
This is deterministic, and this is stochastic.
Okay. So just a little bit of the science underlying it. If you've seen Al Gore's movie,
maybe you're already familiar with this. The Meridional Overturning Circulation. All you
really need to know about this is what the red arrows here tell you: that sort
of the warm waters from the equator are moving out towards the poles and then the
colder water from the poles are moving back to the equator. And so there's this sort of
very, very, very slow mixing that's occurring.
And what you need to take away from this is that if that mixing doesn't continue the way
it has, it's -- it somehow will disturb the equilibrium that the climate system has right
now.
So essentially if this mixing slows down or collapses is the word they like to say -- that
they like to use, then this is going to start causing dramatic changes to the climate say
of northern Europe, among other places.
So this is called the Atlantic Meridional Overturning Circulation, and this is
key to maintaining the climate system. So the collapse of this may result in dramatic
climate change. And so this is a very complicated model that they have to try to model
this MOC. And one of the things that's -- there's lots of stuff that needs to be tuned.
You know, think of them as inputs, but I'm just going to refer to them as parameters,
because it turns out that both the scientists and the statisticians refer to them as
parameters. So I'm going to refer to them as parameters henceforth.
But think of them as sort of things that you calibrate in your model. Kv is one of them
and Kv is a particularly important parameter that influences how these models work. So
Kv quantifies the intensity of vertical mixing in the ocean. So you can see how that
might be related to how the circulation system behaves.
And it turns out that this thing cannot be measured directly, so you have to rely on
indirect information. And this is -- these are observations. So this is what one would
call data in the usual sense. Observations of two ocean tracers. So these are spatial
fields, information gathered literally ocean liners going off and dropping things into the
ocean and measuring things.
So these are the two tracers, C-14 and CFC-11. This provides some information about
the mixing. And then this is sort of -- this is climate model output not typical -- our
typical notion of data because every time I feed an input I get the same value out. It's
deterministic, right? So for different values of Kv as input I'm going to get different
values of the spatial field as output from a model. And so these guys will look -- they
are trying to mimic these guys. Okay?
And of course they're going to be different. So now this is no longer just the two -- this
is a spatial field Z1, and this is a spatial field Z2. Now I have spatial fields Y1 and Y2
that correspond to them, but those are not functions of the input Kv. So far so good, I hope.
Okay. And just to give you an idea -- this is a very, very crude image plot. But just to
give you an idea of what we're dealing with, this is a spatial field. This is a slice of it for
CFC-11, one of the two tracers, at different settings of Kv: 0.05, 0.2 and 0.5. And these are
the actual data. Okay? No processing, just some smoothing. It's a very crude picture
just to give you an idea that well, these things vaguely look like the real data. Nothing
quite matches it, but what these scientists would do before they were talking to
statisticians or doing something more elaborate would be essentially finding something,
some output that was most closely behaving like this data and then declaring that to be
the winner in some fashion. Right?
>>: These are models that are fit to the data, these are models fit to other physical
phenomena.
>> Murali Haran: That's right. That's right. Okay. In all honesty, they do stare at some
aspects of the data when they're building these things. But your answer sort of the -- I
mean, what you're saying is how they want it to be. But at some level, they're --
>>: [inaudible].
>> Murali Haran: They peek. At some level they're aware of what's -- of what the data
looked like.
>>: Yeah.
>> Murali Haran: Yeah.
>>: And so are you saying that to see how closely they match, you may eyeball it or --
>> Murali Haran: No. If you want to be very crude about it, you might do something like
root mean squared error. And then whichever one has the lowest is declared the winner.
Or if you -- if they -- they recognize that they want a distribution over the Kvs, so, you
know, they're smart people. So they will do something that's very ad hoc like treat those
distances as some way of getting -- of probability distribution even though it has nothing
to do with probabilities. And then they get something, some kind of distribution over the
values of Kv.
But they're not happy with it, which is why they came to talk to me.
Okay. And so this is a cartoon. And in machine learning I know that you guys have seen
much cooler cartoons than this. But this is just to remind you of what we're doing. So
these are the different inputs -- these are different values of Kv. The green ones are
the ones we've actually run our model at. And F of X1 and F of X2, these are now spatial
fields. Actually each is a bivariate spatial field because I'm looking at two tracers here.
So you can see what we're -- what I'm dealing with. I have an input and the output is
two spatial fields. Okay. So sort of a complicated output. And X star is the one where I
haven't run the model, and F of X star is what I want to predict. And this, if you want to
just think of it -- this is sort of a non-parametric regression problem, but the output is
functional so it's sort of a functional data problem.
But that's sort of underlying it. If you're just interested in interpolating. But now we're
interested in doing statistical inference based on this, right. So I'm working towards
building a tool to do statistical inference.
So I'm going to fit the emulator to a training set from the complex model. So I've run the
model at different settings. And I'm going to fit this thing called an emulator, and the
emulator is actually going to be stochastic. Remember that the model is deterministic
so it's a bit peculiar that I'm doing this. But I'll explain to you why.
For starters, by doing -- having the stochastic interpolator, I have in essence a fast
approximate simulator. So if I put in a new input, I can actually simulate the output.
And because it's stochastic, I can actually get uncertainties associated with the
prediction. So there's going to be greater uncertainty where there's less training data.
Okay?
And this is -- this actually -- my climate scientist friends love this, because they're
physicists by training and physicists think about uncertainty in this way. They think of
uncertainty as, you know, if they got all their physics right they don't need probability
and statistics. That's sort of how they think. I'm exaggerating a little bit. But that's sort
of how they think.
And so they like this expression of uncertainty. It's going to be more uncertain when
you're far away from the values where you have training data. Okay? And of course
Tony O'Hagan, a researcher in the UK who has thought a lot about these things, said that
without any quantification of uncertainty it is easy to dismiss computer models. So this
is another nice reason why you want to actually quantify uncertainty probabilistically.
And then, finally, this provides a probability model, which is important because I'm now
going to go away and do statistical inference based on this, right? If I don't have a
probability model, I cannot do statistical inference.
Okay. So now, I'm going to use the Gaussian process as a way to do this interpolation
and to build this approximate probability model. So for those of you who work -- I know
some people in machine learning live with Gaussian processes. How many of you here
use Gaussian processes much? Anybody?
>>: [inaudible].
>> Murali Haran: Okay. Okay. So you can zone out for a little while if you like. So the
Gaussian processes are very useful models for dependent processes. And this is
actually how I first got exposed to Gaussian processes -- through spatial models and
time series. And it turns out that they're also very useful for modeling complicated
functions. And the key idea -- this is peculiar if you're not used to thinking about
Gaussian processes in this way, because I used to think about them as modeling
dependence -- is that they're very nice for modeling complicated functions.
And the idea is that the dependence sort of adjusts for non-linear relationships between
input and output. And this is still peculiar but maybe my toy examples will help you in a
little bit.
But before I get to the little examples, just some quick review of what Gaussian process
models are. Or overview, for those of you who are new to this.
So suppose I look at a process at location S in some domain D. Just think of D as
being -- why not just think of it as two-dimensional Euclidean space -- so that the
process Z at location S has some mean function, mu of S, describing sort of the mean of
the process at location S, plus W of S, which is the spatially dependent error process.
So Z of S equals mu of S plus W of S.
So now this looks an awful lot like a linear regression. The only difference is that
everything has a -- an index associated with it. That's the only difference so far, right?
There's an S associated with it. This thing is your mean function and this is your error.
If you're doing linear regression, this would just be independent identically distributed. If
you're doing simple linear regression.
But an interesting thing to notice is that this location S may be physical, so it might
actually be a geographical index. Or it may be from input space. So if you're in a 2D
input space it might be parameter theta 1 here and theta 2 over there. And what we do is
we model dependence among the spatial random variables by modeling this W of S as a
Gaussian process, and a Gaussian process is just an infinite-dimensional process such
that if you take any finite collection of points S1 through SN in this, say, two-dimensional
Euclidean space, then the collection of these random variates is multivariate normal.
Okay. And I'm going to assume that I'm using a valid covariance function so everything is
positive definite and so on. And here's just an example, just to give you an idea of how it
works. The covariance between the process at location Si and the process at location Sj
is maybe kappa times a function, say exp of minus the distance between Si and Sj divided
by phi, that decays as I increase the distance between these two things, right?
And so this just gives you a quick idea of how this works. So as I increase the distance,
the dependence gets weaker. And there are two parameters, phi and kappa that are
positive that I can fiddle with to control how the dependence decays.
So two things close to each other, lot of dependence, two things far away from each
other, weak dependence. And how it decays is determined by these parameters.
Okay. And so now this vector Z -- Z of S1 through Z of SN, which is going to be my
spatial field -- given the parameters theta that determine the covariance and beta that
determine the mean function, is going to be multivariate normal, for any selection of
locations S1 through SN. So a finite collection of these random variables.
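As a rough sketch of the covariance just described -- not the speaker's actual code -- here is how one might build an exponential covariance matrix and draw one realization of the process in Python with NumPy; the locations and the values of kappa and phi are purely illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

def exp_cov(S1, S2, kappa, phi):
    """Exponential covariance: kappa * exp(-||si - sj|| / phi)."""
    D = cdist(S1, S2)                       # pairwise distances
    return kappa * np.exp(-D / phi)

rng = np.random.default_rng(0)
S = np.linspace(0, 20, 50)[:, None]         # illustrative 1D locations
kappa, phi = 1.0, 3.0                       # illustrative covariance parameters

Sigma = exp_cov(S, S, kappa, phi)           # N x N covariance matrix
mu = np.zeros(len(S))                       # constant (zero) mean function
Z = rng.multivariate_normal(mu, Sigma)      # one draw from the GP at these locations
```

Larger phi makes the dependence decay more slowly with distance; larger kappa scales the overall variability.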
Okay. So once I've specified this model -- right now I'm just talking about Gaussian
processes and Gaussian process modeling, okay -- inference and prediction can be done
by maximum likelihood or Bayes. Maximum likelihood just means I'm going to maximize
that likelihood with respect to theta and beta, and Bayes means I'm going to put a prior
on theta and beta and then do Markov chain Monte Carlo or something else to learn
about the posterior distribution, pi of theta and beta given Z.
This is a Gaussian process. But the key is how I do predictions. So once I fit the
model -- so I have my data, I fit a model, and I have learned about theta and beta -- I can
now predict at new locations, let's call them S1 star through SM star, and this Z star is
going to be the collection of predictions at S1 star through SM star. And under the
Gaussian process assumption, these two guys, Z and Z star, have a joint multivariate
normal distribution specified in this fashion; the right-hand side is just notation for a
multivariate normal.
And once I have these parameters, I can tell you everything there is to know about that
multivariate normal distribution.
And if you go back and do very basic, you know, first-year graduate level -- first few
weeks of graduate level -- multivariate normal theory, you know that Z star given these
guys is multivariate normal, and you know the mean and covariance.
And if you're doing Bayes, it's just one step beyond that: you're going to average over
the uncertainty due to those parameters. So essentially what's the take-home message
here? Once you fit the model, you now have a model for the random variates at any
other finite set of locations in the same space.
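Continuing the NumPy sketch above, here is that conditioning step written out; the formulas are the standard conditional multivariate normal (kriging) results, and the prediction locations are again illustrative rather than anything from the talk.

```python
# Predict Z* at new locations S_star, conditional on Z observed at S (zero mean here).
#   E[Z* | Z]   = C(S*,S) C(S,S)^{-1} Z
#   Cov[Z* | Z] = C(S*,S*) - C(S*,S) C(S,S)^{-1} C(S,S*)
S_star = np.linspace(0, 20, 200)[:, None]

C_ss = exp_cov(S, S, kappa, phi)
C_xs = exp_cov(S_star, S, kappa, phi)
C_xx = exp_cov(S_star, S_star, kappa, phi)

alpha     = np.linalg.solve(C_ss, Z - mu)                 # C(S,S)^{-1} (Z - mu)
pred_mean = C_xs @ alpha
pred_cov  = C_xx - C_xs @ np.linalg.solve(C_ss, C_xs.T)
pred_sd   = np.sqrt(np.clip(np.diag(pred_cov), 0, None))

# 95% pointwise prediction intervals: widest far from the training locations
lower, upper = pred_mean - 1.96 * pred_sd, pred_mean + 1.96 * pred_sd
```

In practice the covariance parameters would be estimated by maximum likelihood or averaged over in a Bayesian analysis, as just described, rather than fixed.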
Okay. And this is just to give you an idea of how it works for dependent processes. So
these black dots are just a simulation from an AR(1), an autoregressive time series of
order 1. And if I ignore the dependence and just fit a regression without dependence in
the error, my interpolator would be these green guys. So this thick green one would be
sort of my best prediction, and then the dashed ones are the prediction intervals.
So you see it's missing a lot of the structure of my data. But if I fit the Gaussian process
to this, so the blue and the red they're Gaussian processes with different kinds of
covariance functions. And I'm just putting it in there to tell you that you have a lot of
flexibility with how smooth you can make these processes.
So the red one is in some ways the better fit, right? The blue one oversmooths,
depending on what your assumptions are about the process I'd say that the red one is a
nicer fit. And it sort of picks up all the wiggles in the data. Okay?
And so ignoring dependence you do poorly, you put in the dependence you do better.
This is kind of obvious because the original process I used to simulate this has
dependence in it. So this is a little less obvious except for people who do this in
machine learning. You can actually use the stochastic model to do something nicely for
complicated functions. So take these two toy examples. So the sine curve and sort of a
damped sine curve and my black dots are essentially, you know, correspond to
input/output pairs for this function that I've kept hidden from my model.
So I pretend that I don't know the function, and I see X-Y pairs from the sine curve and
the damped sine curve. Now I fit a Gaussian process to both of these. So here's the
model I'm going to use -- the same Gaussian process model for
both of these. This is a constant mean, which seems like a ridiculous thing, right? I'm
assuming that there's a constant mean. Plus W of X. And this is a Gaussian process
over this region zero to 20. So that's my 1D spatial domain.
I just use this very, very simple model. And the beauty of this is that I didn't need to
know anything about the underlying function that was the non-linear, linear functions
that were used to produce these two things and it just does the right thing. It adapts
and it produces a nice interpolation and it gives you prediction intervals. All right? And
you can sort of get an idea of how the dependence helps. Essentially if you have a
point here and you have a data point here and a data point here, you're trying to figure
out -- you're trying to make a prediction somewhere in between, the dependence tells
you that, well, it's going to look an awful lot like this and an awful lot like this, and it's not
going to look as much as the guys that are further and further away. So the prediction
gets pulled towards that points that are nearby. And that automatically deals with the
non-linearities. And this is the trick that one uses to do this emulation of complex
computer models. And it works very nicely. You can imagine -- you know, this is a toy
example.
So if we were just sitting and looking at this, we would go, oh, let's fit something that's a
sine or a damped sine. But imagine a bivariate 2D spatial field with very weird, possibly
non-linear relationships, where you don't really have ideas for how to fit this.
But this just seems to work remarkably well.
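As a rough sketch of this toy exercise -- not the actual setup from the slides -- here is how one might do it with scikit-learn; the damped-sine "truth", the kernel choice, and the noise level are all illustrative stand-ins.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF, WhiteKernel

rng = np.random.default_rng(1)

# hidden "computer model": a damped sine on [0, 20], kept away from the GP
def hidden_model(x):
    return np.exp(-x / 10.0) * np.sin(x)

# a handful of input/output pairs, pretending we never see the function itself
X_train = np.sort(rng.uniform(0, 20, 12))[:, None]
y_train = hidden_model(X_train).ravel()

# constant mean plus GP: constant * RBF kernel, with a tiny nugget for stability
kernel = ConstantKernel(1.0) * RBF(length_scale=2.0) + WhiteKernel(noise_level=1e-6)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X_train, y_train)

# interpolate with pointwise uncertainty; intervals widen away from training points
X_new = np.linspace(0, 20, 200)[:, None]
mean, sd = gp.predict(X_new, return_std=True)
lower, upper = mean - 1.96 * sd, mean + 1.96 * sd
```

The point of the sketch is only that the same simple constant-mean GP adapts to whichever non-linear function produced the training pairs, without being told what that function was.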
Okay. So the summary of the inferential problem is that we're interested in learning about
this parameter theta. So to take you back to where we started, theta is Kv. Remember
the input that I said had a lot to do with the prediction of the MOC, this circulation
system. We're interested in learning about Kv because if I tell you something about Kv,
I can tell you something about what the MOC is going to be, and I can tell you
something about the uncertainties associated with that.
So the statistical problem then ends up being the following. So you're going to take the
model output, which is a bivariate spatial process at each theta. So I'm going to say
that the psi 1, psi 2, psi K are the K different spots at which I ran this model, and I have
-- this is a spatial process, this is a spatial process and so on. That's sort of my
computer model output. And then I have the actual data, which is spatial process 1,
spatial process 2.
Okay. And so what can we learn about theta given both these pieces of information?
And so I use the Bayesian approach to do this, and it turns out that it's really the right
way to proceed here. Because there's usually real prior information about theta. So the
scientists know something about the physics and the parameters there.
And the likelihood surface for theta is often poorly informed, or it can be multi-modal.
And then there are issues with identifiability sometimes if you have
high dimensional input space. And in any case, it's just nice to have access to the full
posterior distribution. If you do sort of an optimizer and try to learn about the shape of
this thing by gradients and so on, it's just awkward. You don't learn as much as you
would like to.
But having access to the full posterior distribution is very helpful. And if theta is
multivariate, it's important to look at bivariate and marginal distributions. And it's just
a lot more convenient to use a Bayes approach. So I'm advocating a Bayes approach
as a very practical way to proceed, rather than just, you know, giving you philosophical
reasons why you should do it. There are of course nice philosophical reasons to use it.
And then this thing is amenable to a hierarchical specification. And this is useful when
I'm specifying the multivariate spatial process.
Okay. So here's the two stage approach. I might skip over some of the details of how
this is done. And even my slides skim over the details. But I'm hoping that this outline
is completely clear to everyone. So here's how we split up the inference problem.
We first think of the problem as a problem of figuring out a probability model for the data
using the computer model. All right? Because right now if I -- there's no direct
connection between the input, the Kv, that parameter that we're interested in, and the
data that we will observe. I don't have a probability model connecting them.
My only hope is to use the computer model runs as sort of a surrogate, right? So I'm
going to find a probability model for Z using the simulations. And I model the
relationship between Z and Y -- sorry, Z and theta via this flexible emulator for the
model output. Okay?
So this eta of Y theta, this is now my Gaussian process emulator. Okay? So this is
telling me if I plug in a value of theta what kind of computer model output I will get. And
then if I add in a discrepancy term, so this is -- we spent a fair bit of time fussing over
this as well, I'm acknowledging the fact that even if I put in the theta value that's sort of
the perfect Kv input, the output is not going to match the observations. And it's not just
going to be IID noise; IID noise would be this term. There may be systematic
discrepancy. For example, towards the polar regions, these models may generally have
a positive bias and things like that. So you want to actually allow for those kind of
discrepancies. So you need to do something fairly flexible for this discrepancy term as
well.
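As a rough sketch of the structure just described -- observation equals emulator plus discrepancy plus noise -- here is how one evaluation of the resulting log posterior might look. The emulator object and its predict method are hypothetical placeholders, and the discrepancy covariance, noise variance, and prior are purely illustrative, since the actual hierarchical, bivariate specification in the talk is much richer.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import multivariate_normal

def log_posterior(theta, z_obs, s_obs, emulator, log_prior):
    """Unnormalized log posterior for the calibration parameter theta.

    Model sketch:  Z(s) = eta(s, theta) + delta(s) + epsilon(s)
      eta     : GP emulator of the computer model output; 'emulator' is a
                hypothetical fitted object returning a mean vector and a
                covariance matrix at the observation locations s_obs
      delta   : GP discrepancy term (exponential covariance, for illustration)
      epsilon : iid observation error
    """
    eta_mean, eta_cov = emulator.predict(s_obs, theta)   # emulator mean and its uncertainty

    d = cdist(s_obs, s_obs)
    delta_cov = 0.5 * np.exp(-d / 5.0)                   # illustrative discrepancy covariance
    eps_var = 0.1                                        # illustrative iid noise variance

    total_cov = eta_cov + delta_cov + eps_var * np.eye(len(z_obs))
    log_lik = multivariate_normal.logpdf(z_obs, mean=eta_mean, cov=total_cov)
    return log_lik + log_prior(theta)
```

A Metropolis-Hastings sampler would then evaluate this at proposed theta values to draw from the posterior, which is the easy part described next.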
But once you have done this -- so this is the hard work, this is really hard work, figuring
out how to build this approximate probability model -- once you've built the probability
model, then it's easy. Right? Easy modulo computing. It's easy because once you have
the probability model, you have a likelihood. And once I have a likelihood, I can bring in a
prior distribution and I have a posterior distribution. Yes?
>>: [inaudible] something that I'm missing.
>> Murali Haran: Yes?
>>: [inaudible] models and so the computer models are very complicated. And I would
like to believe that they are complicated with a reason, right, [inaudible] very
sophisticated.
>> Murali Haran: Yes.
>>: System. But now you are building all kinds of naive approximation for this model.
So is the conclusion that the models are complicated and can be simplified [inaudible]?
>> Murali Haran: So this is a simple approximation but it's very flexible. So the
assumption underlying this is that it's an approximation that assumes that there's a
certain -- so let me summarize the question here.
The question is the original computer models are very complicated. You have replaced
it with a much simpler model. So the idea is -- so the question is, is the simpler model
actually saying that the original computer model isn't nearly as complicated as people
think? And no, the computer model is still very complicated. But the idea is it's -- think
of, you know, higher dimensional version of the damped sine curve or something a little
bit more elaborate than that, and a more complicated function of course. But you can
still do statistical interpolation quite nicely with this -- this -- these models, even though
they look simple, they're remarkably flexible.
So the big assumption here is that there's a certain kind of smoothness. As I'm moving
an input space, the output is going to change smoothly. So that's sort of the
assumption. If there's a certain degree of smoothness that you're willing to assume,
then you have enough information based on the computer model runs that you can
actually replace it with this statistical approximation. That still does a reasonable job.
That's what we're getting at.
>>: Is a fair way to characterize this that it's not so much, you know, maybe
emulator is the word that's making this a little bit confusing. I had a similar perception to
running -- there may be a different way to characterize it. You said that you're actually
taking a whole bunch of different model runs from the computer model.
>> Murali Haran: Yes.
>>: And you're actually using GP approach to your interpolation between those based
on the parametric space that you cared about. Either way that gives you uncertainty
around, you know, how far are you away from interpolating planes?
>> Murali Haran: That's exactly it. That's all of it.
>>: The emulator threw me a little bit because I was like oh, you're making a
replacement for the model.
>> Murali Haran: But in a way when you're doing an interpolation, you are making a
replacement for the model.
>>: That's true.
>> Murali Haran: Right? So everything you said is compatible with what I've been
saying here. Yes?
>>: I wanted to ask -- so you use the computer model to generate these -- this kind of
underlying data and effectively you want the -- is it true that you want to use the GP to
model the residual between the computer runs and the data you have observed, effectively?
>> Murali Haran: Effectively you -- well, not quite. I'm also -- well, that's -- that -- the
residual modeling, that's going into the discrepancy function here.
>>: Right.
>> Murali Haran: And it turns out that that's also a Gaussian process, which is why I
answered in the affirmative first.
This is also Gaussian process, but I'm not going to get into that detail. You need
enough flexibility to model the residuals as well. But before I do that, I need an
approximate model to the original computer model, right? And notice even if -- first of
all, this works a lot better than you might think. It's a very simple -- it's a very simple
looking approximation. But it works quite well as an interpolator. And what's more,
because it's stochastic, if you're far away from a point where you actually have a run,
you're sort of honestly characterizing the fact that I have more uncertainty with my
interpolator at those points.
So this -- this works out well. And then we do sort of these perfect model experiments
where we -- we hold -- you know, do different kinds of cross-validation type exercises.
And then we add all kinds of noise in there and we see how it works. And it's quite
remarkable how well this works. Yes?
>>: So if [inaudible] one of the discrepancy between your new model and the emulation
model, this is just you could have as well built a term for [inaudible].
>> Murali Haran: Sorry. It's not -- it's not --
>>: [inaudible].
>> Murali Haran: No, no. So the discrepancy is actually -- what I've done is I've taken
the computer model and I've replaced it with this approximate model. This is modeling
the distance between the approximate model and the data. Not the -- not my
approximate model and the original model. The approximate model and the data.
>>: But is there any reason not to have a term specifically for discrepancy between the
computer model and the data?
>> Murali Haran: No, but that discrepancy is already -- so the computer model and the
emulator, right? That's already built into this guy. Because -- right?
>>: [inaudible] same process [inaudible].
>> Murali Haran: Yes.
>>: [inaudible].
>> Murali Haran: Yes. If you saw -- well, let me see if I -- if you actually have points
that look like this -- I'm going to try to exaggerate this just to make a point.
At this point, my prediction is going to be pretty much that I get -- you know, I know this
computer model exactly, so I might -- my interpolator might look something like this.
And I'm exaggerating this, just to make a point.
So in between the intervals are going to get very large. But the reason that you didn't
see in my toy example is you didn't see it actually hit these points. There are a couple
of reasons. One is that you allow for something called micro-scale variation. But I'm
getting into too much detail here. But -- and also there's some computational
advantages to allowing for error even at the place where you have observations.
Too much detail. But this is to give you an idea of the fact that as I get away from this,
I'm automatically accounting for the uncertainties here already. And then this is the
additional uncertainty that's explaining the gap between the computer model and my
observations and then this is sort of the IID noise on top of that. Yes?
>>: [inaudible].
>> Murali Haran: Sure. No, no, no. I like the questions.
>>: So when you learn the model, right, you know that the prediction would be
somewhat accurate as long as the test distribution is similar to the training
distribution.
>> Murali Haran: Yeah.
>>: So when you train the first [inaudible] you can fit in data to the -- to the complex
simulator [inaudible].
>> Murali Haran: Right.
>>: But your observations for the delta are unlimited by reality.
>> Murali Haran: Yes.
>>: But actually what you are trying to predict is what happened, what situation
changes. So you are trying to see what would happen if the simulation [inaudible] is far
from [inaudible] would be your estimator.
>> Murali Haran: So the question was about trying to predict in sort of a domain that's
outside where we're at, right? So at this point, that's not how you want to think about it.
At this point, we're thinking about producing inference for theta that allows me to
produce a model that looks like what I have today.
Once I've done all of that, I'm going to use the physics of the model to extrapolate. So
I'm -- yeah?
>>: [inaudible].
>> Murali Haran: Correct.
>>: [inaudible].
>> Murali Haran: Correct.
>>: [inaudible] and then you use a full blown simulator to [inaudible].
>> Murali Haran: That's right. Calibrate now and then use this to run the model
forward. That's why you need the deterministic model. If you used statistical
interpolation of any kind for that, it would just be junk -- you're not doing interpolation
there, you're doing extrapolation -- [inaudible] so it's a great question.
Okay. So -- all right. I think I'm going to skip over these details because we've had
enough interesting questions. I want to get through to the -- and this is sort of a high
level view of it. But essentially that first step that I said where you're doing an emulation
of the two spatial processes, we do a lot of stuff to allow it to be flexible enough to
actually interpolate the process as well. So I'm just going to skip over this. This
corresponds to stage one of the emulation. Yes?
>>: [inaudible] stated clearly one dimensional here that [inaudible] one dimension?
>> Murali Haran: Yes. Yes.
>>: And the two dimensional spatial variation.
>> Murali Haran: Correct. Correct. Yeah. So the input is low dimensional here. It's
one.
>>: [inaudible] multi-dimensional by you [inaudible].
>> Murali Haran: Yes. So that's -- those are the details here. So this is the output Y2
and I build this hierarchical model Y1 given Y2. And by doing this in this hierarchical
fashion then I can put in a lot of flexibility to the model. So I modeled the relationship
between --
>>: Between Y1 and Y2 also you have a Gaussian model?
>> Murali Haran: Yes. Well, Y1 and Y2 individually are Gaussian. But Y1 given Y2 --
sorry, Y1 given Y2 is Gaussian and Y2 is Gaussian. And there's a relationship between
the two.
So the -- those are details that I think I'm happy and maybe you are all happy that I am
skipping over.
>>: [inaudible] is that why you treat Y1 [inaudible].
>> Murali Haran: No. That discrepancy is accounted for by the term delta. Yeah.
This thing is mainly to say that the two spatial processes are related. And we need to
somehow allow for the fact that they may be related in sort of non-linear ways.
So going back to this, the first step we fit the Gaussian process model. So fitting it
means I have built this two layered model. And I have lots of parameters. And I've
fiddled with those -- essentially I do a maximum likelihood fit to get nice parameter values
that then give me a probability model for the computer model output. But now I want
something that also allows for the fact that there's additional error that gets me to the
actual observations. And once I do all of that, I can do Markov chain Monte Carlo to get
to the posterior distribution that I'm interested in.
This is the ultimate goal here. Okay? And then there's a lot of computational issues.
So I have to make some decisions on the fly here. So I think -- I think I might skip some
of these things too and just again give you a whiff of a related idea. Because these are
really details that if you're working on this particular project may be particularly
interesting to you. But most of you know that if you're dealing with matrix operations
and the matrices are N-by-N matrices, the operations are expensive. It's all order of
N cubed, right? And if N here in our problem is in the tens of thousands, and I'm
having to do that at every stage of a Markov chain Monte Carlo algorithm, this is
hopeless. So we actually use a reduced rank approach based on kernel mixing.
And some of you -- those of you who work on Gaussian processes know what I'm
talking about. But so essentially you would take a very complicated Gaussian process
model and make the covariance into a patterned, specific kind of covariance structure
that lets you use a couple of identities -- the Sherman-Morrison-Woodbury formula and
Sylvester's determinant theorem -- in order to do your computations in a much faster,
slicker way.
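As a rough illustration of the kind of identity involved -- not the specific kernel-mixing construction from the talk -- here is a sketch of the Woodbury trick for a covariance of the form "diagonal plus low rank", with J knot columns standing in for the reduced-rank basis; all dimensions and values are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
N, J = 2000, 30                      # many data points, few knots
K = rng.normal(size=(N, J))          # N x J kernel/basis matrix (illustrative)
C = np.eye(J)                        # J x J covariance of the knot process
tau2 = 0.5                           # nugget / noise variance

# Patterned covariance: Sigma = tau2 * I_N + K C K^T
# Woodbury: Sigma^{-1} x = x/tau2 - K (C^{-1} + K^T K / tau2)^{-1} K^T x / tau2^2
def sigma_inv_times(x):
    small = np.linalg.inv(C) + (K.T @ K) / tau2          # only J x J linear algebra
    return x / tau2 - K @ np.linalg.solve(small, K.T @ x) / tau2**2

x = rng.normal(size=N)
y_fast = sigma_inv_times(x)                              # roughly O(N J^2 + J^3) work

# check against the direct O(N^3) solve on this small example
Sigma = tau2 * np.eye(N) + K @ C @ K.T
assert np.allclose(y_fast, np.linalg.solve(Sigma, x))
```

A matching determinant identity handles the log-determinant term in the Gaussian likelihood, so the whole likelihood evaluation avoids the full N-by-N solve.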
So where you had order N cubed, you now deal with order J cubed, where J is much smaller than N. And I'm just going to skip
over those identities and just cut to sort of the conclusion. At the end of all of this work
what is our summary? Our summary is we get Kv, a posterior distribution, the blue guy,
is the final product that we get from all of this work. And we have -- you can see there's
a prior distribution, the black one. And then the red and the green come from -- you
know, we fiddled around to see what you can learn from the individual tracers, just one
tracer at a time.
So Kv, it turns out, is very important for calibrating this -- this model and these kinds of
models. But there are other -- there are many other parameters that those of you who
follow climate science discussions or debates if you want to call them debates, this
climate science literature, there's a lot of -- there's a parameter called climate sensitivity
that is a huge subject of research. And, you know, there are lots of science and nature
papers where essentially the conclusion is a plot like this. So -- and our methodology,
you know, extends to working with, you know, any kind of parameter, essentially.
So that, itself, is often an end goal for a scientist. But in this case, we can use that to
build -- to do projections. Okay.
So just to summarize, what we did was we obtained the probability model that
connected the tracers to Kv. And the hierarchical model with Gaussian processes with
patterned covariances is flexible and computationally tractable. And once I have this
probability model I can infer Kv from observations. And then once I've inferred Kv, I can
then say something about the MOC. And we find that the MOC weakens slightly over the
next 50 years. That's sort of the --
>>: [inaudible] somewhat and running the model forward or is this more a Bayesian thing --
>> Murali Haran: So we're doing -- I skipped over that detail. But we're doing
something that's sort of an amalgam for computational reasons and the identifiability
issue. So we do some maximum likelihood is part of it. And then the final inference that
we do -- so this is entirely Bayesian. So the details involve doing some amount of fixing
of parameters at the initial -- at estimates obtained from maximum likelihood and then
coming back and doing Bayes.
>>: [inaudible] nice posterior but it seemed, you know, kind of smooth up at the top, you
know, and, you know, even the blue ones are multi-modal [inaudible].
>> Murali Haran: Yeah. Well --
>>: [inaudible] one point and --
>> Murali Haran: Yeah. I'm not sure that I would -- you know, if you look at posteriors
from MCMC algorithms, this is -- this is about as smooth as they usually get. So I
wouldn't go so far as calling this multimodal. But I think we declare that [inaudible].
>>: [inaudible].
>> Murali Haran: Okay. All right. So let me see. How do I do the time -- I have about
another -- since we started about 10 minutes late, I won't -- I won't inflict myself on
you for too long. But maybe I'll go for another 10 minutes.
>>: [inaudible].
>> Murali Haran: Okay. So I'll just -- I'll just give you an overview of this. This is
another -- it's a very cool problem that -- okay. Cool to me. That we started working on
with infectious disease collaborators. And the reason that I wanted to talk about this as
well is to give you an idea of how the tools that I discuss here may be useful in other
problems as well. And sort of in a surprising way in this problem.
So this is an overview of this slide. The scientist -- the infectious disease dynamics
people at Penn State came to me with this model that they wanted to fit. The difference
between this model and -- so they came to me with sort of a more obviously statistical
problem.
They have a stochastic model. We can write down a likelihood and they have data and
they said we'd like to fit this model. And again, they came to me because I've worked in
Monte Carlo methods. And you think -- so this shouldn't be that hard a problem. But
the -- the issue with our model is -- and their model you can write down very easily the
math -- you know. It's just a -- in one page you can write down the entire likelihood
function. But the problem is that -- and this is a model that you can -- you can -- that's
used to describe how diseases spread. In particular this is used for measles epidemics,
for modeling measles dynamics.
But the problem is that there are thousands of latent variables. And there are things
that you don't observe. For example, the number of immigrants moving from one
location to another: if you have measles and you move to another location, you're
probably going to cause other people to get measles as well. So you don't observe all
that information.
But you need that latent process in there to somehow account for that. And if you -- if
you're stuck with latent variables, lots of latent variables, you know that you need to
integrate those latent variables out when you're performing inference and if you have
thousands of these, this is a hard problem.
And the space-time data set is lovely because it's very rich and very nice from England
and Wales. So it has 519 points, which is great, because you can actually learn about
this disease model. Often the information is sparse. So here, this is great. But of
course it presents some computational challenge. So you cannot just do something
naive and have this work.
And this is to give you an idea of the kind of data. We have many different cities, big
cities, London and this is just telling you the number of disease cases. You know, it's
biweekly data. And these are the smaller cities.
And the reason I'm showing you this is, first, to give you an idea of what kind of data we
have. It's time series in 952 different cities, 952 plots like this. That's the data. And you
can see that it's quite different when you move from big cities to small cities, it's quite
widely different.
And that messes up doing the likelihood inference that we would like to do.
>>: [inaudible].
>> Murali Haran: Yeah. There's a lot of seasonal stuff here. And you can guess what
one of the seasonal things will be. Measles has a lot to do with school kids. And so
when schools are in session then there's a spike and so on.
Okay. So there's some issues here. The stochastic model is expensive -- results in a
likelihood that's expensive to evaluate. It's not very expensive. But it's just expensive
enough that you can't do -- cannot do Bayesian inference easily. And I'm going to skip
over this. Does anybody here know about approximate Bayesian computation? No.
Then I'm going to skip over this.
Essentially there's no other approach for doing this easily, so we developed what we
thought was a very simple solution. We thought we had solved this problem if we do a
grid-based Markov chain Monte Carlo where we discretize the parameter space. Maybe
some of you have used these tricks. So we do a lot of computing in parallel ahead of
time, and then that lets us do maximum likelihood and Bayes; there's a sketch of the grid
idea below. But then the problem is -- we thought we had solved this problem and we
were done, but we fit this model and it doesn't fit the scientifically relevant features of
the data. And then if you do sort of a controlled experiment, where we know what the
input was and we have the output corresponding to the parameters that were used, we
don't recover the parameters that were used. So that's bad news. Right?
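To make the grid idea mentioned above concrete, here is a rough sketch of precomputing the expensive likelihood on a discretized parameter grid in parallel and then normalizing over the grid; the likelihood function, grid, prior, and data are all assumed placeholders, and for simplicity the sketch normalizes over the grid directly rather than running an MCMC sampler on it.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def precompute_grid(log_likelihood, theta_grid, data, workers=8):
    """Evaluate an expensive log likelihood once at every grid point, in parallel."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        vals = list(pool.map(log_likelihood, theta_grid, [data] * len(theta_grid)))
    return np.array(vals)

def grid_posterior(log_lik_grid, log_prior_grid):
    """Discretized posterior over the grid: normalize exp(log lik + log prior)."""
    logp = log_lik_grid + log_prior_grid
    logp -= logp.max()                      # stabilize before exponentiating
    p = np.exp(logp)
    return p / p.sum()

# usage sketch (log_likelihood, theta_grid, data, log_prior are assumed to exist):
# log_lik_grid = precompute_grid(log_likelihood, theta_grid, data)
# post = grid_posterior(log_lik_grid, np.array([log_prior(t) for t in theta_grid]))
# theta_mle = theta_grid[np.argmax(log_lik_grid)]
```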
But the really crucial thing is it does not fit scientifically relevant features of the data. So
what are scientifically relevant features? These are just a couple of examples. These
are things that the biologist told us they would like their models to mimic. So they have
a fitted model. They would like the fitted model to reproduce a couple of these
characteristics. And all I've done here is I have predictions from our model versus the
actual observations for those summary characteristics.
And you can see that the fit is pretty lousy. And so they're unhappy when they see this.
And so how do we solve this? We thought we had solved the problem, but it turns out
that simple likelihood or Bayes doesn't really solve the problem.
So we thought why not go and work directly with the summary statistics, right? Instead
of -- maximum likelihood does not understand what features biologists are interested in
capturing. So we thought how about we move away from maximum likelihood and
Bayes and we move into this world of sort of features or summary statistics based
on the data?
And so what we then did was we treated as data these summary statistics. And now,
you can go back to thinking about this problem in the way that we thought about the
computer model problem from the climate science research, which is for each input -- I
can run this model, right? At each input I have an output that I now summarize with
summary statistics. And the summary statistics are now like my output corresponding
to each theta. And then I have observations. And I can use summary statistics based
on those observations. And then now I can use Gaussian process based ideas just like
I did before. That's a bit of a lie. There's lots of details here. There's some
computational issues. And we use some tricks to get the computing to work out right.
But the end result of all of that is that we produce parameter estimates that then result
in a model that fits the data -- the things that they care about -- much, much better.
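As a rough sketch of that summary-statistics emulation idea, reusing the scikit-learn setup from the earlier toy example: run the stochastic simulator at a design of parameter values, reduce each run to a vector of scientifically relevant summaries, and fit one GP per summary as a function of theta. The simulator, the summary function, and the design are assumed placeholders, not the actual measles code.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF, WhiteKernel

def emulate_summaries(simulator, summarize, theta_design):
    """Fit independent GP emulators for each summary statistic as a function of theta.

    simulator(theta)  -> simulated space-time output (hypothetical, e.g. city-level counts)
    summarize(output) -> 1D vector of scientifically relevant summaries
    theta_design      -> (n_design, n_params) array of parameter settings
    """
    summaries = np.array([summarize(simulator(t)) for t in theta_design])

    emulators = []
    for j in range(summaries.shape[1]):
        kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(1e-4)
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
        gp.fit(theta_design, summaries[:, j])   # summary j as a smooth function of theta
        emulators.append(gp)
    return emulators

# usage sketch: compare emulated summaries against the observed summaries
# emus = emulate_summaries(run_measles_model, compute_features, theta_design)
# preds = np.column_stack([g.predict(theta_candidates) for g in emus])
# distance_to_obs = np.linalg.norm(preds - observed_summaries, axis=1)
```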
>>: [inaudible] parameters that [inaudible] I'm not sure [inaudible] care about. But so
for instance for this data, the slices that you have, the like slices in time and space,
right, so now you can create an interpolate [inaudible] approach and take a new
unobserved point in time and space and then create a value. But is that the theta you
care about?
>> Murali Haran: Okay. So here's the funny thing. We can no longer -- our Gaussian
process no longer interpolates the space-time process. In the previous method, we can
actually interpolate that.
>>: Right.
>> Murali Haran: Here, the interpolation we're doing are of these features.
>>: [inaudible].
>> Murali Haran: Yes. So if you give me -- if you give me a new parameter input, I
cannot give you all of these time series directly based on the Gaussian process
emulator. But what I can do is give you the summary statistics corresponding to
that.
But it doesn't matter because if I actually have the parameters, the stochastic model is
way easier -- you know, I can run the stochastic model. Which I couldn't do with the
climate science things. If I have the input space now, I have the distribution of the
parameters, I can use that to run the original stochastic model. So I can actually still do
sort of predictions or projections or whatever I want based on those models.
>>: Once you get data that you care about.
>> Murali Haran: Yes.
>>: Okay. But so -- okay. So now your GP is on the summary statistics. But still what
is your dependent variable that you're interpolating between? Are they time equals then
now or --
>> Murali Haran: No. So the thing that I'm interpolating across is that input theta. So I skipped --
>>: [inaudible].
>> Murali Haran: Yeah. Yeah. Yeah. So I apologize, you know, but in my skipping
maybe I didn't spend enough time just laying out the fact that theta -- just like in the
other case, where the parameters control the climate process, here again theta, the
same, you know, red-colored things -- this time controls the dynamics of the spread of
the disease. So once I know what theta is, or I have an idea of what theta is based on
data, I can actually go back and say something about the process at likely theta values.
Was there another -- yeah?
>>: So if I just take these curves that you show here, are these really the parameters
they care about or are these the sanity checks that they use to check any model to see
if it [inaudible].
>> Murali Haran: These are the -- I'm not sure the distinction between the two. So
these are things that they want -- want to make sure that the model actually reproduces
this. Because they don't think -- they would also like to see the model reproduce the
time series. But this is -- if the model doesn't get this right, it's really missing what they
think are sort of key features of the biology. Right? But they would also like to see the
time series matched.
But now if you think about it, if you have 952 time series, you need to come up with
some kind of metric that tells you how well you're doing. And it's not very obvious how
to do that because there's complications, there's lots of small cities, and you can be way
off in many, many small cities but do very well in the bigger cities and still get the
general picture right, and you're willing to live with the variability with the smaller city
prediction. So if you did something naive you wouldn't actually be fitting it in the right
way. So -- which is why this provides a natural way to think about sort of a summary
measure that we're interested in. And that's why we work with this. Yes?
>>: [inaudible] original data is basically saying that you care about log loss, the
[inaudible].
>> Murali Haran: Correct.
>>: [inaudible] and here basically the [inaudible] saying into the function I care about is
that matching [inaudible].
>> Murali Haran: That's a very nice summary. So it's such a nice summary I'm going to
repeat it.
So the -- what he's reminding us is that maximum likelihood -- you know, in some sense
if you want to think about it in terms of loss, loss functions on the parameter space, it's
telling you -- it's doing something very naive in terms of imposing loss, right? There's a
squared function there. And that's not necessarily what you want. You would rather have
the loss be placed on these guys. And that's what -- that's what we end up doing. And
so -- but -- so it makes sense that our model then fits this. Because now we're actually
trying to fit these guys, rather than trying to fit the original thing. That makes sense.
But the very surprising and bizarre thing is that if we do -- so this is not surprising. The
weird thing is that if we actually do a controlled experiment where we know what the
input was, where we know what was used to simulate the data, and then we hide it from
our procedure, this approach recovers it better than the Bayes or the usual likelihood
based approach. And that has -- we don't know exactly the details why and the
mathematics of it. We don't know exactly how to figure that out. But we think it has a
lot to do with the fact that these smaller cities have a disproportionately large
say in how the maximum likelihood approach works and sort of noisy things in the
smaller cities really make -- screw up your inference.
And by looking at these summaries, we're somehow making it do less ridiculous things.
It's -- this is the surprising part of it. We did not expect -- we expected to get this, but we
did not expect that we would also get better parameter estimates. So let me just
summarize and we can come back if you have more questions.
So just an overall summary. So Gaussian processes, we think -- I think -- are a very
powerful tool when the likelihood is implicit and simulating from the model is
expensive. And these are very useful for deterministic and stochastic models. And the
first case was deterministic, second was stochastic.
And we can perform inference based on scientifically important features of the data.
And this is what I talked about in the second -- in the disease modeling approach. And
they're computationally expedient. And, in fact, I didn't say much about this, but -except what I just said was surprising. May actually improve inference and prediction in
sort of the regular way that you would think about doing inference and prediction.
And so this is just acknowledging collaborators. The guys in blue are PhD students or
former PhD students. And so these are just acknowledging support for all of this work.
So thank you very much.
[applause].
>>: First of all, this is a comment and then any questions. So if anyone would like to
[inaudible] there are a couple spots left to schedule. They'll be here for the rest of the
day.
And then are there any other questions?
>>: [inaudible] comment. I think [inaudible] saying this at the end. But the thing that
you saw with the fact that you're recovering data better now with this model must mean
that the summary statistics are also kind of discounting those smaller [inaudible].
>> Murali Haran: Yes. Yes. Exactly.
>>: The cleverness of the summary statistics, however they're constructed, ends up
[inaudible].
>> Murali Haran: That's right.
>>: [inaudible].
>> Murali Haran: That's right. And recently there have been a couple of other people
working on very different projects who have noticed things like what we've noticed.
Maybe there were people before who noticed that as well, but I wasn't aware of it.
But summarizing things in a particular way sometimes -- even if, you know -- in statistics
if you're in the exponential family, if you're dealing with sufficient statistics, then it makes
no difference, right? But in these sort of complicated models, somehow doing -- looking
at sort of particular summaries does better than working with the original data, which is
--
>>: [inaudible] human engineering and intuition that went into those [inaudible].
[brief talking over].
>>: I mean, it's kind of like I think as machine learning people we're kind of like oh,
always go to the raw data because that's where the real truth is.
>> Murali Haran: Right.
>>: But here you have a case where, in fact, people have looked at and no, no, no, this
is the type that really matters. You know, leveraging that has given you this [inaudible].
>> Murali Haran: Absolutely. And, in fact, you know, that -- that's -- that last part of my
talk on disease dynamics, I -- you know, I give one hour talks on that subject alone.
And there, one of the things that I would often highlight is the fact that this gives us a
sort of a nice way to directly talk to scientists. I mean, this gets scientists excited
because it --
>>: [inaudible].
>> Murali Haran: It actually tells them that their notions of what is important is being
built directly into the inference.
>>: Let's thank Murali again.
>> Murali Haran: Thank you so much.
[applause]