>>: It's my pleasure to introduce Murali Haran, who is a faculty member in the department of statistics at Penn State University. He's visiting the University of Washington on sabbatical. And today he's going to be talking to us about Gaussian processes and implicit likelihoods. >> Murali Haran: Thanks very much, Chris. Thanks for the invitation. So thank you all for showing up. I'm not sure how off the wall this topic is here, but there's quite a variety of interesting work going on in this place, so hopefully this will be interesting to several of you. All right. So I'm just going to jump right in, rather than start off by rereading the title. So the topic is dealing with complicated scientific models. And just a brief motivation for this work. This all started because a few different scientists came to talk to me at Penn State because they needed help with their Monte Carlo methods, and I've worked quite a bit on Monte Carlo methods, so they came to talk to me about Monte Carlo methods. But then it turned out that, in solving their problems, Monte Carlo was a small part of it. So I had to learn all this new stuff. And this is what this talk is about: the new stuff that I had to learn because these scientists came to me with their difficult problems. So the problems that they come with -- they are often interested in the mechanisms underlying physical phenomena. This is to draw a distinction between their models and purely statistical models, where you're just trying to fit data. But it turns out that their models are also useful for predictions and projections, and the reason I make this a separate bullet -- of course you think models are useful for predictions and projections -- is that it's worth pointing out that these models have mechanisms and physics built in, and so you could think about doing things that you would ordinarily not do with statistical models: you don't want to extrapolate statistical models very much, but with these models, maybe you're okay with extrapolating. And so it's really critical to work with the model provided by the scientists. And I have this up here more for statisticians than anybody else, because they often complain, saying this is unrealistic, why don't you just do blah, do this cool statistical model. For these two reasons -- because you want the actual physics to help you do extrapolation, and because you're actually interested in the underlying physical phenomena -- you need to work with the models they provide, and then maybe you build statistical layers on top of that. Okay. So these scientific models are numerical solutions of mathematical, deterministic models or stochastic models that reflect scientific processes. And often these models are translated into computer code, where you study simulations of the physical process at many different parameters or initial conditions. Okay. So what are some of the challenges? Often these models and simulators are very computationally expensive. So for the climate scientists I work with, it may take two weeks, maybe a month, to run some of the more expensive models. And it's also not automatic: for every new run there's a PhD student in geosciences tinkering with things to make sure they work right. And it's not possible to write closed-form expressions for many of these models, so I cannot write down any mathematics that describes them. It might be a system of a few hundred or a few thousand differential equations.
And you cannot translate it into math. The likelihood function may be expensive to evaluate. So if I'm talking about the stochastic model, maybe I can write down the model, and it looks nice, but it may be expensive to evaluate. So if you want to do maximum likelihood or Bayesian inference, this is problematic. And then this is sort of an obvious statement, but it's worth reminding people of: the model, for any initial setting or calibrated values of the parameters, is not going to reproduce reality, and you need to make an adjustment for that. So that's modeling discrepancies. So this is why the word implicit is being used here. The likelihood is often implicit, which means literally you don't have a likelihood function, because it's a deterministic model and there's no probability model, so there's no likelihood. Or there is a probability model and there is a likelihood, but it's computationally intensive. Or there are other problems with it, and so you treat it as implicit. It's there, but you're not going to actually deal with the likelihood directly. So I'm going to talk about two examples. Depending on how things go, I'll mostly talk about the first, and then briefly talk about the second project. So the first one was brought to me by climate scientists who have these models that they use for projecting the behavior of global ocean circulation systems. And the other one was brought to me by disease dynamics researchers at Penn State who are interested in seeing how infectious diseases spread. The first is deterministic, and the second is stochastic. Okay. So just a little bit of the science underlying it. If you've seen Al Gore's movie, maybe you're already familiar with this: the Meridional Overturning Circulation. All you really need to know about this is that the red arrows here tell you that the warm waters from the equator are moving out towards the poles and then the colder water from the poles is moving back to the equator. So there's this very, very slow mixing that's occurring. And what you need to take away from this is that if that mixing doesn't continue the way it has, it will somehow disturb the equilibrium that the climate system has right now. So essentially if this mixing slows down or collapses -- "collapses" is the word they like to use -- then this is going to start causing dramatic changes to the climate of, say, northern Europe, among other places. So this is called the Atlantic Meridional Overturning Circulation, and it is key to maintaining the climate system. So the collapse of this may result in dramatic climate change. And they have a very complicated model that they use to try to model this MOC. And there's lots of stuff that needs to be tuned. Think of these as inputs, or I'm just going to refer to them as parameters, because it turns out that both the scientists and the statisticians refer to them as parameters. So I'm going to call them parameters henceforth. But think of them as the things that you calibrate in your model. Kv is one of them, and Kv is a particularly important parameter that influences how these models work. Kv quantifies the intensity of vertical mixing in the ocean. So you can see how that might be related to how the circulation system behaves.
And it turns out that this thing cannot be measured directly, so you have to rely on indirect information. And these are the observations. So this is what one would call data in the usual sense: observations of two ocean tracers. These are spatial fields -- information gathered by ocean liners literally going off, dropping things into the ocean, and measuring things. So these are the two tracers, C-14 and CFC-11. This provides some information about the mixing. And then this is climate model output, not our typical notion of data, because every time I feed in an input I get the same value out. It's deterministic, right? So for different values of Kv as input, I'm going to get different values of the spatial field as output from the model. And these guys are trying to mimic these guys. Okay? And of course they're going to be different. So this is a spatial field Z1, and this is a spatial field Z2. Now, I have spatial fields Y1 and Y2 that correspond to these, but they're now functions of the input Kv. So far so good, I hope. Okay. And just to give you an idea -- this is a very, very crude image plot, but just to give you an idea of what we're dealing with -- this is a spatial field. This is a slice of it for CFC-11, one of the two tracers, at different settings of Kv: .05, .2 and .5. And these are the actual data. Okay? No processing, just some smoothing. It's a very crude picture, just to give you an idea that, well, these things vaguely look like the real data. Nothing quite matches it. But what these scientists would do, before they were talking to statisticians or doing something more elaborate, would be essentially finding some output that most closely behaved like this data and then declaring that to be the winner in some fashion. Right? >>: So these aren't models that are fit to the data -- these are models fit to other physical phenomena? >> Murali Haran: That's right. That's right. Okay. In all honesty, they do stare at some aspects of the data when they're building these things. But your answer -- I mean, what you're saying is how they want it to be. But at some level, they're -- >>: [inaudible]. >> Murali Haran: They peek. At some level they're aware of what the data looked like. >>: Yeah. >> Murali Haran: Yeah. >>: And so when you say they see how closely they match, do they just eyeball it, or -- >> Murali Haran: No. If you want to be very crude about it, you might do something like root mean squared error, and then whichever one has the lowest one is declared the winner. Or -- they recognize that they want a distribution over the Kvs, so, you know, they're smart people -- they will do something that's very ad hoc, like treat those distances as some way of getting a probability distribution even though it has nothing to do with probabilities. And then they get some kind of distribution over the values of Kv. But they're not happy with it, which is why they came to talk to me. Okay. And so this is a cartoon. In machine learning I know that you guys have seen much cooler cartoons than this. But this is just to remind you of what we're doing. So these are the different inputs -- these are different values of Kv. The green ones are the ones we've actually got to run our model at. And F of X1 and F of X2 -- this is now a spatial field and a spatial field.
Actually it's a bivariate spatial field, because I'm looking at two tracers, and a bivariate spatial field here. So you can see what I'm dealing with: I have an input, and the output is two spatial fields. Okay. So sort of a complicated output. And X star is the one where I haven't run the model, and F of X star. And if you want to just think of it this way, this is sort of a non-parametric regression problem, but the output is functional, so it's sort of a functional data problem. That's what's underlying it, if you're just interested in interpolating. But now we're interested in doing statistical inference based on this, right? So I'm working towards building a tool to do statistical inference. So I'm going to fit the emulator to a training set from the complex model. I've run the model at different settings, and I'm going to fit this thing called an emulator, and the emulator is actually going to be stochastic. Remember that the model is deterministic, so it's a bit peculiar that I'm doing this. But I'll explain to you why. For starters, by having the stochastic interpolator, I have in essence a fast approximate simulator. So I can put in a new input and actually simulate the output. And because it's stochastic, I can actually get uncertainties associated with the prediction. So there's going to be greater uncertainty where there's less training data. Okay? And my climate scientist friends love this, because they're physicists by training, and physicists think about uncertainty in this way. They think of uncertainty as: if they got all their physics right, they don't need probability and statistics. That's sort of how they think. I'm exaggerating a little bit, but that's sort of how they think. And so they like this expression of uncertainty: it's going to be more uncertain when you're far away from the values where you have training data. Okay? And of course Tony O'Hagan, who has thought a lot about these things and is a statistician in the UK, said that without any quantification of uncertainty it is easy to dismiss computer models. So this is another nice reason why you want to actually quantify uncertainty probabilistically. And then, finally, this provides a probability model, which is important because I'm now going to go away and do statistical inference based on this, right? If I don't have a probability model, I cannot do statistical inference. Okay. So now, I'm going to use the Gaussian process as a way to do this interpolation and to build this approximate probability model. So for those of you who work -- I know some people in machine learning live with Gaussian processes. How many of you here use Gaussian processes much? Anybody? >>: [inaudible]. >> Murali Haran: Okay. Okay. So you can zone out for a little while if you like. So Gaussian processes are very useful models for dependent processes. And this is actually how I first got into them -- my first exposure to Gaussian processes was in spatial models. And time series. And it turns out that they're also very useful for modeling complicated functions. And the key idea -- this is peculiar if you're not used to thinking about Gaussian processes in this way, because I used to think about them as modeling dependence, but it turns out they're very nice for modeling complicated functions. And the idea is that the dependence sort of adjusts for non-linear relationships between input and output.
And this is still peculiar, but maybe my toy examples will help you in a little bit. But before I get to the little examples, just some quick review of what Gaussian process models are -- or overview, for those of you who are new to this. So suppose I look at a process at location S in some domain D. Just think of D as two-dimensional Euclidean space, so the process Z, living at S, now has some mean function that describes the mean of the process at location S, plus W of S, which is the spatial dependence process. So now this looks an awful lot like a linear regression. The only difference is that everything has an index associated with it. That's the only difference so far, right? There's an S associated with it. This thing is your mean function and this is your error. If you were doing simple linear regression, this would just be independent and identically distributed. But an interesting thing to notice is that this location S may be physical, so it might actually be a geographical index, or it may be from input space. So if you're in 2D space it might be parameter theta 1 here and theta 2 over there. And what we do is we model dependence among the spatial random variables by modeling this W of S as a Gaussian process, and a Gaussian process is just an infinite-dimensional process such that if you take any finite collection of points S1 through SN in this, say, two-dimensional Euclidean space, then the collection of these random variates is multivariate normal. Okay. And I'm going to assume that I'm using a valid covariance function so everything is positive definite and so on. And here's just an example, to give you an idea of how it works. The covariance between the process at location SI and the process at location SJ is maybe kappa times a function that decays as I increase the distance between these two things, right? So as I increase the distance, the dependence gets weaker. And there are two positive parameters, phi and kappa, that I can fiddle with to control how the dependence decays. So two things close to each other: lots of dependence. Two things far away from each other: weak dependence. And how it decays is determined by these parameters. Okay. And so now this vector Z, Z of S1 through Z of SN, which is going to be my spatial field, given the parameters theta that determine the covariance and beta that determine the mean function, is multivariate normal -- for any selection S1 through SN, so a finite collection of these random variables. Okay. So once I've specified this model -- right now I'm just talking about Gaussian processes and Gaussian process modeling -- inference and prediction can be done by maximum likelihood or Bayes. Maximum likelihood just means I'm going to maximize that likelihood with respect to theta and beta, and Bayes means I'm going to put a prior on theta and beta and then do Markov chain Monte Carlo or something else to learn about the posterior distribution, pi of theta and beta given Z, for this Gaussian process. But the key thing is how I do predictions.
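For reference, the model just described can be written out as follows. This is a summary in standard spatial-statistics notation; the exponential covariance is only the example mentioned above, not necessarily the exact form used in the talk.

\[
Z(s) = \mu_\beta(s) + w(s), \qquad s \in D \subset \mathbb{R}^2,
\]
\[
\operatorname{Cov}\big(w(s_i), w(s_j)\big) = \kappa \exp\!\big(-\|s_i - s_j\|/\phi\big), \qquad \kappa, \phi > 0,
\]
so that for any finite collection of locations \(s_1, \dots, s_n\),
\[
\big(Z(s_1), \dots, Z(s_n)\big)^{\top} \mid \beta, \theta \;\sim\; \mathcal{N}_n\big(\mu_\beta, \Sigma(\theta)\big), \qquad \theta = (\kappa, \phi),
\]
and inference proceeds either by maximizing this multivariate normal likelihood over \((\beta, \theta)\) or by placing priors on \((\beta, \theta)\) and sampling the posterior \(\pi(\beta, \theta \mid Z)\) with MCMC.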
So once I fit the model -- so I have my data and I fit a model, and I have learned about theta and beta -- I can now predict at new locations. Let's call them S1 star through SM star, and so this Z star is going to be the collection of predictions at S1 star through SM star. And under the Gaussian process assumption, these two guys have a joint multivariate normal distribution specified in this fashion; the right-hand side is just notation for a multivariate normal. And once I have these parameters, I can tell you everything there is to know about that multivariate normal distribution. And if you go back and do very basic -- you know, first-few-weeks-of-graduate-school multivariate normal theory -- you know that Z star given these guys is multivariate normal, and you know the mean and covariance. And if you're doing Bayes, it's just one step beyond that: you're going to average over the uncertainty due to those parameters. So essentially, what's the take-home message here? Once you fit the model, you now have a model for the random variates at any other finite set of locations in the same space. Okay. And this is just to give you an idea of how it works for dependent processes. These black dots are a simulation from an AR(1), an autoregressive time series of order 1. And if I ignore the dependence and just fit a regression without dependence in the error, my interpolator would be these green guys. So this thick green one would be sort of my best prediction, and the dashed ones are the prediction intervals. You see it's missing a lot of the structure of my data. But if I fit a Gaussian process to this -- so the blue and the red are Gaussian processes with different kinds of covariance functions, and I'm just putting that in there to tell you that you have a lot of flexibility with how smooth you can make these processes. The red one is in some ways the better fit, right? The blue one oversmooths. Depending on what your assumptions are about the process, I'd say the red one is a nicer fit, and it sort of picks up all the wiggles in the data. Okay? So ignoring dependence you do poorly; putting in the dependence you do better. This is kind of obvious, because the original process I used to simulate this has dependence in it. This next part is a little less obvious, except for people who do this in machine learning: you can actually use the stochastic model to do something nice for complicated functions. So take these two toy examples: a sine curve and sort of a damped sine curve, and my black dots essentially correspond to input/output pairs for these functions, which I've kept hidden from my model. So I pretend that I don't know the functions, and I see XY pairs generated using the sine curve and the damped sine curve. Now, I fit a Gaussian process to both of these. Here's the model I'm going to use -- the same Gaussian process model for both. This is a constant mean, which seems like a ridiculous thing, right? I'm assuming that there's a constant mean, plus W of X, and this is a Gaussian process over this region zero to 20. So that's my 1D spatial domain. I just use this very, very simple model.
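To make that toy example concrete, here is a minimal sketch of such a constant-mean Gaussian process fit in plain Python/NumPy. It is illustrative only: the damped sine, the squared-exponential covariance, the small nugget, and the fixed values of kappa and phi are assumptions for this sketch, not the actual fit shown on the slides (where the covariance parameters would be estimated by maximum likelihood or given priors, as described above).

import numpy as np

# Hidden "true" function: a damped sine on [0, 20] (an assumption for illustration).
def damped_sine(x):
    return np.exp(-0.1 * x) * np.sin(x)

rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(0.0, 20.0, 15))   # observed input/output pairs
y_train = damped_sine(x_train)
x_pred = np.linspace(0.0, 20.0, 200)            # where we want predictions

# Squared-exponential covariance: kappa * exp(-(d/phi)^2), plus a tiny nugget
# for numerical stability (the "microscale variation" mentioned later in the talk).
def cov(xa, xb, kappa=1.0, phi=2.0):
    d = xa[:, None] - xb[None, :]
    return kappa * np.exp(-(d / phi) ** 2)

nugget = 1e-8
mu = y_train.mean()                             # crude constant-mean estimate

K = cov(x_train, x_train) + nugget * np.eye(len(x_train))
K_star = cov(x_pred, x_train)
K_pp = cov(x_pred, x_pred)

# Conditional multivariate normal (kriging) mean and covariance:
#   Z* | Z ~ N( mu + K_* K^{-1}(Z - mu),  K_** - K_* K^{-1} K_*^T )
alpha = np.linalg.solve(K, y_train - mu)
pred_mean = mu + K_star @ alpha
pred_cov = K_pp - K_star @ np.linalg.solve(K, K_star.T)
pred_sd = np.sqrt(np.clip(np.diag(pred_cov), 0.0, None))

# Approximate 95% prediction intervals: wide far from training points, tight near them.
lower, upper = pred_mean - 1.96 * pred_sd, pred_mean + 1.96 * pred_sd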
And the beauty of this is that I didn't need to know anything about the underlying non-linear functions that were used to produce these two things, and it just does the right thing. It adapts, it produces a nice interpolation, and it gives you prediction intervals. All right? And you can sort of get an idea of how the dependence helps. Essentially, if you have a data point here and a data point here, and you're trying to make a prediction somewhere in between, the dependence tells you that it's going to look an awful lot like this and an awful lot like this, and it's not going to look as much like the guys that are further and further away. So the prediction gets pulled towards the points that are nearby. And that automatically deals with the non-linearities. And this is the trick that one uses to do this emulation of complex computer models. And it works very nicely. You know, this is a toy example, so if we were just sitting and looking at it, we would go, oh, let's do something that's a sine or a damped sine. But imagine a 2D spatial field with very weird non-linear relationships possibly in it, where you don't really have ideas for how to fit it. This just seems to work remarkably well. Okay. So the summary of the inferential problem is that we're interested in learning about this parameter theta. To take you back to where we started, theta is Kv. Remember, that's the input that I said had a lot to do with the prediction of the MOC, this circulation system. We're interested in learning about Kv because if I tell you something about Kv, I can tell you something about what the MOC is going to be, and I can tell you something about the uncertainties associated with that. So the statistical problem then ends up being the following. You take the model output, which is a bivariate spatial process at each theta. So I'm going to say that psi 1, psi 2, through psi K are the K different settings at which I ran this model, and I have a spatial process at this one, a spatial process at this one, and so on. That's my computer model output. And then I have the actual data, which is spatial process 1, spatial process 2. Okay. So what can we learn about theta given both these pieces of information? I use the Bayesian approach to do this, and it turns out that it's really the right way to proceed here. Because there's usually real prior information about theta: the scientists know something about the physics and the parameters there. And the likelihood surface for theta is often multi-modal, or it can be poorly informed. And there are issues with identifiability sometimes if you have a high-dimensional input space. In any case, it's just nice to have access to the full posterior distribution. If you use an optimizer and try to learn about the shape of this thing by gradients and so on, it's just awkward -- you don't learn as much as you would like to. But having access to the full posterior distribution is very helpful. And if theta is multivariate, it's important to look at bivariate and marginal distributions. And it's just a lot more convenient to use a Bayes approach. So I'm advocating a Bayes approach as a very practical way to proceed, rather than just giving you philosophical reasons why you should do it. There are of course nice philosophical reasons to use it.
And then this thing is amenable to a hierarchical specification. And this is useful when I'm specifying the multivariate spatial process. Okay. So here's the two-stage approach. I might skip over some of the details of how this is done -- even my slides skim over the details -- but I'm hoping that this outline is completely clear to everyone. So here's how we split up the inference problem here. We first think of the problem as one of figuring out a probability model for the data using the computer model. All right? Because right now there's no direct connection between the input -- the Kv, that parameter that we're interested in -- and the data that we observe. I don't have a probability model connecting them. My only hope is to use the computer model runs as sort of a surrogate, right? So I'm going to find a probability model for Z using the simulations. And I model the relationship between Z and theta via this flexible emulator for the model output. Okay? So this eta of Y and theta, this is now my Gaussian process emulator. This is telling me, if I plug in a value of theta, what kind of computer model output I will get. And then I add in a discrepancy term -- we spent a fair bit of time fussing over this as well -- acknowledging the fact that even if I put in the theta value that is the perfect Kv input, the output is not going to match the observations. And it's not just going to be IID noise -- IID noise would be this term. There may be systematic discrepancy. For example, towards the polar regions, these models may generally have a positive bias, and things like that. So you want to actually allow for those kinds of discrepancies, which means you need to do something fairly flexible for this discrepancy term as well. But once you have done this -- and this is the hard work, this is really hard work, figuring out how to build this approximate probability model -- once you've built the probability model, then it's easy. Right? Easy modulo computing. It's easy because once you have the probability model, you have a likelihood. And when you have a likelihood, I put a prior distribution on top of it and I have a posterior distribution now. Yes? >>: [inaudible] something that I'm missing. >> Murali Haran: Yes? >>: [inaudible] models and so the computer models are very complicated. And I would like to believe that they are complicated for a reason, right, [inaudible] very sophisticated. >> Murali Haran: Yes. >>: System. But now you are building all kinds of naive approximations for this model. So is the conclusion that the models are complicated and can be simplified [inaudible]? >> Murali Haran: So this is a simple approximation, but it's very flexible. The assumption underlying this is that it's an approximation that assumes a certain -- well, let me summarize the question here first. The question is: the original computer models are very complicated, and you have replaced them with a much simpler model. So is the simpler model actually saying that the original computer model isn't nearly as complicated as people think? And no, the computer model is still very complicated. But the idea is -- think of a higher-dimensional version of the damped sine curve, or something a little bit more elaborate than that, a more complicated function of course.
But you can still do statistical interpolation quite nicely with these models -- even though they look simple, they're remarkably flexible. So the big assumption here is that there's a certain kind of smoothness: as I move in input space, the output is going to change smoothly. That's the assumption. If there's a certain degree of smoothness that you're willing to assume, then you have enough information, based on the computer model runs, that you can actually replace the model with this statistical approximation and still do a reasonable job. That's what we're getting at. >>: Is a fair way to characterize this that it's not so much -- you know, maybe emulator is the word that's making this a little bit confusing. I had a similar reaction -- there may be a different way to characterize it. You said that you're actually taking a whole bunch of different model runs from the computer model. >> Murali Haran: Yes. >>: And you're actually using a GP approach to do your interpolation between those, based on the parameter space that you care about. Either way, that gives you uncertainty around, you know, how far you are away from the interpolating plane? >> Murali Haran: That's exactly it. That's all of it. >>: The emulator threw me a little bit because I was like, oh, you're making a replacement for the model. >> Murali Haran: But in a way, when you're doing an interpolation, you are making a replacement for the model. >>: That's true. >> Murali Haran: Right? So everything you said is compatible with what I've been saying here. Yes? >>: I wanted to ask -- so you use the computer model to generate this kind of underlying data, and effectively -- is it true that you want to use the GP to model the residual between the computer runs and the data you have observed, effectively? >> Murali Haran: Effectively -- well, not quite. The residual modeling, that's going into the discrepancy function here. >>: Right. >> Murali Haran: And it turns out that that's also a Gaussian process, which is why I answered in the affirmative first. This is also a Gaussian process, but I'm not going to get into that detail. You need enough flexibility to model the residuals as well. But before I do that, I need an approximate model for the original computer model, right? And notice -- first of all, this works a lot better than you might think. It's a very simple-looking approximation, but it works quite well as an interpolator. And what's more, because it's stochastic, if you're far away from a point where you actually have a run, you're honestly characterizing the fact that I have more uncertainty in my interpolator at those points. So this works out well. And then we do these perfect model experiments, where we hold out runs -- you know, different kinds of cross-validation-type exercises -- and we add all kinds of noise in there and see how it works. And it's quite remarkable how well this works. Yes? >>: So if [inaudible] one of the discrepancy between your new model and the emulation model, this is just -- you could have as well built a term for [inaudible]. >> Murali Haran: Sorry. It's not -- it's not -- >>: [inaudible]. >> Murali Haran: No, no. So the discrepancy is actually -- what I've done is I've taken the computer model and I've replaced it with this approximate model. This is modeling the distance between the approximate model and the data.
Not my approximate model and the original model -- the approximate model and the data. >>: But is there any reason not to have a term specifically for the discrepancy between the computer model and the data? >> Murali Haran: No, but that discrepancy is already -- so the computer model and the emulator, right? That's already built into this guy. Because -- right? >>: [inaudible] same process [inaudible]. >> Murali Haran: Yes. >>: [inaudible]. >> Murali Haran: Yes. Well, let me see. If you actually have points that look like this -- I'm going to try to exaggerate this just to make a point. At these points, my prediction is going to be pretty much that I know this computer model exactly, so my interpolator might look something like this. And I'm exaggerating this, just to make a point. In between, the intervals are going to get very large. Now, the reason you didn't see that in my toy example -- you didn't see it actually hit these points -- there are a couple of reasons. One is that you allow for something called microscale variation, but I'm getting into too much detail here. And also there are some computational advantages to allowing for error even at the places where you have observations. Too much detail. But this is to give you an idea of the fact that as I get away from this, I'm automatically accounting for the uncertainties here already. And then this is the additional uncertainty that's explaining the gap between the computer model and my observations, and then this is the IID noise on top of that. Yes? >>: [inaudible]. >> Murali Haran: Sure. No, no, no. I like the questions. >>: So when you learn the model, right, you know that the prediction would be somewhat accurate as long as the test distribution is similar to the training distribution. >> Murali Haran: Yeah. >>: So when you train the first [inaudible] you can feed in data to the complex simulator [inaudible]. >> Murali Haran: Right. >>: But your observations for the delta are limited by reality. >> Murali Haran: Yes. >>: But actually what you are trying to predict is what happens when the situation changes. So you are trying to see what would happen if the simulation [inaudible] is far from [inaudible] would be your estimator. >> Murali Haran: So the question was: we're trying to predict in a domain that's outside where we're at, right? At this point, that's not how you want to think about it. At this point, we're thinking about producing inference for theta that allows me to produce a model that looks like what I have today. Once I've done all of that, I'm going to use the physics of the model to extrapolate. So I'm -- yeah? >>: [inaudible]. >> Murali Haran: Correct. >>: [inaudible]. >> Murali Haran: Correct. >>: [inaudible] and then you use a full-blown simulator to [inaudible]. >> Murali Haran: That's right. Calibrate now, and then use this to run the model forward. That's why you need these deterministic models. If you did statistical interpolation of any kind it would just be junk -- you're not doing interpolation, you're doing extrapolation [inaudible] -- so it's a great question. Okay. All right. I think I'm going to skip over these details, because we've had enough interesting questions and I want to get through to the end. And this is sort of a high-level view of it.
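For reference, the two-stage probability model being described is roughly of the Kennedy-O'Hagan calibration form. The notation below is a paraphrase; the actual specification in the talk, with the two tracers and the hierarchical structure that follows, is more involved.

\[
Z(s) \;=\; \eta\big(s, \theta\big) \;+\; \delta(s) \;+\; \epsilon(s),
\]
where \(\eta(\cdot,\cdot)\) is the Gaussian process emulator fitted to the model runs at the design settings \(\theta_1, \dots, \theta_K\), \(\delta(\cdot)\) is a flexible (Gaussian process) discrepancy term absorbing systematic model-data differences, and \(\epsilon(s)\) is i.i.d. observation noise. With a prior on \(\theta\) (and the remaining parameters), the posterior \(\pi(\theta \mid Z, \text{model runs})\) is then explored by MCMC.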
But essentially, in that first step that I mentioned, where you're doing an emulation of the two-stage spatial process, we do a lot of work to allow it to be flexible enough to actually interpolate the process as well. So I'm just going to skip over this. This corresponds to stage one of the emulation. Yes? >>: [inaudible] stated clearly one dimensional here that [inaudible] one dimension? >> Murali Haran: Yes. Yes. >>: And the two-dimensional spatial variation. >> Murali Haran: Correct. Correct. Yeah. So the input is low dimensional here. It's one. >>: [inaudible] multi-dimensional by you [inaudible]. >> Murali Haran: Yes. So those are the details here. So this is the output Y2, and I build this hierarchical model, Y1 given Y2. And by doing this in this hierarchical fashion I can put a lot of flexibility into the model. So I modeled the relationship between -- >>: Between Y1 and Y2, also you have a Gaussian model? >> Murali Haran: Yes. Well, Y1 and Y2 individually are Gaussian. But Y1 given Y2 is Gaussian and Y2 is Gaussian, and there's a relationship between the two. So those are details that I think I'm happy, and maybe you are all happy, that I am skipping over. >>: [inaudible] is that why you treat Y1 [inaudible]. >> Murali Haran: No. That discrepancy is accounted for by the term delta. Yeah. This thing is mainly to say that the two spatial processes are related, and we need to somehow allow for the fact that they may be related in non-linear ways. So going back to this: in the first step we fit the Gaussian process model. Fitting it means I have built this two-layered model, I have lots of parameters, and I've fiddled with those -- essentially I do a maximum likelihood fit to get good parameter values, which then give me a probability model for the computer model output. But now I want something that also allows for the fact that there's additional error that gets me to the actual observations. And once I do all of that, I can do Markov chain Monte Carlo to get to the posterior distribution that I'm interested in. This is the ultimate goal here. Okay? And then there are a lot of computational issues. So I have to make some decisions on the fly here. I think I might skip some of these things too and just give you a whiff of a related idea, because these are details that may be particularly interesting only if you're working on this kind of project. But most of you know that if you're dealing with matrix operations and the matrices are n by n, the operations [inaudible] -- it's all order n cubed, right? And if n here in our problem is in the tens of thousands, and I'm having to do that at every stage of a Markov chain Monte Carlo algorithm, this is hopeless. So we actually use a reduced-rank approach based on kernel mixing. Those of you who work on Gaussian processes know what I'm talking about. Essentially you take a very complicated Gaussian process model and turn the covariance into a patterned, specific kind of covariance structure that lets you use a couple of identities, Sherman-Morrison-Woodbury and Sylvester's theorem, in order to do your computations in a much faster, slicker way. So where you had order n cubed, you deal with order J cubed. And I'm just going to skip over those identities and cut to the conclusion.
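The matrix identities just skipped are easy to illustrate. Below is a small Python sketch, under assumed structure, of how a reduced-rank covariance of the form Sigma = sigma^2 I + K A K^T (with K an n x J basis/kernel matrix, J much smaller than n) lets you evaluate a Gaussian log-likelihood with only J x J solves, via Sherman-Morrison-Woodbury for the inverse and Sylvester's determinant theorem for the determinant. The Gaussian kernel, the identity matrix A, and the placeholder data are all assumptions for illustration, not the actual kernel-mixing model from the talk.

import numpy as np

rng = np.random.default_rng(0)
n, J = 2000, 30                        # n observations, J knots (J << n)
sigma2 = 0.5

# Kernel matrix between observation locations and knot locations (assumed Gaussian kernel).
s_obs = rng.uniform(0, 10, n)
s_knot = np.linspace(0, 10, J)
K = np.exp(-(s_obs[:, None] - s_knot[None, :]) ** 2)
A = np.eye(J)                          # covariance of the J latent knot variables (identity purely for illustration)

z = rng.normal(size=n)                 # placeholder data vector

# Naive evaluation would form the dense n x n covariance Sigma = sigma2*I + K A K^T:
# O(n^3) time, O(n^2) memory. Instead use the Woodbury identity,
#   Sigma^{-1} = I/sigma2 - K (sigma2*A^{-1} + K^T K)^{-1} K^T / sigma2,
# and Sylvester's determinant theorem,
#   det(Sigma) = sigma2^n * det(I_J + K^T K A / sigma2).
M = sigma2 * np.linalg.inv(A) + K.T @ K                     # J x J
Kt_z = K.T @ z                                              # length-J vector
quad = (z @ z) / sigma2 - Kt_z @ np.linalg.solve(M, Kt_z) / sigma2
sign, logdet_small = np.linalg.slogdet(np.eye(J) + (K.T @ K) @ A / sigma2)
logdet = n * np.log(sigma2) + logdet_small

loglik = -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

The cost is dominated by the n x J products and the J x J solve, so the per-iteration work inside MCMC is roughly order n J squared plus J cubed instead of n cubed.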
At the end of all of this work, what is our summary? Our summary is that we get a posterior distribution for Kv -- the blue curve is the final product we get from all of this work. And you can see there's a prior distribution, the black one. And then the red and the green come from -- you know, we fiddled around to see what you can learn from the individual tracers, just one tracer at a time. So Kv, it turns out, is very important for calibrating this model and these kinds of models. But there are many other parameters. Those of you who follow climate science discussions -- or debates, if you want to call them debates -- in this climate science literature, there's a parameter called climate sensitivity that is a huge subject of research, and there are lots of Science and Nature papers where essentially the conclusion is a plot like this. And our methodology extends to working with essentially any kind of parameter. So that, itself, is often an end goal for a scientist. But in this case, we can use it to do projections. Okay. So just to summarize: what we did was obtain a probability model that connected the tracers to Kv. The hierarchical model, with Gaussian processes with patterned covariances, is flexible and computationally tractable. And once I have this probability model I can infer Kv from observations. And then once I've inferred Kv, I can say something about the MOC. And we find that the MOC weakens slightly over the next 50 years. That's sort of the -- >>: [inaudible] somewhat and running the model forward, or is this more a Bayesian thing -- >> Murali Haran: So we're doing -- I skipped over that detail. But we're doing something that's sort of an amalgam, for computational reasons and the identifiability issue. So some maximum likelihood is part of it. And then the final inference that we do is entirely Bayesian. The details involve fixing some parameters at estimates obtained from maximum likelihood and then coming back and doing Bayes. >>: [inaudible] nice posterior but it seemed, you know, kind of smooth up at the top, and even the blue ones are multi-modal [inaudible]. >> Murali Haran: Yeah. Well -- >>: [inaudible] one point and -- >> Murali Haran: Yeah. I'm not sure that I would -- you know, if you look at posteriors from MCMC algorithms, this is about as smooth as they usually get. So I wouldn't go so far as calling this multimodal. But I think we declare that [inaudible]. >>: [inaudible]. >> Murali Haran: Okay. All right. So let me see how I'm doing on time. Since we started about 10 minutes late, I won't inflict myself on you for too long, but maybe I'll go for another 10 minutes. >>: [inaudible]. >> Murali Haran: Okay. So I'll just give you an overview of this. This is another -- it's a very cool problem -- okay, cool to me -- that we started working on with infectious disease collaborators. And the reason I wanted to talk about this as well is to give you an idea of how the tools that I discussed here may be useful in other problems, and in a surprising way in this problem. So this is an overview of this slide. The infectious disease dynamics people at Penn State came to me with this model that they wanted to fit. The difference between this model and -- well, they came to me with a more obviously statistical problem.
They have a stochastic model, we can write down a likelihood, and they have data, and they said: we'd like to fit this model. And again, they came to me because I've worked on Monte Carlo methods. And you'd think this shouldn't be that hard a problem. For their model you can write down the math very easily -- in one page you can write down the entire likelihood function. This is a model that's used to describe how diseases spread; in particular it is used for measles epidemics, for modeling measles dynamics. But the problem is that there are thousands of latent variables -- things that you don't observe. So, the number of immigrants moving from one location to another: if you have measles and you move to another location, you're probably going to get other people to get measles as well. You don't observe all that information, but you need that latent process in there to somehow account for it. And if you're stuck with lots of latent variables, you know that you need to integrate those latent variables out when you're performing inference, and if you have thousands of these, this is a hard problem. And the space-time data set is lovely because it's very rich and very nice, from England and Wales. It has 519 points, which is great, because you can actually learn about this disease model. Often the information is sparse; here, it's great. But of course it presents some computational challenges, so you cannot just do something naive and have this work. And this is to give you an idea of the kind of data. We have many different cities -- big cities, London -- and this is just telling you the number of disease cases. It's biweekly data. And these are the smaller cities. And the reason I'm showing you this is, first, to give you an idea of what kind of data we have: it's time series in 952 different cities, 952 plots like this. That's the data. And you can see that it's quite different when you move from big cities to small cities -- quite widely different. And that messes up doing the likelihood inference that we would like to do. >>: [inaudible]. >> Murali Haran: Yeah. There's a lot of seasonal stuff here. And you can guess what one of the seasonal things will be: measles has a lot to do with school kids, and so when schools are in session there's a spike, and so on. Okay. So there are some issues here. The stochastic model results in a likelihood that's expensive to evaluate. It's not very expensive, but it's just expensive enough that you cannot do Bayesian inference easily. And I'm going to skip over this -- does anybody here know about approximate Bayesian computation? No? Then I'm going to skip over this. Essentially there's no other approach for doing this easily, so we developed what we thought was a very simple solution. We thought we had solved this problem: we do a grid-based Markov chain Monte Carlo and we discretize the parameter space -- maybe some of you have used these tricks -- and we do a lot of computing in parallel ahead of time. That lets us do maximum likelihood and Bayes. But then the problem is -- we thought we had solved this problem and we were done. But we fit this model and it doesn't fit the scientifically relevant features of the data.
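A rough sketch of that grid-based idea in Python, purely for illustration: the one-dimensional parameter grid, the placeholder log-likelihood function, and the random-walk proposal over grid indices are all stand-ins, not the actual measles model or its latent-variable likelihood.

import numpy as np

rng = np.random.default_rng(0)

# 1. Discretize the parameter space and precompute the expensive log-likelihood
#    at every grid point (the part that is done ahead of time, in parallel).
theta_grid = np.linspace(0.1, 2.0, 200)          # hypothetical 1-D parameter grid

def expensive_loglik(theta):
    # Placeholder for the real (latent-variable-integrated) likelihood evaluation.
    return -0.5 * ((theta - 0.7) / 0.2) ** 2

loglik_grid = np.array([expensive_loglik(t) for t in theta_grid])   # embarrassingly parallel

def log_prior(theta):
    return 0.0                                    # flat prior over the grid range (assumed)

# 2. Metropolis random walk over grid *indices*: every likelihood lookup is now free,
#    so the maximum-likelihood estimate (argmax of loglik_grid) and a Bayesian
#    posterior sample both come cheaply from the precomputed table.
n_iter, idx = 50_000, len(theta_grid) // 2
samples = np.empty(n_iter, dtype=int)
for t in range(n_iter):
    prop = idx + rng.integers(-5, 6)              # propose a nearby grid index
    if 0 <= prop < len(theta_grid):
        log_ratio = (loglik_grid[prop] + log_prior(theta_grid[prop])
                     - loglik_grid[idx] - log_prior(theta_grid[idx]))
        if np.log(rng.uniform()) < log_ratio:
            idx = prop
    samples[t] = idx

posterior_draws = theta_grid[samples]             # approximate posterior over theta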
And then if you do sort of a controlled experiment, where we know what the input was and we have the output corresponding to the parameters that were used, we don't recover the parameters that were used. So that's bad news, right? But the really crucial thing is that it does not fit scientifically relevant features of the data. So what are scientifically relevant features? These are just a couple of examples -- things the biologists told us they would like their models to mimic. So they have a fitted model, and they would like the fitted model to reproduce a couple of these characteristics. And all I've done here is show predictions from our model versus the actual observations for those summary characteristics. And you can see that the fit is pretty lousy. So they're unhappy when they see this. And so how do we solve this? We thought we had solved the problem, but it turns out that simple likelihood or Bayes doesn't really solve it. So we thought: why not go and work directly with the summary statistics, right? Maximum likelihood does not understand what features biologists are interested in capturing. So we thought, how about we move away from maximum likelihood and Bayes and move into this world of features, or summary statistics, based on the data? And so what we then did was treat these summary statistics as data. And now you can go back to thinking about this problem the way we thought about the computer model problem from the climate science research, which is: for each input, I can run this model, right? At each input I have an output that I now summarize with summary statistics. And the summary statistics are now like my output corresponding to each theta. And then I have observations, and I can use summary statistics based on those observations. Now I can use Gaussian process based ideas just like I did before. That's a bit of a lie -- there are lots of details here, there are some computational issues, and we use some tricks to get the computing to work out right. But the end result of all of that is that we produce parameter estimates that then result in a model that fits the data -- the things that they care about -- much, much better. >>: [inaudible] parameters that [inaudible] I'm not sure [inaudible] care about. But so for instance for this data, the slices that you have, like slices in time and space, right -- so now you can create an interpolation [inaudible] approach and take a new unobserved point in time and space and then create a value. But is that the theta you care about? >> Murali Haran: Okay. So here's the funny thing. Our Gaussian process no longer interpolates the space-time process. In the previous method, we can actually interpolate that. >>: Right. >> Murali Haran: Here, the interpolation we're doing is of these features. >>: [inaudible]. >> Murali Haran: Yes. So if you give me a new parameter input, I cannot give you all of these time series directly based on the Gaussian process emulator. But what I can do is give you the summary statistics corresponding to that. But it doesn't matter, because if I actually have the parameters, the stochastic model is way easier -- I can run the stochastic model, which I couldn't do with the climate science models. If I have the input space now -- I have the distribution of the parameters -- I can use that to run the original stochastic model.
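A hedged sketch of that summary-statistics idea, in Python. Everything specific here is a stand-in: the toy simulator, the two summary statistics, the independent Gaussian processes over theta, and the way the emulated summaries are compared with the observed ones. This is closer in spirit to a GP-smoothed synthetic likelihood over the summaries than to the exact method in the talk, but it shows the basic flow: design runs, summaries, a GP over theta, and a comparison with the observed summaries.

import numpy as np

rng = np.random.default_rng(0)

def simulator(theta, n_steps=200):
    # Stand-in stochastic model: a noisy oscillating case-count series whose
    # level and period depend on theta (NOT the real disease model).
    t = np.arange(n_steps)
    return np.maximum(0.0, theta * np.sin(2 * np.pi * t / (10 + 20 * theta))
                      + rng.normal(scale=0.3, size=n_steps) + theta)

def summaries(series):
    # Stand-ins for scientifically motivated features: mean epidemic size and
    # proportion of near-zero time points ("fade-outs").
    return np.array([series.mean(), np.mean(series < 0.05)])

# 1. Run the stochastic model at a design of theta values; record summary statistics.
theta_design = np.linspace(0.2, 2.0, 40)
S = np.array([summaries(simulator(th)) for th in theta_design])   # 40 x 2

# 2. Fit an independent GP to each summary statistic as a function of theta.
def cov(a, b, kappa, phi):
    return kappa * np.exp(-((a[:, None] - b[None, :]) / phi) ** 2)

def gp_predict(theta_new, y, kappa=1.0, phi=0.3, tau2=0.05):
    K = cov(theta_design, theta_design, kappa, phi) + tau2 * np.eye(len(theta_design))
    k = cov(theta_new, theta_design, kappa, phi)
    mean = k @ np.linalg.solve(K, y)
    var = kappa + tau2 - np.einsum('ij,ij->i', k, np.linalg.solve(K, k.T).T)
    return mean, np.clip(var, 1e-10, None)

# 3. Compare emulated summaries with the observed ones on a fine grid of theta,
#    giving a (pseudo-)posterior for theta up to a constant.
s_obs = summaries(simulator(0.9))                 # pretend these are the observed data summaries
theta_fine = np.linspace(0.2, 2.0, 400)
logpost = np.zeros_like(theta_fine)
for j in range(S.shape[1]):
    m, v = gp_predict(theta_fine, S[:, j])
    logpost += -0.5 * (np.log(2 * np.pi * v) + (s_obs[j] - m) ** 2 / v)
# (Add a log-prior term here if the prior on theta is not flat.)
post = np.exp(logpost - logpost.max())
post /= np.trapz(post, theta_fine)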
So I can actually still do predictions or projections or whatever I want based on those models. >>: Once you get the data that you care about. >> Murali Haran: Yes. >>: Okay. But so -- okay. So now your GP is on the summary statistics. But still, what is your dependent variable that you're interpolating between? Are they time equals then now or -- >> Murali Haran: No. The thing I'm doing the interpolation across is that input theta. So I skipped -- >>: [inaudible]. >> Murali Haran: Yeah. Yeah. Yeah. So I apologize -- in my skipping maybe I didn't spend enough time laying out the fact that theta, just like in the other case where the parameters were controlling the climate process, here again theta -- the same red-colored things -- this time they control the dynamics of the spread of the disease. So once I know what theta is, or I have an idea what theta is based on data, I can actually go back and say something about the process at likely theta values. Was there another -- yeah? >>: So if I just take these curves that you show here, are these really the parameters they care about, or are these the sanity checks that they use to check any model to see if it [inaudible]? >> Murali Haran: These are the -- I'm not sure about the distinction between the two. These are things that they want to make sure the model actually reproduces. They would also like to see the model reproduce the time series. But if the model doesn't get this right, it's really missing what they think are key features of the biology. Right? But they would also like to see the time series matched. Now, if you think about it, if you have 952 time series, you need to come up with some kind of metric that tells you how well you're doing. And it's not very obvious how to do that, because there are complications: there are lots of small cities, and you can be way off in many, many small cities but do very well in the bigger cities and still get the general picture right, and you're willing to live with the variability in the smaller-city predictions. So if you did something naive you wouldn't actually be fitting it in the right way. Which is why this provides a natural way to think about a summary measure that we're interested in. And that's why we work with this. Yes? >>: [inaudible] original data is basically saying that you care about log loss, the [inaudible]. >> Murali Haran: Correct. >>: [inaudible] and here basically the [inaudible] saying the function I care about is matching [inaudible]. >> Murali Haran: That's a very nice summary. It's such a nice summary I'm going to repeat it. What he's reminding us is that maximum likelihood -- you know, in some sense, if you want to think about it in terms of loss functions on the parameter space -- is doing something very naive in terms of imposing loss, right? There's a squared function there, and that's not necessarily what you want. You would rather have the loss be placed on these guys. And that's what we end up doing. So it makes sense that our model then fits this, because now we're actually trying to fit these guys rather than trying to fit the original thing. That makes sense. But the very surprising and bizarre thing is -- well, this part is not surprising.
The weird thing is that if we actually do a controlled experiment, where we know what was used to simulate the data and then hide it from our procedure, this approach recovers it better than the Bayes or the usual likelihood-based approach. And we don't know exactly why -- the details, the mathematics of it, we don't know exactly how to figure that out. But we think it has a lot to do with the fact that these smaller cities have a disproportionately large say in how the maximum likelihood approach works, and noisy things in the smaller cities really screw up your inference. And by looking at these summaries, we're somehow making it do less ridiculous things. This is the surprising part of it. We expected to get this, but we did not expect that we would also get better parameter estimates. So let me just summarize, and we can come back if you have more questions. So just an overall summary. Gaussian processes, we think -- I think -- are a very powerful tool when the likelihood is implicit and simulating from the model is expensive. And they are useful for both deterministic and stochastic models -- the first case was deterministic, the second was stochastic. And we can perform inference based on scientifically important features of the data, which is what I talked about in the second part, the disease modeling approach. And they're computationally expedient. And, in fact, I didn't say much about this, but -- except for what I just said, which was surprising -- they may actually improve inference and prediction in the regular way that you would think about doing inference and prediction. And so this is just acknowledging collaborators -- the people in blue are PhD students or former PhD students -- and acknowledging support for all of this work. So thank you very much. [applause]. >>: First of all, a comment, and then any questions. If anyone would like to [inaudible], there are a couple of spots left to schedule. He'll be here for the rest of the day. And then are there any other questions? >>: [inaudible] comment. I think [inaudible] saying this at the end. But the thing that you saw, the fact that you're recovering the parameters better now with this model, must mean that the summary statistics are also kind of discounting those smaller [inaudible]. >> Murali Haran: Yes. Yes. Exactly. >>: The cleverness of the summary statistics, however they're constructed, ends up [inaudible]. >> Murali Haran: That's right. >>: [inaudible]. >> Murali Haran: That's right. And recently there have been a couple of other people working on very different projects who have noticed things like what we've noticed. Maybe there were people before who noticed that as well, but I wasn't aware of it. But summarizing things in a particular way sometimes -- even if, you know -- in statistics, if you're in the exponential family, if you're dealing with sufficient statistics, then it makes no difference, right? But in these sorts of complicated models, somehow looking at particular summaries does better than working with the original data, which is -- >>: [inaudible] human engineering and intuition that went into those [inaudible]. [brief talking over]. >>: I mean, it's kind of like, I think as machine learning people we're kind of like, oh, always go to the raw data because that's where the real truth is. >> Murali Haran: Right.
>>: But here you have a case where, in fact, people have looked at it and said no, no, no, this is the type that really matters. You know, leveraging that has given you this [inaudible]. >> Murali Haran: Absolutely. And, in fact, that last part of my talk on disease dynamics -- I give one-hour talks on that subject alone. And there, one of the things that I would often highlight is the fact that this gives us a nice way to directly talk to scientists. I mean, this gets scientists excited because it -- >>: [inaudible]. >> Murali Haran: It actually tells them that their notions of what is important are being built directly into the inference. >>: Let's thank Murali again. >> Murali Haran: Thank you so much. [applause]