Bayesian methods for parameter estimation and data assimilation for crop models

PART 1. First steps in Bayesian statistics

SLIDE 1
Hi. Here we go. We'll try to move fast and nevertheless explain slowly and clearly. To reconcile these seemingly contradictory objectives, we implement two strategies:
1. We have a particular focus: Bayesian methods for crop models. This is not a general course in Bayesian statistics. That limits the amount of material to cover.
2. This mini-course works on two levels: the slides and the text. If the information on the slides is sufficient for you, great. When it isn't, there is a more detailed explanation in the text.

SLIDE 2
You might be interested in a brief introduction to the place of Bayesian statistics in the world of statistics today. Statistics is basically concerned with drawing conclusions from data. These conclusions can take the form of estimates of parameter values, predictions of future results, hypothesis tests, confidence intervals, etc. We talk of statistics and statistical procedures, but in fact there are two different schools of statistics: one is called frequentist statistics and the other Bayesian statistics. Both are concerned with drawing conclusions from data, but they emphasize different questions, use different procedures and can lead to different results. For many years there has been a sort of rivalry between these two schools, each claiming that the other has very serious flaws and that one should therefore choose its own approach. The Bayesian school had one drawback that was extremely damaging: the required calculations were essentially impossible except in relatively simple cases. This was probably the major reason that frequentist statistics became by far the more common school of thought. If you have had a basic statistics course, it is no doubt frequentist statistics that you learned. This major drawback has now all but disappeared, thanks to new algorithms and more computing power. You can now do Bayesian calculations even in very complex situations. This has changed, and is continuing to change, the relative popularity of the two schools. There is more and more interest in and use of Bayesian procedures. The advantages of the Bayesian school, which are important, are convincing more and more scientists in many different fields to use these methods. Their use in crop modelling is embryonic, but we think Bayesian methods will play a very important role in this field in the future. That is the rationale behind our investment in Bayesian methods and our decision to try to explain these methods simply in this mini-course.

SLIDE 6
Here, very briefly, are some general characteristics that differentiate Bayesian methods from frequentist methods. We will go into this in more detail later in the course.
1. Frequentist methods are based solely on the data. Bayesian methods combine data with "prior" information, where the prior information is often expert opinion or results from the literature.
2. The basic relation between data and parameters in frequentist statistics is the likelihood function, which quantifies how likely it is to obtain the data values that were observed, for different values of the parameters. The likelihood is also central in Bayesian procedures, but it is used in Bayes' theorem, which combines the likelihood with the prior information.
3. Bayesians treat parameters as random variables.
The main objective is to obtain a probability distribution for the parameter vector, and this distribution represents our uncertainty about the value of the parameters. In frequentist methods, on the other hand, parameters are considered fixed, not random, quantities, but those fixed values are unknown. The main objective of the corresponding statistical procedures is to derive estimates of these unknown values, together with uncertainty measures for the estimates. The distributions of interest represent the uncertainty in the estimators of the parameters, not in the parameters directly. The uncertainty in frequentist statistics represents the variability that would arise if the experiment were repeated many times, not our uncertainty about the parameter values.

SLIDE 7
Statistics is based on probability theory, which deals with random variables. Both frequentist and Bayesian statistics accept this foundation and all its consequences. That is, the basic mathematical theorems are identical in both cases. That is lucky for us, because it means that we don't have to go into a deep discussion of probability to understand the differences between the two schools. However, there is a bit of simple probability theory that we will need. We need to understand the basic concepts of conditional probability, joint probability and marginal probability, and the relationships between them. That's what we'll explain on the following slides. Among the simple formulae we'll derive is Bayes' theorem.

SLIDES 8, 9 JOINT PROBABILITY
Consider two random variables, A and B. A, for example, might be the random variable "rain next January 1 in Toulouse", with two possible values, "yes" and "no". B might be "rain next January 2 in Toulouse", again with two possible values, "yes" and "no".

A: rain next Jan 1
B: rain next Jan 2

The probability that it rains on January 2 AND on January 1 is written P(A="yes", B="yes"). This is the joint probability of both events occurring. To estimate this probability we could look at weather data for past years (see table below). Let n be the number of years with records. Here n = 10. Let n(A=yes, B=yes) be the number of years with rain on both January 1 and January 2. From the table, n(A=yes, B=yes) = 3. Then we could estimate the joint probability by n(A=yes, B=yes)/n = 3/10. In the general case we write P(A=a_i, B=b_j), or more compactly P(A, B), to indicate the probability that A takes some particular value and B some particular value.

Year  Rain Jan 1?  Rain Jan 2?
 1    yes          yes
 2    no           yes
 3    yes          no
 4    yes          yes
 5    yes          yes
 6    no           no
 7    yes          no
 8    no           no
 9    no           no
10    yes          no

SLIDES 10, 11 CONDITIONAL PROBABILITY
The probability that it rains on January 2 GIVEN that it rains on January 1 is written P(B="yes" | A="yes"). The vertical bar is read as "given that". This is called a conditional probability. To estimate this probability we would count the years with rain on January 1, and only for those years count the number of years with rain on January 2. In our case the number of years with rain on January 1 is n(A=yes) = 6, and among those years the number with rain also on January 2 is n(B=yes | A=yes) = 3. The conditional probability would then be estimated as n(B=yes | A=yes)/n(A=yes) = 3/6. The general notation is P(B=b_j | A=a_i), or more compactly P(B | A), to indicate the probability that B takes some particular value given that A takes some particular value.
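If you want to check these counts yourself, here is a short Python sketch (our addition, not part of the slides; the two lists simply transcribe the table above) that estimates the joint and conditional probabilities directly from the 10-year record.

rain_jan1 = ["yes", "no", "yes", "yes", "yes", "no", "yes", "no", "no", "yes"]  # A: rain on Jan 1
rain_jan2 = ["yes", "yes", "no", "yes", "yes", "no", "no", "no", "no", "no"]    # B: rain on Jan 2

n = len(rain_jan1)                                        # number of years with records, n = 10
n_both = sum(a == "yes" and b == "yes"                    # years with rain on Jan 1 AND Jan 2
             for a, b in zip(rain_jan1, rain_jan2))
n_a = rain_jan1.count("yes")                              # years with rain on Jan 1

p_joint = n_both / n      # estimate of P(A=yes, B=yes)  -> 3/10
p_cond = n_both / n_a     # estimate of P(B=yes | A=yes) -> 3/6

print(p_joint, p_cond)    # 0.3 0.5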
SLIDES 12, 13 MARGINAL PROBABILITY
Finally, we might be interested in rain on just one day. The probability that it rains, for example, on January 2 is written P(B="yes") and is called the total or marginal probability that B="yes". To estimate the marginal probability we would just count the years with rain on January 2 and divide by the total number of years with records. From the table this gives n(B=yes)/n = 4/10. Of course we could also look at the marginal probability for A. The marginal probability that it rains on January 1 is P(A="yes"), which could be estimated by n(A=yes)/n = 6/10. The general notation is P(A=a_i) or P(B=b_j), or more compactly P(A) or P(B).

SLIDE 14 RELATIONSHIPS BETWEEN PROBABILITIES
The different probabilities (joint, conditional, marginal) are interrelated. We now write some general equations that relate these different probabilities to one another. These are intuitively easy to understand, and easily illustrated with our Toulouse rainfall example.

Using the rainfall example, the joint probability can be written in terms of the conditional probability as

P(A=yes, B=yes) = P(B=yes | A=yes) P(A=yes)

In words, the probability that it rains on both Jan 1 AND on Jan 2 equals the probability that it rains on Jan 1 times the probability that it rains on Jan 2 GIVEN that it rains on Jan 1. In our case the left hand side is 3/10 and the right hand side is (3/6)(6/10) = 3/10, so the equality holds. In our general notation this relation becomes

P(A, B) = P(B | A) P(A)    (1.1)

If we exchange A and B the equation becomes P(B, A) = P(A | B) P(B). However, P(A, B) = P(B, A), since the order in which the random variables are written in the joint probability is unimportant. This implies that

P(B | A) P(A) = P(A | B) P(B)    (1.2)

Returning to our rainfall example, we can relate conditional and marginal probabilities as

P(B=yes) = Σ_i P(B=yes | A=a_i) P(A=a_i)

where the sum is over all possible values of A. In our case A can take only two values, "yes" and "no", so the equation is

P(B=yes) = P(B=yes | A=yes) P(A=yes) + P(B=yes | A=no) P(A=no)
         = P(B=yes, A=yes) + P(B=yes, A=no)

In words, the total probability that it rains on Jan 2 is the probability that it rains on Jan 1 and then also rains on Jan 2, plus the probability that it doesn't rain on Jan 1 and then rains on Jan 2. For our data the left hand side is 4/10 and the right hand side is (3/6)(6/10) + (1/4)(4/10) = 4/10, so again the equality is seen to hold. In our general notation this relation becomes

P(B) = Σ_i P(B | A=a_i) P(A=a_i)

For P(A) we have equivalently

P(A) = Σ_i P(A | B=b_i) P(B=b_i)    (1.3)

In the case of continuous variables rather than discrete variables, the probability P is replaced by a probability density function f and the sum becomes an integral. For two continuous random variables X and Y, eq. (1.3) becomes

f(X) = ∫ f(X | Y) f(Y) dY

SLIDE 15 BAYES' THEOREM
We now have all we need to derive Bayes' theorem. Rearranging eq. (1.2) gives

P(B | A) = P(A | B) P(B) / P(A)
         = P(A | B) P(B) / Σ_i P(A | B=b_i) P(B=b_i)    (1.4)

In going from the first line to the second we have used eq. (1.3). The equivalent for continuous random variables is

f(X | Y) = f(Y | X) f(X) / f(Y)
         = f(Y | X) f(X) / ∫ f(Y | X) f(X) dX

Eq. (1.4) is Bayes' theorem. This formula was derived by the Reverend Thomas Bayes and published in 1763. Hardly a recent discovery. The equation is accepted by both frequentists and Bayesians, since it follows directly from the basic principles of probability. It is the way it is used that distinguishes Bayesians. Note that basically the equation is a way of calculating one conditional probability, P(Y | X), in terms of the conditional probability in the other direction, P(X | Y). It is therefore useful when we are interested in P(Y | X) but what we know, or can easily calculate, is P(X | Y).
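As a quick numerical check, here is another short Python sketch (again our addition, not from the slides) that plugs the rainfall counts into eqs. (1.1), (1.3) and (1.4). Read in this direction, Bayes' theorem gives P(A=yes | B=yes), the probability of rain on January 1 given that it rained on January 2.

p_a = 6 / 10              # P(A=yes), rain on Jan 1
p_b_given_a = 3 / 6       # P(B=yes | A=yes)
p_b_given_nota = 1 / 4    # P(B=yes | A=no)

p_joint = p_b_given_a * p_a                                   # eq. (1.1): P(A,B) = P(B|A)P(A) = 3/10
p_b = p_b_given_a * p_a + p_b_given_nota * (1 - p_a)          # eq. (1.3): total probability = 4/10
p_a_given_b = p_b_given_a * p_a / p_b                         # eq. (1.4): Bayes' theorem

print(p_joint, p_b, p_a_given_b)   # 0.3 0.4 0.75, i.e. 3 of the 4 rainy Jan 2 years also had rain on Jan 1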
SLIDES 16, 17 EXAMPLE OF THE USE OF BAYES' THEOREM
It really is important for the rest of this course to understand the notions of conditional, joint and marginal probability. To help out, we'll give another example. This will also illustrate that Bayes' theorem can be useful in real life. We have copied this example from a book and can't find the reference. We hope the unacknowledged author will forgive us.

The author explains that he had lunch with a colleague and his colleague's pregnant wife. The wife said that she had just come from the doctor and had learned that she was pregnant with twins, two boys. What was the probability, she wondered, that the twins were identical twins? A question that is right up the alley of a statistician (or anybody else) familiar with Bayes' theorem.

B is now the random variable "twins are identical or not", which takes two possible values, "identical" and "not identical". A is the random variable "sexes of the twins", with possible values "boy, girl", "girl, boy", "boy, boy" and "girl, girl". We are interested in the conditional probability P(B=identical | A=two boys). It is not immediately obvious what this value is. However, we can easily calculate the conditional probability in the other direction, namely P(A=two boys | B=identical). If the twins are identical, they are either two boys or two girls. It is reasonable to assume that the two cases are equally probable (based on real-world experience), so P(A=two boys | B=identical) = 0.5. The problem then is to calculate a conditional probability of B given A from the conditional probability of A given B. By Bayes' theorem, eq. (1.4):

P(B=identical | A=two boys) = P(A=two boys | B=identical) P(B=identical) / P(A=two boys)    (1.5)

To evaluate this expression we need P(B) and P(A). P(B=identical) is the overall probability of having identical twins (among pregnancies with twins). Real-world data show that about 1/3 of all twins are identical. So P(B=identical) = 1/3. This is our "prior" information. If we knew nothing about the sexes of the children (no experimental data for this specific case) and were asked for the probability that the twins are identical, we would say "1/3".

We can relate P(A=two boys) to quantities we know using eq. (1.3), which here is

P(two boys) = P(two boys | identical) P(identical) + P(two boys | not identical) P(not identical)
            = (1/2)(1/3) + (1/4)(2/3) = 1/3

The value P(two boys | not identical) = 1/4 follows from the fact that in the case of non-identical twins the four possible values of A are assumed to be equally likely, so each has probability 1/4. The value P(not identical) = 2/3 follows from the fact that 1/3 of twins are identical, so 2/3 are not. Substituting into eq. (1.5) gives

P(B=identical | A=two boys) = (1/2)(1/3) / (1/3) = 1/2

The probability that the woman's twins are identical, given her knowledge that she is carrying two boys, is one half. The result combines experimental data ("two boys") with prior information ("1/3 of all twins are identical").
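The same calculation written as a short Python sketch (our addition), to make explicit how the prior, the likelihood of the observation, and the total probability in the denominator fit together:

prior_identical = 1 / 3                  # P(B=identical), the prior information
p_boys_given_identical = 1 / 2           # P(A=two boys | B=identical)
p_boys_given_fraternal = 1 / 4           # P(A=two boys | B=not identical)

# Denominator P(A=two boys) from the total probability rule, eq. (1.3)
p_boys = (p_boys_given_identical * prior_identical
          + p_boys_given_fraternal * (1 - prior_identical))

# Posterior P(B=identical | A=two boys) from Bayes' theorem, eq. (1.5)
posterior_identical = p_boys_given_identical * prior_identical / p_boys

print(p_boys, posterior_identical)       # 0.333... 0.5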
SLIDE 18 BAYES' THEOREM FOR MODEL PARAMETERS
To make the connection with parameter estimation for crop models, we denote by y the experimental data and by θ the vector of model parameters. Then Bayes' theorem becomes

f(θ | y) = f(y | θ) f(θ) / f(y)
         = f(y | θ) f(θ) / ∫ f(y | θ) f(θ) dθ

On the right hand side, f(y | θ) is the probability of obtaining the observed data y if the parameter vector has the value θ. This is the likelihood function, which can usually be written down easily once the statistical model is formulated. f(θ) is the prior distribution, which contains our prior information about the parameters. Finally, f(θ | y) is called the posterior distribution of the parameter values. It is posterior in the sense that it is obtained after we update the prior information using the data. This is the basic equation of Bayesian statistics. The rest of this mini-course essentially consists of interpreting, applying or solving that equation.
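To make the equation concrete, here is a minimal Python sketch (entirely our own illustration, with invented numbers and a toy one-parameter "model" whose prediction is simply θ itself, not a real crop model). It evaluates the posterior on a grid of parameter values by multiplying a Gaussian likelihood by a Gaussian prior and normalizing; the normalization step plays the role of the integral in the denominator. For a real crop model, f(y | θ) would be built from runs of the model, and evaluating that integral is what makes the computation hard.

import numpy as np

y = np.array([4.8, 5.4, 5.1])        # hypothetical observations
sigma = 0.5                          # assumed measurement standard deviation
prior_mean, prior_sd = 4.0, 1.0      # prior information about theta

theta_grid = np.linspace(2.0, 8.0, 601)

# Log-likelihood f(y | theta): product of Gaussian densities, up to a constant
log_lik = np.array([-0.5 * np.sum((y - t) ** 2) / sigma**2 for t in theta_grid])

# Log-prior f(theta): Gaussian density, up to a constant
log_prior = -0.5 * (theta_grid - prior_mean) ** 2 / prior_sd**2

# Posterior: normalize likelihood * prior over the grid
unnorm = np.exp(log_lik + log_prior - np.max(log_lik + log_prior))
posterior = unnorm / np.trapz(unnorm, theta_grid)

print(theta_grid[np.argmax(posterior)])   # posterior mode, pulled between the data mean and the prior mean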