Bayesian methods for parameter estimation and data
assimilation for crop models
PART 1. First steps in Bayesian statistics
SLIDE 1
Hi. Here we go. We'll try to move fast and nevertheless explain things slowly and clearly. To
reconcile these seemingly contradictory objectives, we implement two strategies:
1. We have a particular focus: Bayesian methods for crop models. This is not a general
course in Bayesian statistics. That limits the amount of material to cover.
2. This mini-course works on two levels: the slides and the text. If the information on
the slides is sufficient for you, great. When it isn't, there is a more detailed explanation in the
text.
SLIDE 2
You might be interested in a brief introduction to the context of Bayesian statistics in
the world of statistics today.
Statistics is basically concerned with drawing conclusions from data. These
conclusions can take the form of estimates of parameter values, predictions of future results,
hypothesis testing, confidence intervals, etc.
We talk of statistics and statistical procedures, but in fact there are two different
schools of statistics: one is called frequentist statistics and the other is called Bayesian
statistics. They are both concerned with drawing conclusions from data, but they emphasize
different questions, use different procedures and can lead to different results.
For many years there has been a sort of rivalry between these two schools, each
claiming that the other has serious flaws and that one should therefore adopt its own
approach. The Bayesian school had one drawback that was extremely damaging: the required
calculations were essentially impossible except in relatively simple cases. This was probably
the major reason that frequentist statistics became by far the most common school of thought.
If you have had a basic statistics course, it is no doubt frequentist statistics that you have
learned.
This major drawback has now all but disappeared, thanks to new algorithms and more
computing power. You can now do Bayesian calculations in even very complex situations.
This has changed, and is continuing to change, the relative popularity of the two schools.
There is more and more interest in and use of Bayesian procedures. The advantages of the
Bayesian school, which are important, are convincing more and more scientists in many
different fields to use these methods. Their use in crop modelling is embryonic, but we think
Bayesian methods in the future will play a very important role in this field. That is the
rationale behind our investment in Bayesian methods and our decision to try to explain these
methods simply in this mini-course.
SLIDE 6
Here, very briefly, are some general characteristics that differentiate Bayesian
methods from frequentist methods. We will go into this in more detail later in the course.
1. Frequentist methods are based solely on the data. Bayesian methods involve
combining data and “prior” information, where “prior” information is often expert opinion or
literature results.
2. The basic relation between data and parameters in frequentist statistics is the
likelihood function, which quantifies how likely it is to obtain the data values that were
observed, for different values of the parameters. The likelihood is also central in Bayesian
procedures, but is used in Bayes’ Theorem which combines the likelihood with the “prior”
information.
3. Bayesians treat parameters as random variables. The main objective is to obtain a
probability distribution for the parameter vector, and this distribution represents our
uncertainty about the value of the parameter. In frequentist methods, on the other hand,
parameters are considered fixed, not random, quantities, but those fixed values are unknown.
The main objective of relevant statistical procedures is to derive estimates of these unknown
values, and uncertainty measures for the estimates. The distributions of interest represent the
uncertainty in the estimators of the parameters and not of the parameters directly. The
uncertainty in frequentist statistics represents the variability that would arise if the experiment
were repeated many times, not our uncertainty about the parameter values.
SLIDE 7
Statistics is based on probability theory, which deals with random variables. Both
frequentist and Bayesian statistics accept this foundation and all its consequences. That is, the
basic mathematical theorems are identical in both cases. That is lucky for us, because it means
that we don’t have to go into a deep discussion of probability to understand the differences
between the two schools.
However, there is a bit of simple probability theory that we will need. We need to
understand the basic concepts of conditional probability, joint probability and marginal
probability and the relationships between them. That’s what we’ll explain on the following
slides. Among the simple formulae we’ll derive is Bayes’ Theorem.
SLIDES 8, 9 JOINT PROBABILITY
Consider two random variables, A and B. A, for example, might be the random variable
"rain next January 1 in Toulouse" with two possible values "yes" and "no". B might be "rain
next January 2 in Toulouse", again with two possible values "yes" and "no".
A = rain next Jan 1
B = rain next Jan 2
The probability that it rains on January 2 AND on January 1 is written
P(A="yes", B="yes"). This is the joint probability of both events occurring. To estimate this
probability we could look at weather data for past years (see table below). Let n be the
number of years with records. Here n = 10. Let n_{A=yes,B=yes} be the number of years with rain on
both January 1 and January 2. From the table, n_{A=yes,B=yes} = 3. Then we could estimate the joint
probability by n_{A=yes,B=yes}/n = 3/10.
In the general case we could write P(A = a_i, B = b_i) or more compactly P(A, B) to
indicate the probability that A takes some particular value and B some particular value.
Year   Rain Jan 1?   Rain Jan 2?
 1     yes           yes
 2     no            yes
 3     yes           no
 4     yes           yes
 5     yes           yes
 6     no            no
 7     yes           no
 8     no            no
 9     no            no
10     yes           no
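To make these counts concrete, here is a minimal sketch in Python (our own illustration, not part of the original slides; the table above is hard-coded as a list of (rain Jan 1, rain Jan 2) pairs and the variable names are ours):

```python
# Ten years of records from the table above: (rain on Jan 1, rain on Jan 2)
records = [("yes", "yes"), ("no", "yes"), ("yes", "no"), ("yes", "yes"),
           ("yes", "yes"), ("no", "no"), ("yes", "no"), ("no", "no"),
           ("no", "no"), ("yes", "no")]

n = len(records)                                                     # total number of years, 10
n_joint = sum(1 for a, b in records if a == "yes" and b == "yes")    # years with rain on both days
print(n_joint / n)                                                   # estimated P(A=yes, B=yes) = 3/10
```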
SLIDES 10, 11 CONDITIONAL PROBABILITY
The probability that it rains on Jan 2 GIVEN that it rains on January 1 is written
P(B="yes" | A="yes"). The vertical bar is read as "given that". This is called a conditional
probability. To estimate this probability we would count the years with rain on January 1, and
only for those years count the number of years with rain on January 2. In our case the number
of years with rain on January 1 is n_{A=yes} = 6 and for those years the number of years where it
also rained on January 2 is n_{B=yes|A=yes} = 3. The conditional probability would then be
estimated as n_{B=yes|A=yes}/n_{A=yes} = 3/6.
The general notation would be P(B = b_i | A = a_i) or more compactly P(B | A) to
indicate the probability that B takes some particular value given that A takes some particular
value.
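Continuing the same sketch, the conditional estimate simply restricts the count to the years where it rained on January 1:

```python
# Restrict attention to years with rain on Jan 1, then count rain on Jan 2 among them.
n_rain_jan1 = sum(1 for a, b in records if a == "yes")                 # 6 years with rain on Jan 1
n_rain_both = sum(1 for a, b in records if a == "yes" and b == "yes")  # 3 of those also had rain on Jan 2
print(n_rain_both / n_rain_jan1)                                       # estimated P(B=yes | A=yes) = 3/6
```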
SLIDES 12, 13 MARGINAL PROBABILITY
Finally, we might be interested in just rain on one day. The probability that it rains, for
example, on January 2 is written P(B="yes") and is called the total or marginal probability
that B="yes". To estimate the marginal probability we would just count the years with rain on
Jan 2 and divide by the total number of years with records. From the table this gives
n_{B=yes}/n = 4/10. Of course we could also look at the marginal probability for A. The marginal
probability that it rains on Jan 1 is P(A="yes"), which could be estimated by n_{A=yes}/n = 6/10.
The general notation is P(A = a_i) or P(B = b_i), or more compactly P(A) or P(B).
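And the marginal estimates, again continuing the sketch:

```python
# Marginal (total) probabilities: count one column at a time and divide by the number of years.
print(sum(1 for a, b in records if b == "yes") / n)   # estimated P(B=yes) = 4/10
print(sum(1 for a, b in records if a == "yes") / n)   # estimated P(A=yes) = 6/10
```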
SLIDE 14 RELATIONSHIPS BETWEEN PROBABILITIES
The different probabilities (joint, conditional, marginal) are interrelated. We now write
some general equations that relate these different probabilities to one another. These are
intuitively easy to understand, and easily illustrated with our Toulouse rainfall example.
Using the rainfall example, the joint probability can be written in terms of the conditional and
marginal probabilities as
P(A = yes, B = yes) = P(B = yes | A = yes) P(A = yes)
In words, the probability that it rains on both Jan 1 AND on Jan 2 equals the probability that it
rains on Jan 1 times the probability that it rains on Jan 2 GIVEN that it rains on Jan 1. In our
case the left hand side is 3/10 and the right hand side is (3/6)(6/10) = 3/10, so the equality
holds.
In our general notation this relation becomes
P(A, B) = P(B | A) P(A)     (1.1)
If we exchange A and B the equation becomes
P(B, A) = P(A | B) P(B).
However, P(A, B) = P(B, A), since the order of writing the random variables in the joint
probability is unimportant. This implies that
P(B | A) P(A) = P(A | B) P(B)     (1.2)
Returning to our rainfall example, we can relate conditional and marginal probabilities
as
P(B = yes) = Σ_i P(B = yes | A = a_i) P(A = a_i)
where the sum is over all possible values of A. In our case A can take only two values, "yes"
and "no", so the equation is
P(B = yes) = P(B = yes | A = yes) P(A = yes) + P(B = yes | A = no) P(A = no)
           = P(B = yes, A = yes) + P(B = yes, A = no)
In words, the total probability that it rains on Jan 2 is the sum of the probability that it rains on
Jan 1 and then rains also on Jan 2, plus the probability that it doesn't rain on Jan 1 and then
rains on Jan 2. For our data the left hand side is 4/10 and the right hand side is
(3/6)(6/10) + (1/4)(4/10) = 4/10, so again the equality is seen to hold.
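These identities can also be checked numerically; continuing the Python sketch above (the helper function p is our own shorthand for a relative-frequency estimate over the records list):

```python
def p(predicate):
    """Probability of an event, estimated as a relative frequency over the records."""
    return sum(1 for rec in records if predicate(rec)) / n

p_joint = p(lambda r: r[0] == "yes" and r[1] == "yes")          # P(A=yes, B=yes) = 3/10
p_a     = p(lambda r: r[0] == "yes")                            # P(A=yes) = 6/10
p_b     = p(lambda r: r[1] == "yes")                            # P(B=yes) = 4/10
p_b_given_a    = p_joint / p_a                                  # P(B=yes | A=yes) = 3/6
p_b_given_nota = p(lambda r: r[0] == "no" and r[1] == "yes") / p(lambda r: r[0] == "no")

print(abs(p_joint - p_b_given_a * p_a) < 1e-12)                 # eq. (1.1): joint = conditional x marginal
print(abs(p_b - (p_b_given_a * p_a + p_b_given_nota * (1 - p_a))) < 1e-12)  # total probability relation
```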
In our general notation this relation becomes
P(B) = Σ_i P(B | A = a_i) P(A = a_i)
For P(A) we have equivalently
P(A) = Σ_i P(A | B = b_i) P(B = b_i)     (1.3)
In the case of continuous variables rather than discrete variables, the probability P is
replaced by a probability distribution function f and the sum becomes an integral. For two
continuous random variables X and Y, eq. (1.3) becomes
f(X) = ∫ f(X | Y) f(Y) dY.
SLIDE 15 BAYES’ THEOREM
We now have all we need to derive Bayes' theorem. Rearranging eq. (1.2) gives
P(B | A) = P(A | B) P(B) / P(A)
         = P(A | B) P(B) / Σ_i P(A | B = b_i) P(B = b_i)     (1.4)
In going from the first line to the second we have used eq. (1.3). The equivalent for
continuous random variables is
f(X | Y) = f(Y | X) f(X) / f(Y)
         = f(Y | X) f(X) / ∫ f(Y | X) f(X) dX
Eq. (1.4) is Bayes’ theorem. This formula was derived by the reverend Thomas Bayes
and published in 1763. Hardly a recent discovery. This equation is accepted by both
frequentists and Bayesians since it follows directly from basic principles of probability. It is
the way that it is used that distinguishes Bayesians.
Note that basically the equation is a way of calculating one conditional probability,
P(X|Y), in terms of the conditional probability in the other direction, P(Y|X). It will be useful,
then, when we are interested in P(X|Y) but what we know or can easily calculate is P(Y|X).
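Continuing the rainfall sketch, here is Bayes' theorem used in exactly this way: suppose we can easily estimate P(B=yes | A=yes) and the marginals, and we want the conditional probability in the other direction, P(A=yes | B=yes). This reuses the quantities computed in the previous sketch:

```python
# Bayes' theorem with the rainfall estimates: from P(B=yes | A=yes) and the marginals,
# obtain the conditional probability in the other direction, P(A=yes | B=yes).
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)                                     # (3/6)(6/10)/(4/10) = 3/4

# Direct check against the relative-frequency estimate from the records:
print(p(lambda r: r[0] == "yes" and r[1] == "yes") / p(lambda r: r[1] == "yes"))  # also 3/4
```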
SLIDES 16, 17 EXAMPLE OF THE USE OF BAYES' THEOREM
It really is important for the rest of this course to understand the notions of conditional,
joint and marginal probability. To help out, we’ll give another example. This will also
illustrate that Bayes’ theorem can be useful in real life.
We have copied this example from a book and can’t find the reference. We hope the
unacknowledged author will forgive us. The author explains that he had lunch with a
colleague and his colleague’s pregnant wife. The wife said that she had just come from the
doctor and had learned that she was pregnant with twins, two boys. What was the probability,
she wondered, that the twins were identical twins? A question that is right up the alley of a
statistician (or anybody else) familiar with Bayes' theorem.
B is now the random variable "twins are identical or not" and takes two possible
values, "identical" and "not identical". A is the random variable "sexes of the twins" and the
possible values are "boy, girl", "girl, boy", "boy, boy" and "girl, girl". We are interested in the
conditional probability P(B = identical | A = two boys). It is not immediately obvious what the
value is. However, we can easily calculate the conditional probability in the other direction,
namely P(A = two boys | B = identical). If the twins are identical, they are either two boys or two
girls. It is reasonable to assume that the two cases are equally probable (based on real-world
experience) and so P(A = two boys | B = identical) = 0.5. The problem then is to calculate a
conditional probability of B given A from the conditional probability of A given B. By Bayes'
theorem, eq. (1.4):
P(B = identical | A = two boys) = P(A = two boys | B = identical) P(B = identical) / P(A = two boys)     (1.5)
To evaluate this expression we need P(B) and P(A). P(B) is the overall probability of
having identical twins (among pregnancies with twins). Real-world data shows that about 1/3
of all twins are identical twins. So P(B)=1/3. This is our “prior” information. If we knew
nothing about the sexes of the children (no experimental data for this specific case) and were
asked for the probability that the twins are identical, we would say “1/3”.
We can relate P(A = two boys) to quantities we know using eq. (1.3), which here is
P(two boys) = P(two boys | identical) P(identical) + P(two boys | not identical) P(not identical)
            = (1/2)(1/3) + (1/4)(2/3) = 1/3
The value P(two boys | not identical) = 1/4 follows from the fact that in the case of non-identical
twins the four possible values of A are assumed to be equally likely, so each has
probability 1/4. The value P(not identical) = 2/3 follows from the fact that 1/3 of twins are
identical, so 2/3 are not. Substituting into eq. (1.5) gives
P(B = identical | A = two boys) = (1/2)(1/3)/(1/3) = 1/2
The probability that the woman’s twins are identical twins, given her knowledge that
she is carrying two boys, is one half. The result combines experimental data (“two boys”)
with prior information (“1/3 of all twins are identical twins”).
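The whole twins calculation fits in a few lines. Here is a sketch in the same style as before, with the prior and the likelihoods hard-coded from the values given in the text:

```python
# Prior: about 1/3 of twin pregnancies are identical twins.
p_identical = 1 / 3
p_not_identical = 1 - p_identical

# Likelihood of observing "two boys" under each hypothesis.
p_boys_given_identical = 1 / 2      # identical twins: two boys or two girls, equally likely
p_boys_given_not = 1 / 4            # non-identical twins: four sex combinations, equally likely

# Marginal probability of "two boys" (eq. 1.3), then Bayes' theorem (eqs. 1.4 / 1.5).
p_boys = p_boys_given_identical * p_identical + p_boys_given_not * p_not_identical
p_identical_given_boys = p_boys_given_identical * p_identical / p_boys
print(p_identical_given_boys)       # 0.5
```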
SLIDE 18 BAYES’ THEOREM FOR MODEL PARAMETERS
To make the connection with parameter estimation for crop models, we denote the
experimental data by y and the vector of model parameters by θ. Then Bayes' theorem becomes
f(θ | y) = f(y | θ) f(θ) / f(y)
         = f(y | θ) f(θ) / ∫ f(y | θ) f(θ) dθ
On the right hand side f(y | θ) is the probability of obtaining the observed data y if the
parameter values are θ. This is the likelihood function, which can usually be written down
easily once the statistical model is formulated. f(θ) is the prior distribution, which contains
our prior information about the parameters. Finally, f(θ | y) is called the posterior distribution
of the parameter values. It is posterior in the sense that it is obtained after we update the prior
information using the data.
This is the basic equation of Bayesian statistics. The rest of this mini-course
essentially consists of interpreting, applying or solving that equation.
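To give a concrete, if much simplified, picture of what this equation means in practice, here is a minimal sketch (our own toy example, not taken from the course): a "model" that is linear in a single parameter θ, a Gaussian likelihood with a known error standard deviation, a Gaussian prior standing in for expert opinion, and the posterior evaluated on a grid of θ values with the integral in the denominator approximated by a sum. All numbers and names are illustrative assumptions.

```python
import numpy as np

# Toy "crop model": y = theta * x, with Gaussian measurement error of known standard deviation.
x = np.array([1.0, 2.0, 3.0, 4.0])          # inputs (e.g. a weather or management variable)
y = np.array([2.1, 3.9, 6.2, 7.8])          # hypothetical observations
sigma = 0.5                                  # assumed measurement error standard deviation

theta_grid = np.linspace(0.0, 4.0, 2001)     # grid of candidate parameter values

# Prior f(theta): Gaussian with mean 1.5 and standard deviation 1 (stand-in for prior information).
prior = np.exp(-0.5 * ((theta_grid - 1.5) / 1.0) ** 2)

# Likelihood f(y | theta): product over observations of Gaussian error densities.
resid = y[None, :] - theta_grid[:, None] * x[None, :]
likelihood = np.exp(-0.5 * np.sum((resid / sigma) ** 2, axis=1))

# Posterior f(theta | y) = likelihood * prior / normalizing constant,
# with the integral in the denominator approximated by a Riemann sum over the grid.
posterior = likelihood * prior
posterior /= posterior.sum() * (theta_grid[1] - theta_grid[0])

print(theta_grid[np.argmax(posterior)])      # posterior mode, close to 2 for these toy data
```

A grid like this only works for one or a few parameters; the appeal of the newer algorithms mentioned earlier is precisely that they can handle the denominator in much more complex situations.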