Bayesian statistics
Probabilities for everything
Different views on probability
Frequentist: Probabilities are there to tell us about long-term
frequencies. They are objective, solely properties of nature
(aleatoric).
Bayesian: Probabilities are there so that we can sum up our
knowledge about things we are uncertain about. They are
therefore found in the interplay between the subject of our study
and ourselves.
In Bayesian statistics, probabilities are subjective, but can obtain
an air of weak objectivity (intersubjectivity) if most people can
agree that these probabilities sum up their collective knowledge
(example: dice).
Bayes formula
Both latent variables and parameters are treated using probability
theory. We treat everything with the same tool, conditional
probabilities. In a sense, we only have observations and latent
variables.
Knowledge is updated using Bayes theorem:
f(θ | D) = f(D | θ) f(θ) / f(D) = f(D | θ) f(θ) / ∫ f(D | θ') f(θ') dθ'
For discrete variables, replace the probability density f with a probability and integrals with sums.
The probability density f() is called the prior and is meant to
contain whatever information we have about  before the data,
in the form of a probability density. Restrictions on the possible
values the parameters can take are placed here. More on this
later.
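To make the update concrete, here is a minimal grid-approximation sketch in Python (my own illustration, not from the slides): θ is discretized, so the integral in the denominator becomes a sum over grid points, exactly as the note above says. The binomial-type likelihood and the flat prior f(θ) = 1 are assumptions chosen purely for illustration.

```python
import numpy as np

# Grid approximation of Bayes' theorem for a single parameter theta.
# Assumed toy setup: theta is a probability-type parameter, the data D
# are 7 "detections" out of 10 trials (binomial kernel), and the prior
# f(theta) is flat on [0, 1].
theta = np.linspace(0, 1, 1001)          # discretized parameter values
prior = np.ones_like(theta)              # f(theta) = 1 (uniform prior)
likelihood = theta**7 * (1 - theta)**3   # f(D | theta)

# Denominator f(D): the integral over theta' becomes a sum over the grid.
unnormalized = likelihood * prior
posterior = unnormalized / np.trapz(unnormalized, theta)

print("Posterior mean:", np.trapz(theta * posterior, theta))
```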
Bayes formula
Bayes theorem: f(θ | D) = f(D | θ) f(θ) / f(D) = f(D | θ) f(θ) / ∫ f(D | θ') f(θ') dθ'
• This probability density, f(θ|D), is called the posterior
distribution. It sums up everything we know about the
parameters, θ, after dealing with the data, D. Estimates,
parameter uncertainty, derived quantities, decision making and
model testing all follow from this.
• An estimate can be formed using the expectation, median or
mode of the posterior distribution.
• Parameter uncertainty can be described using credibility
intervals. A 95% credibility interval (a,b) has the property
Pr(a < θ < b | D) = 95%. I.e. after seeing the data, you have 95%
probability of the parameter having a value inside this interval
(a small numerical sketch follows after this list).
• The distribution f(D) will turn out to be a (the?) problem.
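Continuing the toy grid example above (again an illustrative assumption, not the slides' own code), the posterior mean, median, mode and a 95% credibility interval can all be read directly off the posterior:

```python
import numpy as np

theta = np.linspace(0, 1, 1001)
prior = np.ones_like(theta)
likelihood = theta**7 * (1 - theta)**3   # same toy binomial kernel as before
posterior = likelihood * prior
posterior /= np.trapz(posterior, theta)

# Point estimates from the posterior distribution.
mean = np.trapz(theta * posterior, theta)
mode = theta[np.argmax(posterior)]

# The cumulative distribution on the grid gives the median and an
# equal-tailed 95% credibility interval: Pr(a < theta < b | D) = 95%.
cdf = np.cumsum(posterior)
cdf /= cdf[-1]
median = theta[np.searchsorted(cdf, 0.5)]
a, b = theta[np.searchsorted(cdf, 0.025)], theta[np.searchsorted(cdf, 0.975)]

print(f"mean={mean:.3f}, median={median:.3f}, mode={mode:.3f}, "
      f"95% interval=({a:.3f}, {b:.3f})")
```

Note that on a grid (or when working with posterior samples) the normalizing constant f(D) never has to be computed analytically.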
Bayesian statistics – Pros / Cons

Pros:
• Restrictions and insights from the biology, coded into the prior, can help the inference.
• Since you need to give a prior, you are actually forced to think about the meaning of your model.
• For some, Bayesian probabilities make more sense than frequentist ones.
• You don't have to take a stance on whether an unknown quantity is fundamentally stochastic or not.
• You get the parameter uncertainty “for free”.
• It can give answers where the classical approach has none (such as the occupancy probability conditioned only on the data).
• You are actually allowed to talk about the probability that a parameter is found in a given interval and the probability of a given null hypothesis. (This is often how confidence intervals and p-values are incorrectly interpreted.)
• Understanding the output of a Bayesian analysis is often easier than for frequentist outputs.

Cons:
• You *have* to supply a prior. That prior can be criticized. Making a prior that’s hard to criticize is hard.
• Thinking about the meaning of a model parameter is extra work.
• For some, frequentist probabilities make more sense than Bayesian ones.
• Some think distinguishing between parameters and latent variables is important.
• Sometimes you’re only interested in estimates.
• Bayesian statistics is subjective (though it can be made inter-subjective with hard work).
Bayesian statistics vs frequentist statistics – the practical issue
When the model or analysis complexity is below a certain threshold,
frequentist methods will be easier, while above that threshold
Bayesian analysis is easier.
(Figure: work/effort as a function of model complexity, with one curve for the frequentist and one for the Bayesian approach.)
Graphical modelling - occupancy
All unknown quantities are now on equal footing. All dependencies are described by conditional probabilities, with marginal probabilities (priors) at the top nodes.

Parameters (θ): ψ, p
Prior: f(ψ) = 1, f(p) = 1 (ψ, p ~ U(0,1), i.e. uniform between 0 and 1).
Latent variables: ψ1, ψ2, ψ3, ..., ψA, with Pr(ψi = 1 | ψ) = ψ. The area occupancies are independent given the occupancy rate.
Data: x1,1, x1,2, x1,3, ..., x1,n1
Pr(xi,j = 1 | ψi = 1, p) = p, Pr(xi,j = 0 | ψi = 1, p) = 1 - p,
Pr(xi,j = 1 | ψi = 0, p) = 0, Pr(xi,j = 0 | ψi = 0, p) = 1.
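A small simulation sketch (my own illustration; the number of areas A = 50 and n = 4 visits per area are made-up values) draws ψ and p from their U(0,1) priors, then the latent occupancies, then the detection data, following the conditional probabilities above:

```python
import numpy as np

rng = np.random.default_rng(1)

A, n_visits = 50, 4            # assumed sizes: 50 areas, 4 visits each

# Top nodes (parameters) with their priors: psi, p ~ U(0, 1).
psi = rng.uniform(0, 1)        # occupancy rate
p = rng.uniform(0, 1)          # detection probability

# Latent variables: Pr(psi_i = 1 | psi) = psi, independent across areas.
occupied = rng.random(A) < psi

# Data: Pr(x_ij = 1 | psi_i = 1, p) = p, and x_ij = 0 whenever psi_i = 0.
x = (rng.random((A, n_visits)) < p) & occupied[:, None]

print("true psi:", round(psi, 2), "detections per area:", x.sum(axis=1)[:10])
```

This only runs the model in the generative direction; inference (getting the posterior of ψ, p and the ψi given the xi,j) goes the other way, via Bayes' theorem.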
Hyper-parameters
If your prior has a parametric form, the values you plug into that
form are the hyper-parameters. For instance, the uniform distribution
from zero to one is a special case of a uniform prior from a to b.
Hyper-parameters: aψ, bψ, ap, bp
Parameters (θ): ψ, p
Latent variables: ψ1, ψ2, ψ3, ..., ψA
Prior: ψ ~ U(aψ, bψ) and p ~ U(ap, bp). Since p and ψ are rates, we have set aψ = ap = 0 and bψ = bp = 1.
Data: x1,1, x1,2, x1,3, ..., x1,n1
The hyper-parameters are fixed. They are there to sum up our prior knowledge. If you start doing inference on them, they are parameters, not hyper-parameters.
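As a sketch of what the hyper-parameters do in practice (my own illustration, reusing the toy binomial likelihood from earlier): they are fixed numbers that define the prior, here the endpoints of the uniform prior on ψ, and they are never themselves estimated.

```python
import numpy as np

# Hyper-parameters: fixed endpoints of the uniform prior on psi.
# The slides use a_psi = 0, b_psi = 1; a narrower choice such as
# (0.2, 0.6) would encode stronger prior knowledge about psi.
a_psi, b_psi = 0.0, 1.0

psi_grid = np.linspace(0, 1, 1001)
prior = np.where((psi_grid >= a_psi) & (psi_grid <= b_psi),
                 1.0 / (b_psi - a_psi), 0.0)

# The hyper-parameters only enter through the prior; the likelihood
# (here the same toy binomial kernel as before) is unchanged.
likelihood = psi_grid**7 * (1 - psi_grid)**3
posterior = prior * likelihood
posterior /= np.trapz(posterior, psi_grid)
```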