Statistical modelling and latent variables. Constructing models based on insight and motivation

Statistical modelling - why?
A statistical model describes a possible distribution of incoming
data, given some (unknown) parameter values (the likelihood).
If this model is to be useful, knowing the parameter values would
answer some of the questions you have:
– Yes/no answers, detecting new effects
– Decision-making
– Prediction
– Quantifying effects
– Statistics (probabilities, averages, variances etc.)
– Results could be used as data for further analysis
With data, the statistical model can be used for saying something
about the parameter values (inference).
Models and reality
A model that exactly describes reality is unrealistic, but...
If a model contains properties we know not to be the case,
our inference will suffer. (GIGO)
– Unrealistic estimates, effects, uncertainties, probabilities, predictions
– Faulty decision making
– Incorrect answers to yes/no questions.
Give your model the chance to be right!
– Exception: When added realism makes the inference much harder
without affecting the accuracy of what you want answered.
Einstein: Everything should be as simple as possible,
but not simpler! (See also Occam’s razor)
When a model clashes with reality – scale of data
Confidence interval for average mammoth body mass
Dataset: x=(5000kg,6000kg,11000kg)
Model 1: xi ~ N(μ, σ²) i.i.d.
– Allows for single mammoths to have negative mass!
– Resulting 95% confidence interval, C(μ) = (-650 kg, 15300 kg), contains
expectation values that cannot possibly be correct!
Model 2: log(xi) ~ N(μ, σ²) i.i.d. (xi ~ logN(μ, σ²))
– Only positive measured body masses and expectation, E(xi) = exp(μ + σ²/2).
– Resulting 95% bootstrapped confidence interval: (2500 kg, 10400 kg) (see the
code sketch after this list).
• Even better if we could include some prior assumptions.
• Getting an unbiased point estimate is a bit more involved. If only an
unbiased estimate is wanted, model 1 may be better.
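As a rough sketch of how the log-normal bootstrap could be carried out, here is a minimal Python example (numpy only). The slide does not specify the exact bootstrap scheme, so a nonparametric bootstrap over the log-masses is assumed here, and the resulting interval will not exactly reproduce (2500 kg, 10400 kg).

```python
import numpy as np

# Mammoth body masses (kg) from the slide example.
x = np.array([5000.0, 6000.0, 11000.0])
log_x = np.log(x)
n = len(x)

rng = np.random.default_rng(1)
n_boot = 10_000
boot_means = np.empty(n_boot)

for b in range(n_boot):
    # Resample the log-masses with replacement (nonparametric bootstrap).
    sample = rng.choice(log_x, size=n, replace=True)
    mu_hat = sample.mean()
    sigma2_hat = sample.var(ddof=1)
    # Plug-in estimate of E(x_i) = exp(mu + sigma^2 / 2) under the log-normal model.
    boot_means[b] = np.exp(mu_hat + sigma2_hat / 2)

lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for E(x_i): ({lower:.0f} kg, {upper:.0f} kg)")
```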
When a model clashes with reality –
independence vs timeseries
Simulated water temperature series with expectation μ = 10.
Assume the standard deviation σ = 2 is known. We want to estimate μ and test
μ = 10.
• Model 1, independence: Ti = μ + σεi, εi ~ N(0,1) i.i.d.
– The graph seems to tell a different story...
– Estimated: μ̂ = x̄ = 11.4, sd(μ̂) = σ/√n ≈ 0.2
– 95% conf. int. for μ: (11.02, 11.80). μ = 10 rejected!
• Model 2, auto-correlated model with expectation μ, standard deviation
σ and auto-correlation a.
– Linear dependency between the temperature one day and the next.
– Estimated: μ̂ = x̄ = 11.4, sd(μ̂) ≈ 1.4
– 95% conf. int. for μ: (8.7, 14.1). μ = 10 not rejected.
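A minimal sketch of the difference between the two models, assuming an AR(1) form for the auto-correlated model and the large-sample variance inflation factor (1 + a)/(1 − a) for the mean. The series is simulated here with illustrative values of μ, σ and a, so the numbers will differ from those on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(1) temperature series: T_t - mu = a*(T_{t-1} - mu) + e_t.
# mu, sigma and a are chosen for illustration only.
mu, sigma, a, n = 10.0, 2.0, 0.9, 100
e = rng.normal(0.0, sigma * np.sqrt(1 - a**2), size=n)  # keeps the marginal sd equal to sigma
T = np.empty(n)
T[0] = mu + rng.normal(0.0, sigma)
for t in range(1, n):
    T[t] = mu + a * (T[t - 1] - mu) + e[t]

x_bar = T.mean()
a_hat = np.corrcoef(T[:-1], T[1:])[0, 1]                 # lag-1 autocorrelation estimate

sd_naive = sigma / np.sqrt(n)                            # Model 1: i.i.d. assumption
sd_ar1 = sd_naive * np.sqrt((1 + a_hat) / (1 - a_hat))   # large-n AR(1) correction

print(f"mean = {x_bar:.2f}")
print(f"i.i.d. 95% CI: ({x_bar - 1.96*sd_naive:.2f}, {x_bar + 1.96*sd_naive:.2f})")
print(f"AR(1)  95% CI: ({x_bar - 1.96*sd_ar1:.2f}, {x_bar + 1.96*sd_ar1:.2f})")
```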
Some notation:
• Use Pr(x) to denote the probability that a certain random
variable (X) has the value x. Not possible for continuous
variables (except in the form Pr(a<X<b) ). Keep in mind that
probabilities should sum to one: Pr(A)+Pr(not A)=1.
• Use f(x) to denote the probability density of a continuous
random variable (X), as a function of its input argument, x.
Use f for different such variables, using the input argument
to denote which density we are looking at (f(x) is the density of x,
f(y) is the density of y). Keep in mind that a probability density
should integrate to one, ∫f(x)dx = 1 (see the numeric check below).
• I will switch between probability and probability density, since some of the
variables we may be looking at are discrete, while others are continuous.
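A small numeric check of the two bookkeeping rules (probabilities sum to one, densities integrate to one), using a fair die and the standard normal density as illustrative examples:

```python
import numpy as np

# Pr(x): probabilities of a discrete variable must sum to one.
# Example: a fair six-sided die.
p = np.full(6, 1 / 6)
print("sum of Pr(x):", p.sum())                 # 1.0

# f(x): the density of a continuous variable must integrate to one.
# Example: standard normal density, integrated numerically on a wide grid.
x = np.linspace(-10, 10, 20001)
f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
dx = x[1] - x[0]
print("integral of f(x) dx:", np.sum(f) * dx)   # ~1.0
```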
Parameters, observations and latent variables –
Observations, D
• Observations, D:
o Before getting them, they are random variables. You cannot
accurately predict them; they have an element of stochasticity.
You can assign a statistical probability distribution to them (the model).
o After getting them, they should tell you something about their
distribution, and this in turn should answer the questions you
are interested in.
o Do not gather data that are not relevant to your questions! (If
you have a choice of gathering data that is a little relevant or a lot relevant,
choose the latter.)
Parameters, observations and latent variables –
Parameters, θ
• Their values are assumed to be fixed but unknown.
• Getting data (D) should reveal something about the nature of the
parameter set (θ). The parameters should affect
the outcome of observations.
θ → D
• I.e. the likelihood, Pr(D|θ) or f(D|θ) as a function of θ, should not be
flat. The green arrows here mean we have a specification of the
probability of D given θ (the likelihood).
• The reason we are interested in the parameter values is that they
should answer some questions we may have.
• In frequentist statistics: we look at functions of the data which
relate to the parameters in a simple way, estimators. An estimator
is a random variable, just like the individual data.
θ̂ = θ̂(D)
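To illustrate that an estimator is itself a random variable, here is a small simulation sketch using the sample mean as an estimator of μ (the values of μ, σ and n are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(42)

# The sample mean as an estimator of the expectation mu.
mu, sigma, n = 10.0, 2.0, 25

# Because the data are random, the estimator is a random variable too:
# repeat the experiment many times and look at its sampling distribution.
estimates = np.array([rng.normal(mu, sigma, n).mean() for _ in range(5000)])
print("mean of estimates:", estimates.mean())   # close to mu (unbiased)
print("sd of estimates:  ", estimates.std())    # close to sigma/sqrt(n) = 0.4
```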
Parameters, observations and latent variables –
latent variables, L
• Latent variables (L) are unobserved but random: Pr(L) or f(L).
• Can add realism to our modelling. Stuff we observe can depend
on unobserved states and processes (that have some element of
unpredictability/randomness in them).
• Affects the outcome of observations (D), just like parameters,
Pr(D|L). Thus getting data should reveal something about latent
variables.
L → D
• Since both D and L are stochastic, this is a conditional probability.
We need to be able to deal with such...
• Since L are unknown random variables rather than unknown fixed
values, we can use probability theory to sum up what we know
about the latent variables, given the data.
D → L
Conditional probabilities – definitions and
intuition
• Pr(B|A) means the probability that B is the case, given that we know
that A is the case. For example Pr(rain | overcast) means the
probability that it rains for those cases when it’s overcast.
• A is probabilistic evidence for B when Pr(B|A)>Pr(B).
• Technically, it is defined by looking at the distribution of both A and B
and then "zooming in on A": Pr(B|A) = Pr(A and B)/Pr(A).
So it's the fraction of probability space where both A and B happen,
in relation to the fraction of the space of possibilities where A
happens. We remove all possibilities of A not happening.
• Ex: Pr(rain and overcast)/Pr(overcast) = Pr(rain | overcast)
[Figure: the Overcast/Sunny/Rain probability space shown twice: Pr(rain and overcast) is the dark area compared to the whole space; Pr(rain | overcast) is the dark area compared to the Overcast part only.]
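A tiny numeric illustration of the definition, with made-up values for the rain/overcast example:

```python
# Pr(B|A) = Pr(A and B) / Pr(A), with invented probabilities for illustration.
pr_overcast = 0.40
pr_rain_and_overcast = 0.25

pr_rain_given_overcast = pr_rain_and_overcast / pr_overcast
print(pr_rain_given_overcast)  # 0.625
```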
Conditional probabilities – combined
probabilities
• If we run the definition of conditional probability backwards,
we get the probability for a combination:
Pr(A and B)=Pr(B|A)Pr(A). (We “zoom out” from the
probability of B when A is certain to the probability of A and
B, where neither A nor B is a certainty.)
Pr(rain | overcast) Pr(overcast) = Pr(rain and overcast)
[Figure: the same Overcast/Sunny/Rain diagrams read the other way: multiplying Pr(rain | overcast) by Pr(overcast) gives Pr(rain and overcast), each shown as a dark area compared to the whole.]
Conditional probabilities – dependence and
information (1)
Independence means Pr(A and B) = Pr(A)Pr(B), which is equivalent to
Pr(B|A) = Pr(B) and Pr(A|B) = Pr(A). For instance, the probability of getting a 6 on the
second throw of a die is the same as the probability of getting a 6 on the second throw,
given that you got a 3 on the first. Knowing the result of the first throw doesn't help you
predict the outcome of the next.
With dependency, the probabilities change when we condition. Getting
information about A also gives us information (evidence) about B.
A → B
The arrows describe how we model. In the case of A → B, it says that we start
with a probabilistic description of A (Pr(A)) and then supplement this with a
description of B given A (Pr(B|A)). This gives us the combined probability,
Pr(A,B) = Pr(A)Pr(B|A).
Typically, we build our models from our understanding of what and how
something affects something else. (For instance, occupancy affects
detections but not vice versa).
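As a sketch of modelling in the direction of the arrows, here is a hypothetical occupancy/detection example (the probabilities psi and p are invented for illustration) that also checks the factorisation Pr(A and B) = Pr(A)Pr(B|A) by simulation:

```python
import numpy as np

rng = np.random.default_rng(3)

# Occupancy Z (A) is modelled first, then detection D given Z (B given A).
psi = 0.6   # Pr(site occupied)          -- illustrative value
p = 0.7     # Pr(detection | occupied)   -- illustrative value

n_sites = 100_000
Z = rng.random(n_sites) < psi            # latent occupancy, Pr(A)
D = (rng.random(n_sites) < p) & Z        # detection only possible if occupied, Pr(B|A)

# Check Pr(A and B) = Pr(A) Pr(B|A) by simulation:
print("simulated Pr(occupied and detected):", (Z & D).mean())
print("psi * p =                           ", psi * p)
```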
Conditional probabilities – dependence and
information (2)
If B depends on A then A depends on B. Dependency flows both ways. We
only use arrows to show in which direction we do our modelling
(usually based on our understanding of what affects what).
A → B
We can describe the dependency structure of several phenomena:
Ex: A ← B → C or A → B → C
Pr(A,B,C) = Pr(B)Pr(A|B)Pr(C|B) (the factorization for A ← B → C)
We may model by first sketching what influences what. That will then inform
us about the structure of the combined probabilities.
When we specify what influences what and *how*, we have a model.
Concrete example:
A) Carrying capacity → Actual population size → Measured population size
B) Finch egg laying ← Season → River discharge
In A → B → C, knowing B tells us something about both A and C. Knowing A
tells us something about B and C. But, conditioned on B, A says nothing
about C or vice versa.
Pr(A,B,C) = Pr(A)Pr(B|A)Pr(C|B)
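A simulation sketch of the chain A → B → C with invented conditional probabilities, showing that A carries information about C marginally, but essentially none once we condition on B:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

# Hypothetical chain A -> B -> C with binary variables (values made up):
pA = 0.5
pB_given_A = {0: 0.2, 1: 0.8}
pC_given_B = {0: 0.3, 1: 0.9}

A = (rng.random(n) < pA).astype(int)
B = (rng.random(n) < np.where(A == 1, pB_given_A[1], pB_given_A[0])).astype(int)
C = (rng.random(n) < np.where(B == 1, pC_given_B[1], pC_given_B[0])).astype(int)

# Marginally, A carries information about C...
print("Pr(C=1|A=1) =", C[A == 1].mean(), " vs  Pr(C=1|A=0) =", C[A == 0].mean())
# ...but conditioned on B, A says (almost) nothing about C:
sel = B == 1
print("Pr(C=1|B=1,A=1) =", C[sel & (A == 1)].mean(),
      " vs  Pr(C=1|B=1,A=0) =", C[sel & (A == 0)].mean())
```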
Conditional probabilities – from conditional
probabilities to marginal probabilities
• Sometimes we want the distribution of a quantity without
having to specify anything else. The law of total probability
says that the marginal (unconditional) probability for B is:
Pr(B) = Σ_{i=1..n} Pr(B and A_i) = Σ_{i=1..n} Pr(B | A_i) Pr(A_i),
where the A_i's are mutually exclusive and Σ_{i=1..n} Pr(A_i) = 1.
• Example: Pr(rain) =
Pr(rain | overcast)Pr(overcast) + Pr(rain | sunny)Pr(sunny)
• Useful when calculating likelihoods (later).
[Figure: Pr(rain) decomposed as the rain area within Overcast plus the rain area within Sunny.]
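The same example in a couple of lines of Python, reusing the made-up numbers from the conditional-probability illustration above:

```python
# Law of total probability with invented numbers:
# Pr(rain) = Pr(rain|overcast) Pr(overcast) + Pr(rain|sunny) Pr(sunny).
pr_overcast = 0.40
pr_sunny = 0.60                      # mutually exclusive, sums to one with overcast
pr_rain_given_overcast = 0.625
pr_rain_given_sunny = 0.05

pr_rain = (pr_rain_given_overcast * pr_overcast
           + pr_rain_given_sunny * pr_sunny)
print(pr_rain)  # 0.28
```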
Conditional probabilities – Bayes theorem
• Looking at latent variables L and data D, we start out with
a specification of L (the marginal probabilities) and a
specification of D given L.
L → D
• Known: the data. Unknown: the latent variables. We're
interested in the opposite of what has been modelled: the
marginal probability of the data and the probability of the
latent variables given the data. Since D and L are
dependent, we can do this.
D → L
• The law of total probability gives us the first:
Pr(D) = Σ_{L'} Pr(D | L') Pr(L')
• The definition of conditional probabilities gives us the
second, Bayes' theorem:
Pr(L, D) = Pr(D|L)Pr(L) = Pr(L|D)Pr(D)
⟹ Pr(L|D) = Pr(D|L)Pr(L) / Pr(D)
For continuous variables,
replace probabilities with
probability densities and
sums with integrals.
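A minimal sketch of both steps (law of total probability for Pr(D), then Bayes' theorem for Pr(L|D)) for a binary latent variable, with an invented prior and likelihood:

```python
import numpy as np

# Prior Pr(L) and likelihood Pr(D = observed | L), made-up values.
prior = np.array([0.7, 0.3])          # Pr(L=0), Pr(L=1)
likelihood = np.array([0.1, 0.8])     # Pr(D=obs | L=0), Pr(D=obs | L=1)

# Law of total probability: Pr(D) = sum over L' of Pr(D|L') Pr(L').
pr_D = np.sum(likelihood * prior)

# Bayes' theorem: Pr(L|D) = Pr(D|L) Pr(L) / Pr(D).
posterior = likelihood * prior / pr_D
print("Pr(D)   =", pr_D)              # 0.31
print("Pr(L|D) =", posterior)         # [0.2258..., 0.7741...]
```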