Statistical modelling and latent variables. Constructing models based on insight and motivation

Statistical modelling - why?
• A statistical model describes a possible distribution of incoming data, given some (unknown) parameter values: the likelihood.
• If this model is to be useful, knowing the parameter values should answer some of the questions you have:
– Yes/no answers, detecting new effects
– Decision-making
– Prediction
– Quantifying effects
– Statistics (probabilities, averages, variances etc.)
– Results could be used as data for further analysis
• With data, the statistical model can be used for saying something about the parameter values (inference).

Models and reality
• A model that exactly describes reality is unrealistic, but...
• If a model contains properties we know not to be the case, our inference will suffer (GIGO: garbage in, garbage out):
– Unrealistic estimates, effects, uncertainties, probabilities, predictions
– Faulty decision-making
– Incorrect answers to yes/no questions
• Give your model the chance to be right!
– Exception: when added realism makes the inference much harder without affecting the accuracy of what you want answered.
• Einstein: Everything should be as simple as possible, but not simpler! (See also Occam’s razor.)

When a model clashes with reality – scale data
Confidence interval for average mammoth body mass
Dataset: x = (5000 kg, 6000 kg, 11000 kg)
• Model 1: $x_i \sim N(\mu, \sigma^2)$ i.i.d.
– Allows for single mammoths to have negative mass!
– The resulting 95% confidence interval, $C(\mu)$ = (-650 kg, 15300 kg), contains expectation values that cannot possibly be correct!
• Model 2: $\log(x_i) \sim N(\mu, \sigma^2)$ i.i.d., i.e. $x_i \sim \log N(\mu, \sigma^2)$
– Only positive measured body mass and expectancy, $E(x_i) = \exp(\mu + \sigma^2/2)$.
– Resulting 95% bootstrapped confidence interval: (2500 kg, 10400 kg).
• Even better if we could include some prior assumptions.
• Getting an unbiased point estimate is a bit more involved. If only an unbiased estimate is wanted, model 1 may be better.
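
The mammoth comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the t-quantile for 2 degrees of freedom is hard-coded, and the slides do not say which bootstrap scheme was used for Model 2, so the parametric bootstrap below (with a hypothetical seed and number of replicates) need not reproduce the quoted interval exactly.

```python
import math
import random
import statistics

masses_kg = [5000.0, 6000.0, 11000.0]
n = len(masses_kg)

# Model 1: x_i ~ N(mu, sigma^2). Classical t-interval for mu.
T_975_DF2 = 4.302653  # 97.5% quantile of Student's t with n - 1 = 2 degrees of freedom
mean = statistics.mean(masses_kg)
se = statistics.stdev(masses_kg) / math.sqrt(n)
ci_normal = (mean - T_975_DF2 * se, mean + T_975_DF2 * se)

# Model 2: log(x_i) ~ N(mu, sigma^2).
# Parametric bootstrap of the expectancy E(x) = exp(mu + sigma^2 / 2).
# Assumption: the bootstrap scheme, seed and replicate count are illustrative choices.
logs = [math.log(x) for x in masses_kg]
mu_hat = statistics.mean(logs)
sigma_hat = statistics.stdev(logs)

rng = random.Random(1)
n_boot = 5000
boot = []
for _ in range(n_boot):
    resample = [rng.gauss(mu_hat, sigma_hat) for _ in range(n)]
    m = statistics.mean(resample)
    s = statistics.stdev(resample)
    boot.append(math.exp(m + s * s / 2))
boot.sort()
# Percentile interval from the sorted bootstrap replicates.
ci_lognormal = (boot[int(0.025 * n_boot)], boot[int(0.975 * n_boot) - 1])

print("Model 1 (normal):     ", ci_normal)
print("Model 2 (log-normal): ", ci_lognormal)
```

The Model 1 interval comes out at roughly (-650 kg, 15300 kg), including the impossible negative values. Any interval from Model 2 is positive by construction, since it is built from exponentials.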
When a model clashes with reality – independence vs time series
Simulated water temperature series with expectancy $\mu = 10$. Assume the variability is known, $\sigma = 2$. Want to estimate $\mu$ and test $\mu = 10$.
• Model 1, independence: $T_i = \mu + \sigma\epsilon_i$, $\epsilon_i \sim N(0,1)$ i.i.d.
– The graph seems to tell a different story...
– Estimated: $\hat{\mu} = \bar{x} \approx 11.4$, $\mathrm{sd}(\hat{\mu}) = s/\sqrt{n} \approx 0.2$
– 95% conf. int. for $\mu$: (11.02, 11.80). $\mu = 10$ rejected!
• Model 2, auto-correlated model with expectancy $\mu$, standard deviation $\sigma$ and auto-correlation a.
– Linear dependency between the temperature one day and the next.
– Estimated: $\hat{\mu} = \bar{x} \approx 11.4$, $\mathrm{sd}(\hat{\mu}) \approx 1.4$
– 95% conf. int. for $\mu$: (8.7, 14.1). $\mu = 10$ not rejected.

Some notation:
• Use Pr(x) to denote the probability that a certain random variable (X) has the value x. This is not possible for continuous variables (except in the form Pr(a<X<b)). Keep in mind that probabilities should sum to one: Pr(A)+Pr(not A)=1.
• Use f(x) to denote the probability density of a continuous random variable (X), as a function of its input argument, x. Use f for different such variables, using the input argument to denote which density we are looking at (f(x) is the density of x, f(y) is the density of y). Keep in mind that a probability density should integrate to one: $\int f(x)\,dx = 1$.
• I will switch between probability and probability density, since some of the variables we may be looking at are discrete, while others are continuous.

Parameters, observations and latent variables – Observations, D
• Observations, D:
o Before getting them, they are random variables. You cannot accurately predict them. They have an element of stochasticity. You can assign a statistical probability distribution to them (the model).
o After getting them, they should tell you something about their distribution, and this in turn should answer the questions you are asking.
o Do not gather data that are not relevant to your questions! (If you have a choice between gathering data that are a little relevant or a lot relevant, choose the latter.)
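
Returning to the water temperature example earlier in this section, the effect of ignoring auto-correlation can be sketched as follows. This is a minimal illustration and not the simulation behind the slides: the series length `n`, the auto-correlation `a`, the seed, and the helper names are all hypothetical, and the standard error under dependence uses the standard large-n approximation for a stationary AR(1) process.

```python
import math
import random

def simulate_ar1(n, mu, sigma, a, rng):
    """Stationary AR(1) series: T_t = mu + a * (T_{t-1} - mu) + innovation."""
    innov_sd = sigma * math.sqrt(1.0 - a * a)  # keeps the marginal sd equal to sigma
    t = rng.gauss(mu, sigma)                   # start in the stationary distribution
    series = []
    for _ in range(n):
        series.append(t)
        t = mu + a * (t - mu) + rng.gauss(0.0, innov_sd)
    return series

def se_independent(sigma, n):
    """Standard error of the mean if the observations were independent."""
    return sigma / math.sqrt(n)

def se_ar1(sigma, n, a):
    """Large-n approximation: Var(mean) ~ sigma^2 / n * (1 + a) / (1 - a)."""
    return sigma / math.sqrt(n) * math.sqrt((1.0 + a) / (1.0 - a))

rng = random.Random(0)
n, mu, sigma, a = 100, 10.0, 2.0, 0.9  # n and a are assumed values for illustration
temps = simulate_ar1(n, mu, sigma, a, rng)
mean = sum(temps) / n

for label, se in (("independence", se_independent(sigma, n)),
                  ("AR(1)", se_ar1(sigma, n, a))):
    lo, hi = mean - 1.96 * se, mean + 1.96 * se
    print(f"{label}: se={se:.2f}, 95% CI=({lo:.2f}, {hi:.2f}), "
          f"contains mu: {lo <= mu <= hi}")
```

With positive auto-correlation the second standard error is always the larger one, which is why the independence model tends to reject the true expectancy too often.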
Parameters, observations and latent variables – Parameters, $\theta$
• Their values are assumed to be fixed but unknown.
• Getting data (D) should reveal something about the nature of the parameter set ($\theta$). The parameters should affect the outcome of observations: $\theta \to D$.
• I.e. the likelihood, $\Pr(D|\theta)$ or $f(D|\theta)$ as a function of $\theta$, should not be flat. Arrows (green in the diagram) here mean that we have a specification of the probability of D given $\theta$ (the likelihood).
• The reason we are interested in the parameter values is that they should answer some questions we may have.
• In frequentist statistics: we look at functions of the data which relate to the parameters in a simple way, estimators, $\hat{\theta}(D)$. An estimator is a random variable, just like the individual data.

Parameters, observations and latent variables – latent variables, L
• Latent variables (L) are unobserved but random: Pr(L) or f(L).
• They can add realism to our modelling. Stuff we observe can depend on unobserved states and processes (that have some element of unpredictability/randomness in them).
• They affect the outcome of observations (D), just like parameters: Pr(D|L). Thus getting data should reveal something about the latent variables: $L \to D$.
• Since both D and L are stochastic, this is a conditional probability. We need to be able to deal with such...
• Since L are unknown random variables rather than unknown fixed values, we can use probability theory to sum up what we know about the latent variables, given the data: $D \to L$.

Conditional probabilities – definitions and intuition
• Pr(B|A) means the probability that B is the case, given that we know that A is the case. For example, Pr(rain | overcast) means the probability that it rains for those cases when it’s overcast.
• A is probabilistic evidence for B when Pr(B|A)>Pr(B).
• Technically, it is defined by looking at the distribution of both A and B and then “zooming in on A”: Pr(B|A)=Pr(A and B)/Pr(A).
So it’s the fraction of probability space where both A and B happen, relative to the fraction of the space of possibilities where A happens. We remove all possibilities of A not happening.
• Ex: Pr(rain and overcast)/Pr(overcast) = Pr(rain | overcast)
[Figure: two panels with rain shown as a dark area. Left panel (overcast and sunny): Pr(rain and overcast), dark area compared to the whole. Right panel (overcast only): Pr(rain | overcast), dark area compared to the whole.]

Conditional probabilities – combined probabilities
• If we run the definition of conditional probability backwards, we get the probability for a combination: Pr(A and B)=Pr(B|A)Pr(A). (We “zoom out” from the probability of B when A is certain to the probability of A and B, where neither A nor B is a certainty.)
• Ex: Pr(rain | overcast) Pr(overcast) = Pr(rain and overcast)
[Figure: two panels with rain shown as a dark area. Left panel (overcast only): Pr(rain | overcast), dark area compared to the whole. Right panel (overcast and sunny): Pr(rain and overcast), dark area compared to the whole.]

Conditional probabilities – dependence and information (1)
Independence means Pr(A and B)=Pr(A)Pr(B), which is equivalent to Pr(B|A)=Pr(B) and Pr(A|B)=Pr(A). For instance, the probability of getting a 6 on the second throw of the die is the same as the probability of getting a 6 on the second throw, given that you got a 3 on the first. Knowing the result of the first throw doesn’t help you predict the outcome of the next.
With dependency, the probabilities change when we condition. Getting information about A also gives us information (evidence) about B.
$A \to B$
The arrows describe how we model. In the case of $A \to B$, it says that we start with a probabilistic description of A (Pr(A)) and then supplement this with a description of B given A (Pr(B|A)). This gives us the combined probability, Pr(A,B)=Pr(A)Pr(B|A).
Typically, we build our models from our understanding of what affects what, and how. (For instance, occupancy affects detections but not vice versa.)

Conditional probabilities – dependence and information (2)
If B depends on A then A depends on B. Dependency flows both ways.
We only use arrows to show in which way we do our modelling (usually based on our understanding of what affects what): $A \to B$.
We can describe the dependency structure of several phenomena. Ex: $A \to B \to C$ or $A \leftarrow B \to C$, where the latter gives Pr(A,B,C)=Pr(B)Pr(A|B)Pr(C|B).
We may model by first sketching what influences what. That will then inform us about the structure of the combined probabilities. When we specify what influences what and *how*, we have a model.
Concrete examples:
A) Carrying capacity $\to$ Actual population size $\to$ Measured population size
B) Finch egg laying $\leftarrow$ Season $\to$ River discharge
In $A \to B \to C$, knowing B tells us something about both A and C. Knowing A tells us something about B and C. But, conditioned on B, A says nothing about C or vice versa. Pr(A,B,C)=Pr(A)Pr(B|A)Pr(C|B)

Conditional probabilities – from conditional probabilities to marginal probabilities
• Sometimes we want the distribution of a quantity without having to specify anything else. The law of total probability says that the marginal (unconditional) probability for B is:
$$\Pr(B) = \sum_{i=1}^{n} \Pr(B \text{ and } A_i) = \sum_{i=1}^{n} \Pr(B|A_i)\Pr(A_i)$$
where the $A_i$’s are mutually exclusive and $\sum_{i=1}^{n} \Pr(A_i) = 1$.
• Example: Pr(rain) = Pr(rain | overcast)Pr(overcast) + Pr(rain | sunny)Pr(sunny)
[Figure: the total rain area equals the rain area under overcast plus the rain area under sunny.]
• Useful when calculating likelihoods (later).

Conditional probabilities – Bayes theorem
• Looking at latent variables L and data D, we start out with a specification of L (the marginal probabilities) and a specification of D given L.
• Known: the data. Unknown: the latent variables. We’re interested in the opposite of what has been modelled: the marginal probability of the data and the probability of the latent variables given the data. Since D and L are dependent, we can do this.
• The law of total probability gives us the first:
$$\Pr(D) = \sum_{L'} \Pr(D|L')\Pr(L')$$
• The definition of conditional probabilities gives us the second, Bayes theorem:
$$\Pr(L|D) = \frac{\Pr(L,D)}{\Pr(D)} = \frac{\Pr(D|L)\Pr(L)}{\Pr(D)}$$
(Modelling goes $L \to D$; inference goes $D \to L$.)
For continuous variables, replace probabilities with probability densities and sums with integrals.
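
As a small worked illustration of the law of total probability and Bayes theorem for a discrete latent variable, consider the occupancy/detection setting mentioned earlier. All numbers and names below are hypothetical, chosen only to make the arithmetic easy to follow: the latent variable L is whether a site is occupied, and the data D is "no detection on one visit".

```python
def marginal(prior, likelihood):
    """Law of total probability: Pr(D) = sum over L' of Pr(D | L') * Pr(L')."""
    return sum(likelihood[state] * prior[state] for state in prior)

def posterior(prior, likelihood):
    """Bayes theorem: Pr(L | D) = Pr(D | L) * Pr(L) / Pr(D)."""
    pr_d = marginal(prior, likelihood)
    return {state: likelihood[state] * prior[state] / pr_d for state in prior}

# Hypothetical model specification (assumed values, not from the slides).
prior = {"occupied": 0.3, "empty": 0.7}             # Pr(L)
lik_no_detection = {"occupied": 0.4, "empty": 1.0}  # Pr(D = no detection | L)

pr_d = marginal(prior, lik_no_detection)   # 0.3 * 0.4 + 0.7 * 1.0 = 0.82
post = posterior(prior, lik_no_detection)  # Pr(occupied | no detection) = 0.12 / 0.82

print(f"Pr(no detection) = {pr_d:.3f}")
for state, p in post.items():
    print(f"Pr({state} | no detection) = {p:.3f}")
```

Failing to detect the species lowers the probability of occupancy from 0.3 to about 0.15 without ruling it out, which is the sense in which data reveal something about a latent variable.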