The generalization of Bayes for continuous densities is that we have some density f(y|) where y and are vectors of data and parameters with being sampled from a prior (|) where the are hyperparameters. If is known then Bayesian updating is p( | y, ) f (y | ) ( | ) f (y | ) ( | )d If is not known then updating depends upon a distribution h()the hyperprior p( | y, ) f (y | ) ( | )h()d f (y | ) ( | )h()dd For example, think back to the case of infested tree nuts in which now there is prob pk of any nut being infested in location k. Now suppose the pk’s for k=1,2 are possibly jointly distributed with a prior looking “like a Beta” with density: (a b c) a1 b1 g( p1, p2 ) p1 p2 (1 p1 p2 ) c1 (a)(b)(c) This is a Dirichlet distribution and the parameters a, b and c are hyperparameters for the prior distribution of the pk which are themselves parameters for a binomial distribution of samples of nuts from the 2 locations. If the hyperparameters are themselves determined through some process depending upon distance between the locations, then this is amenable to hierarchical Bayesian modeling. The hyperparameters (or a,b, c in the example) might specify how the parameters vary in space or time between observations. One possible approach is to estimate the for example by choosing it to maximize the marginal distribution of the data as a function of by choosing it to maximize p(y | ) f (y | ) ( | )d Giving an estimate ˆ) p( | y, ˆ and an estimated posterior This is called an empirical Bayes approach. For the example, we could estimate a,b,c from multiple samples of nuts from the two locations, and plug in these estimates, but this isn’t very “Bayes-like” since we ignore information about distance! Sufficient statistics Suppose we take a sample X1,…Xn from a distribution family {f(x,)} with a statistic Y1 = u1(X1,…Xn). Then Y1 is a sufficient statistic for if and only if for any other statistics Y2=u2(X1,…Xn), … , Yn=un(X1,…Xn) the conditional pdf h(y2,…,yn | y1 ) of Y2,…,Yn given Y1=y1 does not depend upon no matter what the value of y1 is. So given that we know Y1=y1 , it isn’t possible to use any other statistic Y2 to make any inference about . Y1 exhausts all the information about that is contained in the sample. Normal Distribution example of Bayesian updating Sample Y from N(,2) with 2 known and assume a Normal prior for with mean and variance 2. Then posterior for given Y=y is also Normal (a conjugate prior) with mean 2 2 2 2 2 2 y And variance (precision is 1/variance) 2 2 1 2 2 1 1 2 2 Now take a sample of size n Y=(y1,…,yn) from N(,2) with 2 known and assume a Normal prior for with mean and variance 2. Then posterior for given Y is the same as the posterior for given the sample mean y is Normal with mean 2 n 2 2 y 2 2 2 n n And variance (precision is 1/variance) 2 2 1 2 2 n 1 n 2 2 As gets large, posterior mean goes to y and variance goes to 2/n - the likelihood in the limit of noninformative prior. Bayesian approaches and Markov processes One of the most useful numerical techniques for Bayesian analysis involves the use of Markov Chains in an appropriate way to estimate the complex integrals that arise. So we do a quick summary of Markov chains, say something about their ergodic properties, and note how one can simulate them. A stochastic process is a collection of random variables indexed elements of a set (usually a time interval or generations over some time) A stochastic process is well defined if we know the joint distributions of any collection of random variables in it. A Markov process is a stochastic process for which the distribution of future process values, given the present value, is independent of the past values (Future is independent of the past, given the present) P[X tn1 x n 1 | X t n x n ,..., X t1 x1 ] P[X t n1 x n 1 | X tn x n ] For any t1< …< tn < tn+1 and x1, …, xn, xn+1 This property makes it easy to calculate the joint distributions of the process at any list of times by knowing the initial distribution and the transition probability functions which specify the probability of transition to any value given the present value. A Markov chain is a Markov process for which the states take on values from some discrete set (e.g. in classic pop gen, it would be allele frequencies 0, 1/2N, …., 1). If the index set is discrete time, we describe transitions by a transition matrix Pi, j (n) P[X t j | X t i] n1 n Which gives the i,j entry of the matrix at time n. If the matrix is the same at any time, the transition is called stationary (note this does not mean matrix the Markov process is stationary - it will generally not be unless the initial condition is its stationary distribution). For continuous-time Markov processes, the matrix is replaced by a probability density function, conditioned on the present. Ergodicity is a property of some stochastic processes for which there are consistent, long-term behaviors, the simplest being a limiting distribution meaning that the process eventually comes close to having a particular distribution Pi, j (n) P[X tn1 j | X t n i] j as n Ergodicity implies that for long-enough samples, time averages and sample averages are equivalent. A Markov Chain with stationary transition matrix which is aperiodic, irreducible and recurrent has a stationary distribution which is its limiting distribution (it is approached independent of initial distribution). For discrete-time, finite state space case, limiting distribution is left eigenvector of transition matrix for the dominant eigenvalue (1). Continuous time Markov chains have close connections to the exponential distribution since the distribution of time in any state is exponential - let the rate of transition from state i to state j be qij then a transition from state i takes place after an exponential amount of time with parameter q ij ji At the time of transition the Chain moves from state I qij to state j with probability pij qik ki The matrix Q is the infinitesimal generator of the Chain where so the rows sum to 0 qii qij ji A discrete time sample of a continuous time MC can be viewed in two ways - as following the MC at fixed intervals, so in some intervals the chain doesn’t change state, or by following the MC only at times of observable events (discrete event simulation view) when it changes state The discrete event view provides a method to simulate a MC - sometimes called the Gillespie algorithm for the application to chemical reaction transitions To simulate: (i) choose initial state i based upon some initial distribution using a pseudo-random number qij pij generator. qii qij q ij q ik ji ji ki (ii) choose transition event time by simulating an exponential distribution with rate parameter -qii (iii) Choose next state to be state j with probability qij pij qik ki (iv) Go back to (ii) with new state if total time is less than desired simulation stopping time (v) Otherwise stop, or if multiple trajectories are desired, go back to (i)