Time series analysis
When today has an impact on what happens tomorrow

Statistical time series are data in time, where what happens at one point in time depends on what happens at other points in time. The past and the present give information about the future. Examples:
• population size
• the state of the environment, for instance river discharge
• the state of a single organism
• the number of organisms in a given state
• gene frequency
• the phenotype (such as body size) of a lineage in deep time
• the number of species in a clade in deep time
etc.

Explicit time dependency: the system is a function of time with only independent noise in addition (typical in physics). Ordinary statistical regression suffices. [Figure: discharge plotted against time.]

Implicit time dependency: there is dependency between what happens at one time point and another, even after we account for explicit time. This is what I'm looking at here.

When model clashes with reality – regression between time series

Checking for correlation between two datasets using standard statistical regression can easily go wrong when each dataset is a time series, because the model assumption of independence does not hold. Uncertainties are typically underestimated. A test comparing how many standard errors away from zero the estimate is (a score test) can then easily report dependency even when there is none.

Here are two independently simulated time series. If we plot one against the other, we may easily be led to think there is dependency between the two series. In this case, a linear regression supports this. But this is only caused by both series being time series! Result from R, summary(lm(x2~x1)):

      Estimate  Std. Error  t value  Pr(>|t|)
x1    -0.47232     0.04747    -9.95   < 2e-16 ***

When model clashes with reality – independent noise vs time series

The "reality": a simulated "water temperature" with expectation (long-term average value) µ=10 and overall standard deviation σ=2. We wish to estimate µ and test whether µ=10.
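The spurious-regression effect above can be reproduced with a few lines of simulation. This is a minimal sketch in Python (the lecture uses R; numpy is assumed), mirroring what lm(x2~x1) does: two AR(1) series are simulated independently, yet the naive t statistic for the slope is typically far from zero.

```python
import numpy as np

rng = np.random.default_rng(1)

def ar1(n, a, mu=0.0, tau=1.0):
    """Simulate an AR(1) series: x[t] = a*x[t-1] + (1-a)*mu + tau*eps[t]."""
    x = np.empty(n)
    x[0] = mu
    for t in range(1, n):
        x[t] = a * x[t - 1] + (1 - a) * mu + tau * rng.standard_normal()
    return x

n = 200
x1 = ar1(n, a=0.95)   # the two series are simulated independently,
x2 = ar1(n, a=0.95)   # so there is no real dependency between them

# Naive OLS of x2 on x1, as lm(x2 ~ x1) would do in R
X = np.column_stack([np.ones(n), x1])
beta, *_ = np.linalg.lstsq(X, x2, rcond=None)
resid = x2 - X @ beta
s2 = resid @ resid / (n - 2)                 # residual variance estimate
cov = s2 * np.linalg.inv(X.T @ X)            # naive covariance of estimates
t_slope = beta[1] / np.sqrt(cov[1, 1])       # naive t statistic for the slope
print("slope:", beta[1], "naive t:", t_slope)
```

The naive t statistic is often well beyond ±2 even though the series are independent, because the independence assumption behind its standard error is violated.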
Model 1, independence: Ti = µ + σεi, εi ~ N(0,1) i.i.d.
• The graph seems to be telling a different story...
• Estimated: µ̂ = x̄ = 11.4, sd(µ̂) = σ/√n = 0.2
• 95% conf. int. for µ: (11.02, 11.80). µ=10 rejected with 95% confidence!

Model 2, auto-correlated model with expectation µ, standard deviation σ and auto-correlation a:
• Linear dependency between the temperature one day and the next: Ti = aTi-1 + (1-a)µ + τεi, εi ~ N(0,1) i.i.d.
• Estimated: µ̂ ≈ x̄ = 11.4, â = 0.958, sd(µ̂) = σ/√(1 + (n-1)(1-â)/(1+â)) ≈ 1.13
• 95% conf. int. for µ: (9.2, 13.6). µ=10 not rejected.

How to detect time dependency

There are several ways to get a glimpse into the nature of a time series (X1,…,Xn):
1. Auto-correlation: the estimated correlation between Xt and Xt+lag. Simple test: do the first autocorrelations go beyond ±1.96/√n?
2. Fourier analysis: decomposition of a time series into trigonometric contributions with different periodicities. No time dependency => white noise. Periodicity => some peaks will stand out. Auto-correlation => some kind of pattern.
3. Statistical model comparison.

Strategies for dealing with time dependency

1. Aggregate the data until you can assume time independence.
• Pro: easy to do.
• Con: you'll easily throw the baby out with the bath water. Also, some people may be curious about the nature of the time dependencies, and you don't know in advance how long an aggregation interval you need to assume time independence.

2. Use standard statistical analysis on models that allow for time dependency and that are reasonable for your data.
• Pro: if done properly, nobody can fault you for it.
• Con: you'll need to be able to do statistical analysis and quite possibly also programming (OpenBUGS/R/Matlab/SAS/C/C++/Fortran). Modeling can be complicated.

3. Find tools for time series analysis that are suitable (as defined by what kind of time series model you think is reasonable).
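The difference between the two models comes down to the standard deviation of the sample mean. A small Python sketch of the correction, using the slide's numbers (σ=2, n=100, â=0.958; these values are taken from the example above):

```python
import math

def sd_mean_ar1(sigma, a, n):
    """Std. dev. of the sample mean of an AR(1) series:
    sigma / sqrt(1 + (n-1)*(1-a)/(1+a)), the Model 2 formula above."""
    return sigma / math.sqrt(1 + (n - 1) * (1 - a) / (1 + a))

sigma, n = 2.0, 100
naive = sigma / math.sqrt(n)               # Model 1, independence: 0.2
corrected = sd_mean_ar1(sigma, 0.958, n)   # Model 2, a = 0.958: about 1.13
print(naive, corrected)
```

With a close to 1 the effective amount of information in the series is far smaller than n, which is why the confidence interval widens from about 0.8 units to about 4.4 units.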
• Pro: if you know enough to find an analysis tool that supports the kind of modeling that is needed, this is just as good as strategy 2. If you are good at searching for such tools, this can be time-efficient.
• Con: you need to be good at searching, and you may need some pre-knowledge about what words to search for. Also, it may be tempting to pick the first tool you find without thinking about whether the model behind it accurately represents your data. Also, sometimes the model may not be adequately described by the tool makers.

I'm directing this lecture towards strategy 2, but will focus mostly on the modeling, so that it's also relevant for strategy 3. Note that finding suitable statistical models is a big part of the work.

A simple auto-regressive model, AR(1)

Let's go back to the autoregressive model used in example 2, which was an AR(1) model:
Ti = aTi-1 + (1-a)µ + τεi, εi ~ N(0,1) i.i.d.
So f(Ti | Ti-1) = exp(-(Ti - aTi-1 - (1-a)µ)² / 2τ²) / √(2πτ²)

Markov chain: the process depends on the past only through the most recent time point, Ti-1. We can then find the likelihood (the probability density of the dataset).

Stationarity: no matter where you start from, Ti → N(µ, σ²) when i → ∞, where σ² = τ²/(1-a²). The autocorrelation drops exponentially. You can define a characteristic time as the time needed for the autocorrelation to drop below a fixed value (typically ½ or 1/e).

Using the likelihood, L(θ) = f(D|θ), θ: parameter set, D: data (time series)

Classical:
• Estimates: find the θ that maximizes L(θ). The maximum likelihood estimator, θ̂, has several nice properties. Optimization tools: "optim" or "nls" in R.
• Estimator uncertainty: this can be (asymptotically) derived from Fisher's information matrix:
PS: Nonparametric bootstrapping is typically not an option for time series analysis!
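Maximum likelihood for the AR(1) model can be sketched directly from the conditional density above. The lecture points to "optim" in R; this is a minimal Python analogue using scipy.optimize.minimize (numpy/scipy assumed), maximizing the conditional log-likelihood Σ log f(Ti | Ti-1):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)

# Simulate an AR(1) series T[i] = a*T[i-1] + (1-a)*mu + tau*eps[i]
n, a_true, mu_true, tau_true = 500, 0.8, 10.0, 1.0
T = np.empty(n)
T[0] = mu_true
for i in range(1, n):
    T[i] = a_true * T[i - 1] + (1 - a_true) * mu_true + tau_true * rng.standard_normal()

def neg_loglik(theta):
    """Minus the conditional log-likelihood (up to an additive constant),
    i.e. minus the sum of log f(T_i | T_{i-1}) for i = 2..n."""
    a, mu, log_tau = theta
    tau = np.exp(log_tau)           # log-parametrization keeps tau > 0
    pred = a * T[:-1] + (1 - a) * mu
    resid = T[1:] - pred
    return 0.5 * np.sum(resid**2) / tau**2 + (n - 1) * np.log(tau)

fit = minimize(neg_loglik, x0=[0.5, np.mean(T), 0.0])
a_hat, mu_hat, tau_hat = fit.x[0], fit.x[1], np.exp(fit.x[2])
print("a:", a_hat, "mu:", mu_hat, "tau:", tau_hat)
```

Conditioning on T1 (rather than modeling it via stationarity) keeps the likelihood a plain product of one-step transition densities, which is the simplest correct choice for moderately long series.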
var(θ̂)⁻¹ = -E[∂²l(θ)/∂θ²]|θ=θ̂, where l(θ) = log(L(θ))

• A score test can be used for testing whether a parameter has a given value (for instance zero):
(θ̂ - θ)/sd(θ̂) ~ N(0,1) ⇒ (θ̂ - 1.96 sd(θ̂), θ̂ + 1.96 sd(θ̂)) is a 95% conf. int.
• Alternatively, the likelihood ratio test can be used: 2(l(θ̂) - l₀(θ̂₀)) ~ χ²k, where k is the difference in the number of parameters between the null hypothesis and the alternative.
• Information criteria (AIC, BIC) typically minimize -2l(θ) + a complexity penalty term, in order to select between models of different complexity.

Using the likelihood, L(θ) = f(D|θ), θ: parameter set, D: data (time series)

Bayesian:
f(θ|D) = f(D|θ)f(θ) / f(D) ∝ L(θ)f(θ)
• Bayesian methods also use the likelihood, but together with a prior distribution for the parameters.
• Parameter uncertainty comes out as a distribution.
• Tools: WinBUGS/OpenBUGS.
• Often easier to do for complicated models than classical methods.
• Model testing is a bit more tricky, but favors parsimonious models even without penalty terms.

Variants of time series processes

The way we build the time series dependency can vary:
• Markov chains (next slide)
• Hidden Markov chains
• Models that use Markov chains as building blocks
• Some completely different kind of dependency modeling (martingales?)

The nature of the outcomes must of course affect the model:
• Binary outcomes (Bernoulli)
• Categorical outcomes of larger size
• Count data up to a fixed upper ceiling (typically binomial)
• Count data with no fixed upper ceiling (Poisson, negative binomial)
• Real-valued data (often normal)
• Strictly positive real-valued data (a log transform brings you back)
• Multivariate data (often multinormal)

Models can deal with time in different ways:
• Discrete time (typically used for equidistant data)
• Continuous time (difficult)

General time series theory: Markov chains

I think we can agree. The past is over. – G. W. Bush

With time dependency, each new measurement depends on the past.
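The score test, the confidence interval and the likelihood ratio test above can be sketched numerically. The estimate, standard error and log-likelihood values below are made-up illustration numbers, not from the lecture's data (scipy assumed):

```python
from scipy.stats import chi2, norm

theta_hat, sd_hat = 0.42, 0.10          # hypothetical estimate and std. error
z = (theta_hat - 0.0) / sd_hat          # score-type statistic for H0: theta = 0
p_z = 2 * norm.sf(abs(z))               # two-sided p-value from N(0,1)
ci = (theta_hat - 1.96 * sd_hat, theta_hat + 1.96 * sd_hat)  # 95% conf. int.

# Likelihood ratio test: 2*(l(theta_hat) - l0(theta0_hat)) ~ chi^2_k
l_full, l_null, k = -120.3, -123.8, 1   # hypothetical maximized log-likelihoods
lrt = 2 * (l_full - l_null)
p_value = chi2.sf(lrt, df=k)
print("z:", z, "CI:", ci, "LRT:", lrt, "p:", p_value)
```

Both tests reject H0 here; the LRT has the advantage of also working when the normal approximation for θ̂ is poor.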
f(T1,…,Tn) = f(T1) f(T2|T1) f(T3|T2,T1) … f(Tn|Tn-1,…,T2,T1)

The likelihood is the product of all one-step-ahead predictions. Unless we do something to restrict the complexity, it will grow exponentially!

Markov chain: assume that the future depends on the past and present only through the present:
f(Ti | Ti-1,…,T2,T1) = f(Ti | Ti-1)

With a single model for all the transitions, the model complexity is greatly reduced. The start can be dealt with by having it as a parameter or by assuming stationarity. You can make a graph showing the dependencies: X1 → X2 → X3 → X4 → …

Markov chains – binary outcomes

Let's say we want to model rain at a given site. We set a threshold so that it's either raining (R) or not (N). Each day, the time series Xt is in one of these modes.

If it's raining one day, there's a probability that it will continue the next day, pRR = Pr(Xt=R | Xt-1=R), and a probability that it will stop, pNR = Pr(Xt=N | Xt-1=R) = 1 - pRR. Similarly, if it's not raining one day: pNN = Pr(Xt=N | Xt-1=N) and pRN = Pr(Xt=R | Xt-1=N) = 1 - pNN.

Parameters: pRN and pRR. Stationary probability of rain: pR = pRN / (1 + pRN - pRR).

Example: the sequence X1=R, X2=N, X3=N, X4=R, X5=R, X6=R, X7=N, starting from stationarity, has likelihood
L = pR (1-pRR)(1-pRN) pRN pRR pRR (1-pRR)

Video: Google "RUU" + "Markov"

Markov chains – life cycles

Markov chains when the outcomes belong to larger sets of categories: life cycles. Example: the xenomorphs in the Alien movie series:
egg → facehugger → youngling (chestburster) → adult → queen (transient states)

Transition probabilities are specified by a matrix (rows = old state, columns = new state); an arrow means a positive transition probability. If the life cycle states are age categories, then this, plus reproduction rates, can be used for calculating population sizes in different age categories. This gives the Leslie matrix approach.
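The two-state rain chain is small enough to compute by hand, but a sketch makes the stationary distribution and the sequence likelihood concrete (plain Python, no libraries; the parameter values in the usage example are made up):

```python
import math

def stationary_rain(p_RR, p_RN):
    """Stationary probability of rain: p_R = p_RN / (1 + p_RN - p_RR)."""
    return p_RN / (1 + p_RN - p_RR)

def rain_loglik(seq, p_RR, p_RN):
    """Log-likelihood of a sequence like "RNNRRRN", starting from stationarity."""
    trans = {("R", "R"): p_RR, ("R", "N"): 1 - p_RR,
             ("N", "R"): p_RN, ("N", "N"): 1 - p_RN}
    p_R = stationary_rain(p_RR, p_RN)
    ll = math.log(p_R if seq[0] == "R" else 1 - p_R)
    for prev, cur in zip(seq, seq[1:]):       # one factor per transition
        ll += math.log(trans[(prev, cur)])
    return ll

# Hypothetical parameter values for illustration
p = stationary_rain(0.7, 0.2)     # = 0.2 / 0.5 = 0.4
print(p, rain_loglik("RNNRRRN", 0.7, 0.2))
```

A quick sanity check on the stationary value: p·pRR + (1-p)·pRN must equal p again, which it does for these numbers (0.4·0.7 + 0.6·0.2 = 0.4).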
"Dead" is an absorbing state.

Markov chains – the Wright-Fisher model (time-dependent binomial model)

Count data outcomes: we have an allele, A, which is neutral (compared to its counterpart a), in a population of N diploid organisms. Of interest: the number of A's at a given time, Xt, which can vary from 0 to 2N. The probability of an A in any given position in the next generation is independent of the other positions in that generation, but proportional to the number of A's in the previous generation. So

Xi | Xi-1 ~ binom(2N, Xi-1/2N)

There are two absorbing states here, namely 0 and 2N.

PS: Even if A were not a neutral allele, you could fit a WF process to the data you had. The likelihood would however be low (compared to a model with differential fitness). The likelihood tells you how well the model predicts the time series data rather than how well it fits. R code for this simulation is given on my web pages.

Markov chains – the random walk

A random walk (RW) has real-valued outcomes. We start from any position (often X1=0) and then add independent random noise each turn, so that
Xt = Xt-1 + εt, where εt ~ N(µ, σ²)

If µ=0, it is an unbiased RW. If not, it will tend to wander in a given direction. Since we're constantly adding noise, the variance becomes larger and larger: Var(Xt) = tσ². The random walk is thus not stationary! (AR(1) is.)

Random walks have been proposed as a null hypothesis for large-time-scale evolution. Random walks can be defined in continuous time as well as in discrete time.

Just as with the WF model for count data, RWs can be fitted (perfectly) to any data with continuous outcomes. That doesn't mean it was a random walk that produced the data. PS: It's easy to see "patterns" that aren't really there.

ARMA models

An ARMA model contains two components: an AutoRegressive part (a Markov chain) and a Moving Average part.
Xt = φ1Xt-1 + … + φpXt-p + θ1εt-1 + … + θqεt-q + εt

With only the autoregressive part, this would be a Markov chain (conditioned on the state (Xt-1,…,Xt-p)). The extra dependency on past noise terms ruins this. But note that these noise terms are themselves the simplest type of Markov chain, namely one with no dependency on the past whatsoever. Thus ARMA models are put together from components that are Markov chains.

Pro: analysis of ARMA models is implemented in many packages.
Con: they tend to be "black box" models. They can be fitted to data, but interpreting these models in the context of the field of study can be hard.

Connecting to other time series can be done through so-called "transfer terms":
Yt = g0Xt + g1Xt-1 + … + gp'Xt-p' + θ'1εt-1 + … + θ'q'εt-q' + εt

Hidden Markov chain models

A hidden Markov model contains several "layers" of explanation. You have a state that evolves according to a Markov chain. You are, however, not able to get accurate measurements of it, but you can get noisy measurements of some of its components.

State:        X1 → X2 → X3 → … → Xn
Observations: Y1   Y2   Y3   …   Yn   (over time)

For normal linear models, this is what's known as the Kalman filter. It's possible to do inference on the state and derive a likelihood analytically in such cases. For discrete states and outcomes, analytical treatment can also be possible (occupancy modeling). For cases where this can't be done analytically, you can either use MCMC techniques in Bayesian statistics, or so-called particle filters.

Example using the Kalman filter

Three water temperature series measured fairly close to each other. Some of the data was removed. State model: vectorial AR(1) with correlated noise between the three series, plus normal observational noise. The plots show how missing data are filled in, together with the inference uncertainty. Since the model allows for correlation between the sites, the temperature at one station informs about the temperature at another.
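Going back to the Wright-Fisher model above: the lecture's web pages provide R code for the simulation; this is a minimal Python sketch of the same process (numpy assumed, parameter values made up), showing the binomial transition and the two absorbing states:

```python
import numpy as np

rng = np.random.default_rng(42)

def wright_fisher(x0, N, generations):
    """Simulate allele counts X_t in {0,...,2N} under the Wright-Fisher
    transition X_t | X_{t-1} ~ Binomial(2N, X_{t-1} / (2N))."""
    x = [x0]
    for _ in range(generations):
        x.append(rng.binomial(2 * N, x[-1] / (2 * N)))
    return np.array(x)

# Start at allele frequency 1/2 in a population of N = 50 diploids
path = wright_fisher(x0=50, N=50, generations=200)
print(path[:10], "... final:", path[-1])
```

Once the count hits 0 or 2N, the binomial success probability becomes 0 or 1, so the path stays there forever: that is exactly what makes these states absorbing.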
Where data are missing in all stations, the uncertainty will "bubble" out. I could have used a vectorial AR(1) model, but instead I used its continuous-time parent, the Ornstein-Uhlenbeck process. With a continuous-time model, the state between measurements can also be inferred.

Hidden components

There are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don't know we don't know. – Donald Rumsfeld

Not all components of a Markov chain need to be directly measured. If some unmeasured components affect the process you are interested in, then the dynamics of the unmeasured components can affect the dynamics of that process. [Figure: auto-correlation of a phenotype affected by a dynamic optimum vs the same phenotype with the optimum assumed constant.]

The top layer (the phenotype) will not be Markovian by itself, but will be so conditioned on the processes that affect it. The system as a whole is a Markov chain. Even after taking into account our "known unknowns" there could be residual dependencies, suggesting the presence of relevant "unknown unknowns".

Time series resources

Web page: http://folk.uio.no/trondr/timeseries_course

Books:
• Box, Jenkins & Reinsel: Time Series Analysis (this is the book that introduced ARMA models)
• Shumway & Stoffer: Time Series Analysis and Its Applications (ARMA models and Fourier analysis)
• Taylor & Karlin: An Introduction to Stochastic Modeling (contains much about finite-state Markov models, mostly discrete time but a little about continuous-time processes as well)
• West & Harrison: Bayesian Forecasting and Dynamic Models (built around the Kalman filter: hidden Markov models with linear normal updates and (mostly) known parameters)

Continuous time processes

• The Poisson process: independent events along a time axis, with at most one event at a given time point.
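The "bubbling out" of uncertainty over missing data can be demonstrated with a one-dimensional version of the Kalman filter for an AR(1) state; the lecture's example is three-dimensional with an Ornstein-Uhlenbeck state, so this is only a simplified sketch (numpy assumed, all parameter values made up):

```python
import numpy as np

def kalman_ar1(y, a, mu, q, r):
    """Filter an AR(1) state x_t = a*x_{t-1} + (1-a)*mu + N(0, q) noise,
    observed as y_t = x_t + N(0, r) noise; np.nan marks missing data.
    At missing points only the prediction step runs, so the filtered
    variance grows until the next observation pulls it back down."""
    m, v = mu, q / (1 - a**2)        # start from the stationary distribution
    means, variances = [], []
    for obs in y:
        m = a * m + (1 - a) * mu     # prediction step: mean
        v = a**2 * v + q             # prediction step: variance
        if not np.isnan(obs):        # update step, skipped when missing
            k = v / (v + r)          # Kalman gain
            m = m + k * (obs - m)
            v = (1 - k) * v
        means.append(m)
        variances.append(v)
    return np.array(means), np.array(variances)

y = np.array([10.2, 9.8, np.nan, np.nan, 10.5])
means, variances = kalman_ar1(y, a=0.9, mu=10.0, q=0.5, r=0.1)
print(variances)   # grows over the two missing points, then drops
```

The filtered variance rises step by step across the gap and collapses again at the next observation, which is the one-dimensional analogue of the uncertainty bands "bubbling" out in the plots.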
If you count the number of events in an interval, the count is Poisson distributed. The time to the next event, from any given starting point, is exponentially distributed.
• Birth-death processes: count data (population size). At most one birth or death at a given time point. Specified with infinitesimal transition probabilities.
• Stochastic differential equations: real-valued outcomes. Differential equations plus infinitesimal normal contributions. Examples: the continuous-time random walk, the Ornstein-Uhlenbeck process.
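The two Poisson-process facts above (exponential waiting times, Poisson counts) suggest a direct simulation recipe: keep adding exponential gaps until the horizon is passed. A minimal sketch (numpy assumed, rate and horizon made up):

```python
import numpy as np

rng = np.random.default_rng(3)

def poisson_process(rate, t_max):
    """Event times of a Poisson process on [0, t_max]: cumulative sums of
    independent Exp(rate) waiting times, stopped at the horizon."""
    times, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / rate)   # time to the next event
        if t > t_max:
            return np.array(times)
        times.append(t)

events = poisson_process(rate=2.0, t_max=100.0)
count = len(events)   # roughly Poisson(rate * t_max), i.e. mean 200 here
print(count)
```

Because the exponential waiting times are strictly positive, no two events share a time point, matching the "max one event at a given time point" property.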