When today has impact on what happens tomorrow
Time series analysis
Statistical time series are data in time, where what happens at one point in time is
dependent on what happens at other points in time.
The past and the present give information about the future.
Examples:
• population size
• the state of the environment, for instance river discharge
• the state of a single organism
• the number of organisms in a given state
• gene frequency
• the phenotype (such as body size) of a lineage in deep time
• the number of species in a clade in deep time
etc, etc, etc.
Explicit time dependency: When the system is a function of time with only independent noise in addition (typical in physics). Ordinary statistical regression suffices.
[Figure: discharge plotted against time.]
Implicit time dependency: When there is dependency between what happens at one time point and another, even after we account for explicit time. This is what I’m looking at here.
When model clashes with reality
– Regression between time series
Checking for correlation between two datasets using standard statistical regression analysis can often go wrong when each dataset is a time series. This is because the model assumption (independence) does not hold. Uncertainties are typically underestimated.
A test of how many standard errors the estimate lies from zero (here called a score test; strictly speaking a Wald-type test) can then easily indicate dependency even when there is none.
Here are two independently simulated time series.
If we plot one against the other, we may easily be led to think there’s dependency between the two series. In this case, a linear regression supports this. But this is only caused by both series being time series!
Result from R, summary(lm(x2~x1)):
    Estimate  Std. Error  t value  Pr(>|t|)
x1  -0.47232     0.04747    -9.95   < 2e-16 ***
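The effect can be reproduced with a small simulation. The following is a sketch in Python (standard library only) rather than the R used for the slide. It assumes both series are stationary AR(1) processes with the same coefficient, which is not necessarily how the plotted series were generated. The helper names `ar1`, `slope_t_value` and `false_positive_rate` are made up for this illustration.

```python
import math
import random

def ar1(n, a, rng):
    """Simulate a zero-mean AR(1) series with unit innovation variance."""
    x = [rng.gauss(0.0, 1.0) / math.sqrt(1.0 - a * a)]  # stationary start
    for _ in range(n - 1):
        x.append(a * x[-1] + rng.gauss(0.0, 1.0))
    return x

def slope_t_value(x, y):
    """t-value of the slope in an ordinary least-squares fit of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    rss = sum((yi - my - slope * (xi - mx)) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope / se

def false_positive_rate(a, n=200, reps=300, seed=1):
    """Share of independent series pairs where |t| > 1.96 (naive 5% test)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        t = slope_t_value(ar1(n, a, rng), ar1(n, a, rng))
        hits += abs(t) > 1.96
    return hits / reps

print("independent noise:", false_positive_rate(a=0.0))
print("AR(1), a = 0.95:  ", false_positive_rate(a=0.95))
```

With independent noise the naive test should reject in roughly 5% of the pairs. With strongly autocorrelated series it rejects far more often, even though the two series are unrelated.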
When model clashes with reality
– independent noise vs time series
The “reality”: A simulated “water temperature” with expectation (long term average value) µ=10.
Assume an overall standard deviation σ=2. We wish to estimate µ and test whether µ=10.
Model 1, independence: $T_i = \mu + \sigma\varepsilon_i$, $\varepsilon_i \sim N(0,1)$ i.i.d.
• The graph seems to be telling a different story...
• Estimated: $\hat\mu = \bar{x} = 11.4$, $\mathrm{sd}(\hat\mu) = \sigma/\sqrt{n} = 0.2$
• 95% conf. int. for µ: (11.02,11.80). µ=10 rejected with 95% confidence!
Model 2, auto-correlated model with expectation µ, standard deviation σ and auto-correlation a.
• Linear dependency between temperature one day and the next: $T_i = a T_{i-1} + (1-a)\mu + \tau\varepsilon_i$, $\varepsilon_i \sim N(0,1)$ i.i.d.
• Estimated: $\hat\mu \approx \bar{x} = 11.4$, $\hat{a} = 0.958$, $\mathrm{sd}(\hat\mu) = \sigma / \sqrt{1 + (n-1)(1-\hat{a})/(1+\hat{a})} \approx 1.13$
• 95% conf. int. for µ: (9.2,13.6). µ=10 not rejected.
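The two standard deviations can be checked by plain arithmetic. This sketch is in Python rather than R. It assumes n = 100, which is not stated on the slide but follows from σ/√n = 0.2 with σ = 2. The helper `conf_int` is a name introduced here.

```python
import math

# n = 100 is an inferred value (from sigma / sqrt(n) = 0.2), not given on the slide
n, sigma, mean, a_hat = 100, 2.0, 11.4, 0.958

sd_indep = sigma / math.sqrt(n)
sd_ar1 = sigma / math.sqrt(1 + (n - 1) * (1 - a_hat) / (1 + a_hat))

def conf_int(estimate, sd, z=1.96):
    """Normal-approximation 95% confidence interval."""
    return (estimate - z * sd, estimate + z * sd)

ci_indep = conf_int(mean, sd_indep)
ci_ar1 = conf_int(mean, sd_ar1)
print(f"independence: sd = {sd_indep:.2f}, CI = ({ci_indep[0]:.2f}, {ci_indep[1]:.2f})")
print(f"AR(1):        sd = {sd_ar1:.2f}, CI = ({ci_ar1[0]:.2f}, {ci_ar1[1]:.2f})")
```

The independence interval computed from the rounded mean 11.4 differs slightly from the (11.02,11.80) on the slide, presumably because the slide used the unrounded mean.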
How to detect time dependency
There are several ways to get a glimpse into the nature of a time series (X1,…,Xn):
1. Auto-correlation: Estimated correlation between Xt and Xt+lag. Simple test: Do the first autocorrelations go beyond +/-1.96/√n?
2. Fourier analysis: Decomposition of a time series into trigonometric contributions with different periodicity.
• No time dependency => white noise.
• Periodicity: Some peaks will stand out.
• Auto correlation: Some kind of pattern.
3. Statistical model comparison
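The simple autocorrelation test in point 1 could look something like the sketch below (Python, standard library only). The simulated series and the function name `autocorr` are illustrative assumptions, not part of the lecture material.

```python
import math
import random

def autocorr(x, lag):
    """Sample autocorrelation between x[t] and x[t + lag]."""
    n = len(x)
    m = sum(x) / n
    c0 = sum((v - m) ** 2 for v in x)
    ck = sum((x[t] - m) * (x[t + lag] - m) for t in range(n - lag))
    return ck / c0

rng = random.Random(42)
n = 500
white = [rng.gauss(0, 1) for _ in range(n)]   # no time dependency
ar = [0.0]                                    # AR(1) with a = 0.9
for _ in range(n - 1):
    ar.append(0.9 * ar[-1] + rng.gauss(0, 1))

bound = 1.96 / math.sqrt(n)
for name, series in (("white noise", white), ("AR(1), a=0.9", ar)):
    acf = [autocorr(series, k) for k in range(1, 6)]
    outside = sum(abs(r) > bound for r in acf)
    print(f"{name}: lag-1 acf = {acf[0]:.2f}, "
          f"{outside} of the first 5 lags outside +/-{bound:.3f}")
```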
Strategies for dealing with time dependency
1. Aggregate the data until you can assume time independence.
• Pro: Easy to do.
• Con: You’ll easily throw the baby out with the bath water. Also, some people may be curious about the nature of the time dependencies. Also, you don’t know in advance how long an aggregation interval you need in order to assume time independence.
2. Use standard statistical analysis on models that allow for time dependency and that are reasonable for your data.
• Pro: If done properly, nobody can fault you for it.
• Con: You’ll need to be able to do statistical analysis and quite possibly also programming (OpenBUGS/R/Matlab/SAS/C/C++/Fortran). Modeling can be complicated.
3. Find tools for time series analysis that are suitable (as defined by what kind of time series model you think is reasonable).
• Pro: If you know enough to find an analysis tool that supports the kind of modeling that is needed, this is just as good as point 2. If you are good at searching for such tools, this can be time efficient.
• Con: You need to be good at searching, and you may need some prior knowledge about what words to search for. Also, it may be tempting to pick the first tool one finds without thinking about whether the model behind it accurately represents your data. Also, sometimes the model may not be adequately described by the tool makers.
I’m directing this lecture towards strategy 2, but will focus mostly on the modeling, so that it’s
also relevant for strategy 3. Note that finding suitable statistical models is a big part of
the work.
A simple auto-regressive model, AR(1)
Let’s go back to the autoregressive model used in example 2, which was an AR(1) model:
$T_i = a T_{i-1} + (1-a)\mu + \tau\varepsilon_i$, $\varepsilon_i \sim N(0,1)$ i.i.d.
So $f(T_i \mid T_{i-1}) = \exp\left(-(T_i - a T_{i-1} - (1-a)\mu)^2 / 2\tau^2\right) / \sqrt{2\pi\tau^2}$
Markov chain: The process only depends on the past through the most recent time point, $T_{i-1}$. Can find the likelihood (probability density of the dataset).
Stationarity: No matter where you start from, $T_i \to N(\mu, \sigma^2)$ when $i \to \infty$ (for |a|<1), where $\sigma^2 = \tau^2/(1-a^2)$.
The autocorrelation drops exponentially (it is $a^k$ at lag k).
You can define a characteristic time as the time needed for the autocorrelation to drop below a fixed value (typically below ½ or 1/e).
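The AR(1) quantities above can be written out as a minimal sketch in Python. The function names are introduced here for illustration. The likelihood is conditioned on the first observation, which is one of several ways of handling the start of the series.

```python
import math

def ar1_loglik(series, a, mu, tau):
    """Log-likelihood of T_2..T_n given T_1 under the AR(1) model."""
    ll = 0.0
    for prev, cur in zip(series, series[1:]):
        resid = cur - a * prev - (1 - a) * mu
        ll += -0.5 * math.log(2 * math.pi * tau ** 2) - resid ** 2 / (2 * tau ** 2)
    return ll

def stationary_sd(a, tau):
    """Standard deviation of the stationary distribution, requires |a| < 1."""
    return tau / math.sqrt(1 - a ** 2)

def characteristic_time(a, level=math.exp(-1)):
    """Lag at which the autocorrelation a**lag has dropped to `level`."""
    return math.log(level) / math.log(a)

a = 0.958                                   # estimate from the water temperature example
print("characteristic time (1/e):", round(characteristic_time(a), 1))
print("characteristic time (1/2):", round(characteristic_time(a, 0.5), 1))
print("tau giving sigma = 2:     ", round(2 * math.sqrt(1 - a ** 2), 3))
```

With a = 0.958 the autocorrelation needs about 23 time steps to fall to 1/e, which helps explain why the independence model was so overconfident in that example.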
Using the likelihood, L(θ)=f(D| θ)
θ: parameter set. D: data (time series).
Classical:
• Estimates: Find the θ that maximizes L(θ). The maximum likelihood estimator, $\hat\theta$, has several nice properties. Optimization tools: “optim” or “nls” in R.
• Estimator uncertainty: This can be (asymptotically) derived from Fisher’s information matrix:
$\mathrm{var}(\hat\theta)^{-1} = -E\left[\frac{\partial^2 l(\theta)}{\partial\theta^2}\right]\Big|_{\theta=\hat\theta}$, where $l(\theta) = \log(L(\theta))$
PS: Nonparametric bootstrapping is typically not an option for time series analysis!
• A score test can be used for testing whether a parameter has a given value (for instance zero):
$\frac{\theta - \hat\theta}{\mathrm{sd}(\hat\theta)} \sim N(0,1) \Rightarrow (\hat\theta - 1.96\,\mathrm{sd}(\hat\theta),\ \hat\theta + 1.96\,\mathrm{sd}(\hat\theta))$ is a 95% conf. int.
• Alternatively, the likelihood ratio test can be used: $2(l(\hat\theta) - l_0(\hat\theta_0)) \sim \chi^2_k$, where k is the difference in the number of parameters between the null hypothesis and the alternative.
• Information criteria (AIC, BIC) typically minimize -2l(θ)+complexity penalty
term, in order to select between models with different complexity.
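For the AR(1) model these steps can be carried out in closed form. The sketch below is in Python rather than R's “optim”. It assumes the likelihood is conditioned on the first observation, so that maximum likelihood reduces to least squares of T_i on T_{i-1}. The function names and the simulated data are illustrative only.

```python
import math
import random

def fit_ar1(series):
    """Conditional maximum likelihood for AR(1): least squares of T_i on T_{i-1}."""
    x, y = series[:-1], series[1:]
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    sxx = sum((u - mx) ** 2 for u in x)
    a = sum((u - mx) * (v - my) for u, v in zip(x, y)) / sxx
    c = my - a * mx                          # intercept, c = (1 - a) * mu
    tau2 = sum((v - c - a * u) ** 2 for u, v in zip(x, y)) / m
    loglik = -0.5 * m * (math.log(2 * math.pi * tau2) + 1)
    sd_a = math.sqrt(tau2 / sxx)             # from the inverse information matrix
    return {"a": a, "mu": c / (1 - a), "tau": math.sqrt(tau2),
            "loglik": loglik, "sd_a": sd_a}

def fit_independent(series):
    """Null model a = 0, fitted to the same responses T_2..T_n."""
    y = series[1:]
    m = len(y)
    my = sum(y) / m
    tau2 = sum((v - my) ** 2 for v in y) / m
    return {"mu": my, "loglik": -0.5 * m * (math.log(2 * math.pi * tau2) + 1)}

rng = random.Random(7)
series = [10.0]
for _ in range(499):                         # simulated AR(1): a = 0.8, mu = 10, tau = 1
    series.append(0.8 * series[-1] + 0.2 * 10.0 + rng.gauss(0, 1))

full, null = fit_ar1(series), fit_independent(series)
ci_a = (full["a"] - 1.96 * full["sd_a"], full["a"] + 1.96 * full["sd_a"])
lr = 2 * (full["loglik"] - null["loglik"])   # compare with chi-square, k = 1 (3.84 at 5%)
aic_full = -2 * full["loglik"] + 2 * 3       # parameters: a, mu, tau
aic_null = -2 * null["loglik"] + 2 * 2       # parameters: mu, tau
print(f"a = {full['a']:.3f}, 95% CI = ({ci_a[0]:.3f}, {ci_a[1]:.3f})")
print(f"LR statistic = {lr:.1f}, AIC full = {aic_full:.1f}, AIC null = {aic_null:.1f}")
```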
Using the likelihood, L(θ)=f(D| θ)
θ: parameter set. D: data (time series).
Bayesian:
$f(\theta \mid D) = \frac{f(D \mid \theta)\, f(\theta)}{f(D)} \propto L(\theta)\, f(\theta)$
• Bayesian methods also use the likelihood, but together with a prior distribution for the parameters.
• Parameter uncertainty comes out as a distribution.
• Tools: WinBUGS/OpenBUGS
• Often easier to do for complicated models than classical methods.
• Model testing is a bit more tricky, but favors parsimonious models even without penalty terms.
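To make the proportionality concrete, here is a sketch of a grid approximation to the posterior of the AR(1) coefficient a. It is written in Python and is not a replacement for WinBUGS/OpenBUGS. It assumes, for simplicity, that µ and τ are known and that the prior for a is flat on (-1, 1). All names and the simulated data are illustrative.

```python
import math
import random

def ar1_loglik(series, a, mu, tau):
    """Log-likelihood of T_2..T_n given T_1 under the AR(1) model."""
    ll = 0.0
    for prev, cur in zip(series, series[1:]):
        resid = cur - a * prev - (1 - a) * mu
        ll += -0.5 * math.log(2 * math.pi * tau ** 2) - resid ** 2 / (2 * tau ** 2)
    return ll

def grid_posterior(series, mu, tau, grid):
    """Posterior weights for a on `grid`, using a flat prior over the grid."""
    logpost = [ar1_loglik(series, a, mu, tau) for a in grid]  # flat prior: constant
    top = max(logpost)
    weights = [math.exp(lp - top) for lp in logpost]          # avoid underflow
    total = sum(weights)                                      # plays the role of f(D)
    return [w / total for w in weights]

rng = random.Random(3)
mu, tau, a_true = 10.0, 1.0, 0.8
series = [mu]
for _ in range(299):
    series.append(a_true * series[-1] + (1 - a_true) * mu + rng.gauss(0, tau))

grid = [i / 200 for i in range(-199, 200)]                    # a in (-1, 1)
post = grid_posterior(series, mu, tau, grid)
post_mean = sum(a * p for a, p in zip(grid, post))

# 95% credible interval from the cumulative posterior
cum, lower, upper = 0.0, None, None
for a, p in zip(grid, post):
    cum += p
    if lower is None and cum >= 0.025:
        lower = a
    if upper is None and cum >= 0.975:
        upper = a
print(f"posterior mean of a = {post_mean:.3f}, 95% interval = ({lower}, {upper})")
```

The parameter uncertainty comes out as a whole distribution (`post`), from which means and intervals are read off directly.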
Variants of time series processes
The way we build the time series dependency can vary:
• Markov chain (next slide)
• Hidden Markov chains
• Models that use Markov chains as building blocks
• Some completely different kind of dependency modeling (Martingales?)
The nature of the outcomes we can have must of course affect the model:
• Binary outcomes (Bernoulli)
• Categorical outcomes of larger size
• Count data up to a fixed upper ceiling (typically binomial)
• Count data with no fixed upper ceiling (Poisson, negative binomial)
• Real valued data (often normal)
• Strictly positive real valued data (a log transform brings you back)
• Multivariate data (often multinormal)
Models can deal with time in different ways:
• Discrete time (typically used for equidistant data)
• Continuous time (difficult)
General time series theory:
Markov chains
I think we can agree. The past is over. – G. W. Bush
With time dependency, each new measurement depends on the past.
$f(T_1,\dots,T_n) = f(T_1)\, f(T_2 \mid T_1)\, f(T_3 \mid T_2, T_1) \cdots f(T_n \mid T_{n-1},\dots,T_2,T_1)$
The likelihood is the product of all one-step-ahead predictions. Unless we do
something to restrict the complexity, it will grow exponentially!
Markov-chain: Assume that the future depends on the past and present only
through the present: f(Ti | Ti-1,…, T2,T1)= f(Ti | Ti-1)
With a single model for all the transitions, the model complexity will be greatly
reduced.
Can deal with the start by having it as a parameter or by assuming stationarity.
You can make a graph showing the dependencies:
X1 → X2 → X3 → X4
Markov chains – binary outcomes
Let’s say we want to model rain at a given site.
We make a threshold so that it’s either raining
(R) or not (N). Each day, the time series Xt will
be in one of these modes.
If it’s raining one day, there’s a probability that it
will continue the next day pRR=Pr(Xt=R| Xt-1=R)
and a probability it will stop,
pNR=Pr(Xt=N| Xt-1=R)=1- pRR.
Similarly, if it’s not raining one day: pNN=Pr(Xt=N|
Xt-1=N) and pRN=Pr(Xt=R| Xt-1=N)=1- pNN.
Parameters: pRN and pRR
Stationary: pR= pRN /(1+ pRN- pRR)
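Example (Y = rain, N = no rain): X1=Y, X2=N, X3=N, X4=Y, X5=Y, X6=Y, X7=N
L = pR (1-pRR) (1-pRN) pRN pRR pRR (1-pRR)
[Figure: two-state transition diagram with states R and N and transition probabilities pRR, pNR, pRN, pNN.]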
Video: Google ”RUU” +”Markov”
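The rain chain can be sketched in a few lines of Python. The code uses 'R' and 'N' for the two states (the example sequence on the slide writes Y for rain). The function names and the parameter values 0.3 and 0.6 are made up for illustration.

```python
def stationary_rain_prob(p_rn, p_rr):
    """Stationary probability of rain, pR = pRN / (1 + pRN - pRR)."""
    return p_rn / (1 + p_rn - p_rr)

def chain_likelihood(states, p_rn, p_rr):
    """Likelihood of a sequence of 'R'/'N' states, starting from stationarity."""
    p_r = stationary_rain_prob(p_rn, p_rr)
    like = p_r if states[0] == "R" else 1 - p_r
    for prev, cur in zip(states, states[1:]):
        p_rain = p_rr if prev == "R" else p_rn   # Pr(rain today | yesterday)
        like *= p_rain if cur == "R" else 1 - p_rain
    return like

def estimate_transitions(states):
    """Estimates of (pRN, pRR) from transition counts, conditioning on the first state."""
    counts = {("R", "R"): 0, ("R", "N"): 0, ("N", "R"): 0, ("N", "N"): 0}
    for prev, cur in zip(states, states[1:]):
        counts[(prev, cur)] += 1
    p_rr = counts[("R", "R")] / (counts[("R", "R")] + counts[("R", "N")])
    p_rn = counts[("N", "R")] / (counts[("N", "R")] + counts[("N", "N")])
    return p_rn, p_rr

states = list("RNNRRRN")                         # the seven-day example sequence
print("likelihood at pRN=0.3, pRR=0.6:", chain_likelihood(states, 0.3, 0.6))
print("estimated (pRN, pRR):", estimate_transitions(states))
```

With only seven days the estimates are of course very uncertain. The point is just that the likelihood is a product of one transition probability per day.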
Markov chains – life cycles
Markov chains when the outcomes belong to larger sets of categories: life cycles.
Example: The xenomorphs in the Alien movie series:
Transient states: egg, facehugger, youngling (chestbuster), adult, queen.
Absorbing state: dead.
Transition probabilities are specified by a matrix (rows = old state, columns = new state). In the diagram, an arrow means a positive transition probability.
If the life cycle stages are age categories, then this, plus reproduction rates used for calculating population sizes in the different age categories, gives the Leslie matrix approach.
Markov chains – the Wright-Fisher model
(time dependent binomial model)
Count data outcomes: We have an allele, A, which is neutral (compared to its counterpart a), in a population of N diploid organisms.
Of interest: The number of A’s at a given time, Xt, which can vary from 0 to 2N. The probability of an A in any given position in the next generation is independent of the other positions in that generation but proportional to the number of A’s in the previous generation. So
$X_i \mid X_{i-1} \sim \mathrm{binom}\left(\frac{X_{i-1}}{2N},\ 2N\right)$
There are two absorbing states here, namely 0 and 2N.
PS: Even if A was not a neutral allele, you could fit a WF process to the data you had. The likelihood would however be low (compared to a model with differential fitness). The likelihood tells you how well the model predicts the time series data rather than how well it fits.
R code for this simulation is given on my web pages.
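For readers without R at hand, a minimal Python sketch of such a simulation follows. It is not the author's R code, and the population size and starting count are arbitrary choices for illustration.

```python
import random

def wright_fisher(n_diploid, start_count, rng, max_generations=100_000):
    """Simulate the number of A alleles until fixation (2N) or loss (0)."""
    copies = 2 * n_diploid
    path = [start_count]
    while 0 < path[-1] < copies and len(path) <= max_generations:
        p = path[-1] / copies
        # Binomial(copies, p) draw, written as a sum of independent Bernoulli trials
        path.append(sum(rng.random() < p for _ in range(copies)))
    return path

rng = random.Random(11)
path = wright_fisher(n_diploid=20, start_count=20, rng=rng)
print("generations simulated:", len(path) - 1)
print("absorbed in state:", path[-1])
```

Every run ends in one of the two absorbing states, 0 or 2N, even though the allele is neutral.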
Markov chains – the Random Walk
A random walk (RW) has real valued outcomes. We start from any position (often X1=0) and then add independent random noise at each step, so that
$X_t \mid X_{t-1} = X_{t-1} + \varepsilon_t$, where $\varepsilon_t \sim N(\mu, \sigma^2)$
If µ=0, it is an unbiased RW. If not, it will tend to wander in a given direction.
Since we’re constantly adding noise, the variance will become larger and larger: after t steps from a fixed start, Var(Xt)=tσ². It is thus not stationary! (AR(1) is.)
Random walks have been proposed as a null hypothesis for large time scale evolution.
Random walks can be defined in continuous time as well as in discrete time.
Just as the WF model for count data, RWs can be fitted (perfectly) to any data with continuous outcomes. That doesn’t mean it was a random walk that produced it.
PS: It’s easy to see “patterns” that aren’t really there.
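The growing variance is easy to check by simulation. The following Python sketch simulates many unbiased walks from a fixed start at 0 and compares the empirical variance after t steps with tσ². The function name and the settings are illustrative.

```python
import random
import statistics

def random_walk(steps, mu, sigma, rng, start=0.0):
    """Return the positions [X_0, ..., X_steps] of a discrete-time random walk."""
    x = [start]
    for _ in range(steps):
        x.append(x[-1] + rng.gauss(mu, sigma))
    return x

rng = random.Random(5)
steps, sigma = 100, 1.0
walks = [random_walk(steps, 0.0, sigma, rng) for _ in range(2000)]

var_by_t = {}
for t in (10, 50, 100):
    var_by_t[t] = statistics.variance([w[t] for w in walks])
    print(f"t = {t:3d}: empirical variance = {var_by_t[t]:6.1f}, "
          f"t * sigma^2 = {t * sigma ** 2:.0f}")
```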
ARMA models
An ARMA model contains two components: an AutoRegressive part (Markov chain) and a Moving Average part.
$X_t = \varphi_1 X_{t-1} + \dots + \varphi_p X_{t-p} + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} + \varepsilon_t$
With only the autoregressive part, this would be a Markov chain (conditioned on the state $(X_{t-1},\dots,X_{t-p})$). The extra dependency on past noise terms ruins this. But note that these noise terms are themselves the simplest type of Markov chain, namely one with no dependency on the past whatsoever. Thus ARMA models are put together from components that are Markov chains.
Pro: Analysis of ARMA models is implemented in many packages.
Con: They tend to be “black box” models. They can be fitted to data, but interpreting these models in the context of the field of study can be hard.
Connecting to other time series can be done through so-called “transfer terms”.
$Y_t = g_0 X_t + g_1 X_{t-1} + \dots + g_{p'} X_{t-p'} + \theta'_1 \varepsilon_{t-1} + \dots + \theta'_{q'} \varepsilon_{t-q'} + \varepsilon_t$
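The ARMA recursion translates directly into a simulation. The Python sketch below assumes standard normal noise and uses an ARMA(1,1) example with coefficients chosen arbitrarily. The comparison value is the standard lag-1 autocorrelation formula for ARMA(1,1), which is not taken from the slides.

```python
import random

def simulate_arma(phi, theta, n, rng, burn_in=200):
    """Simulate ARMA(p, q) with N(0,1) noise; phi and theta are coefficient lists."""
    p, q = len(phi), len(theta)
    x = [0.0] * p                    # past values X_{t-1}, ..., X_{t-p}
    eps = [0.0] * q                  # past noise terms
    out = []
    for _ in range(n + burn_in):
        e = rng.gauss(0.0, 1.0)
        value = e
        value += sum(phi[j] * x[-1 - j] for j in range(p))
        value += sum(theta[j] * eps[-1 - j] for j in range(q))
        x.append(value)
        eps.append(e)
        out.append(value)
    return out[burn_in:]             # drop the start-up phase

def lag1_autocorr(series):
    n = len(series)
    m = sum(series) / n
    c0 = sum((v - m) ** 2 for v in series)
    c1 = sum((series[t] - m) * (series[t + 1] - m) for t in range(n - 1))
    return c1 / c0

phi1, theta1 = 0.5, 0.4
series = simulate_arma([phi1], [theta1], n=20000, rng=random.Random(9))
theory = (1 + phi1 * theta1) * (phi1 + theta1) / (1 + 2 * phi1 * theta1 + theta1 ** 2)
print(f"lag-1 autocorrelation: simulated {lag1_autocorr(series):.3f}, theory {theory:.3f}")
```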
Hidden Markov chain models
A hidden Markov model contains several “layers” of explanation. You have a state that evolves according to a Markov chain. You are however not able to get accurate measurements of it, but you can get noisy measurements of some of its components.
State: X1 → X2 → X3 → … → Xn
Observations: Y1, Y2, Y3, …, Yn (each Yt is a noisy measurement of the state Xt at the same time)
For normal linear models, inference can be done with what’s known as the Kalman filter. It’s possible to do inference on the state and derive a likelihood analytically in such cases. For discrete states and outcomes, analytical treatment can also be possible (occupancy modeling).
For cases where this can’t be done analytically, you can either use MCMC techniques in Bayesian statistics, or so-called particle filters.
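As an illustration of the normal linear case, here is a sketch of a scalar Kalman filter in Python. It assumes the hidden state is the AR(1) process used earlier, observed with independent normal noise, and that the first state has the stationary distribution as its prior. The function name, parameter values and data are made up for this example.

```python
import math

def kalman_ar1(obs, a, mu, tau, obs_sd):
    """Scalar Kalman filter for an AR(1) state observed with normal noise.

    Missing observations are given as None. Returns filtered means,
    filtered variances and the log-likelihood of the observed values.
    """
    mean, var = mu, tau ** 2 / (1 - a ** 2)   # stationary prior for the first state
    means, variances, loglik = [], [], 0.0
    for t, y in enumerate(obs):
        if t > 0:                             # predict one step ahead
            mean = a * mean + (1 - a) * mu
            var = a ** 2 * var + tau ** 2
        if y is not None:                     # update with the measurement
            s = var + obs_sd ** 2             # predictive variance of y
            loglik += -0.5 * (math.log(2 * math.pi * s) + (y - mean) ** 2 / s)
            gain = var / s
            mean += gain * (y - mean)
            var *= 1 - gain
        means.append(mean)
        variances.append(var)
    return means, variances, loglik

obs = [10.4, 10.9, None, None, None, 11.8]    # three missing measurements
means, variances, loglik = kalman_ar1(obs, a=0.9, mu=10.0, tau=0.5, obs_sd=0.3)
for y, m, v in zip(obs, means, variances):
    print(f"obs = {y}, state mean = {m:.2f}, state sd = {math.sqrt(v):.2f}")
```

The state uncertainty grows through the run of missing values and shrinks again once a measurement arrives. The returned log-likelihood is what a classical or Bayesian parameter analysis would be built on.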
Example using the Kalman filter
Three water temperature series measured fairly close to each other. Some of the data was removed.
State model: Vectorial AR(1) with correlated noise between the three series. Normal observational noise.
The plots show how missing data are filled in and shown together with the inference uncertainty. Since the model allows for correlation between the sites, the temperature at one station informs about the temperature at another. Where data are missing at all stations, the uncertainty will “bubble” out.
Could have used a vectorial AR(1) model, but instead I used its continuous time parent, the Ornstein-Uhlenbeck process. With a continuous time model, the state between measurements can also be inferred.
Hidden components
There are known knowns; there are things we know we know.
We also know there are known unknowns; that is to say we know there are some things we do not know.
But there are also unknown unknowns – the ones we don't know we don't know. - Donald Rumsfeld
Not all components of a Markov chain need to be directly measured. If some unmeasured components affect the process you are interested in, then the dynamics of the unmeasured components can affect the dynamics of that process.
[Figure: Auto-correlation of phenotype affected by a dynamic optimum vs. the same for phenotype with the optimum assumed constant.]
The top layer (the phenotype) will not be Markovian by itself, but will be so conditioned on the processes that affect it. The system as a whole is a Markov chain.
Even after taking into account our ”known unknowns” there could be residual dependencies, suggesting the presence of relevant ”unknown unknowns”.
Time series resources
• Web page: http://folk.uio.no/trondr/timeseries_course
• Books:
  - Box, Jenkins & Reinsel: Time Series Analysis (This is the book that introduced ARMA models)
  - Shumway & Stoffer: Time Series Analysis and Its Applications (ARMA models and Fourier analysis)
  - Taylor & Karlin: An Introduction to Stochastic Modeling (Contains much about finite state Markov models, mostly discrete time but a little about continuous time processes also)
  - West & Harrison: Bayesian Forecasting and Dynamic Models (Built around the Kalman filter. Hidden Markov Models having linear normal updates and with (mostly) known parameters.)
Continuous time processes
• The Poisson process:
  - Independent events.
  - Max one event at a given time point.
  - If you count the number of events in an interval, it will be Poisson distributed.
  - The time to the next event, from any given starting point, is exponentially distributed.
  [Figure: events marked along a time axis.]
• Birth-death processes:
  - Count data (population size).
  - Max one birth or death at a given time point.
  - Specified with infinitesimal transition probabilities.
• Stochastic differential equations:
  - Real valued outcomes.
  - Differential equations plus infinitesimal normal contributions.
  - Examples: continuous time random walk, Ornstein-Uhlenbeck
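The two Poisson process properties listed above (exponential waiting times and Poisson distributed counts) can be tied together in a small simulation. This Python sketch builds event times from exponential waiting times and checks that the counts in a fixed interval have mean and variance close to rate × length, as a Poisson distribution should. The rate, interval length and function name are arbitrary choices for illustration.

```python
import random

def poisson_process(rate, horizon, rng):
    """Event times on [0, horizon], built from exponential waiting times."""
    times, t = [], rng.expovariate(rate)
    while t <= horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times

rng = random.Random(2)
rate, horizon = 2.0, 5.0
counts = [len(poisson_process(rate, horizon, rng)) for _ in range(2000)]
mean_count = sum(counts) / len(counts)
var_count = sum((c - mean_count) ** 2 for c in counts) / (len(counts) - 1)
print(f"expected count = {rate * horizon:.1f}")
print(f"simulated mean = {mean_count:.2f}, simulated variance = {var_count:.2f}")
```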