Day 2

advertisement
How to get data and model to fit together?
Experimental planning
Statistical methods can be used in order to better plan ones
experiments:
• You might want to be able to detect (with 95% confidence) a
difference larger than a given value between two sets of mean
precipitations with a given probability (power). If so, plot the
power as a function of the data size and see when the
rejection probability (power) goes above your wanted level.
• If you want to do enough stage-discharge measurements that
the uncertainty of the rating curve exponent (b) is with 90%
probability less than 0.1, you either need an analytical
expression for the uncertainty as a function of the data size
and the parameter values, or you need to do simulation.
When model + methodology clashes with
reality
Wish to have the relationship between
stage (h) and discharge (Q) on the
following form:
C
h
Q=C(h-h0)b
where h0 is the zero plane, b gives the
shape of the river profile and C has to do
with the width of the river.
This is adapted using a set of stagedischarge measurements.
With max. likelihoods estimation you get
infinite estimates for some datasets! The
If you stop the ML-optimization at any
given time, the fit is good but the
parameter values are unreasonable!
Q
b
h0
Datum, h=0
Frequentist estimation does not have a way to
code for what constitutes reasonable and
unreasonable parameter values.
Bayesian statistics, on the other hand…
Schools of statistics –
Bayesian statistics
Everything about our knowledge
concerning unknown quantities
(parameters and models) is
handled using probability theory.
Prior distribution
Data likelihood
f (D |  )
f ( | D) 
f ( )
f ( D)
Posterior distribution
Central in this is Bayes formula.
When using it for parametric inference, Bayes formula allows you to switch
between the distribution of data given the parameters (the likelihood) and
the distribution of the parameters given the data (the posterior
distribution).
Estimates (single values) are no longer the focus, the posterior distribution
is! If you want them, you can take the mean, median or mode (peak) of the
posterior distribution to use as estimates.
Bayesian statistics –
a medical warm-up
Imagine a sickness with a medical tests that always gives positive indication if
you have that sickness.
It’s is quite accurate, only giving false positives in 1% of the cases where the
patient doesn’t have the sickness.
The sickness is however rare, only one in a thousand has it.
If you test positive, what is the probability that you have the sickness?
Pr(sick | positive test)  Pr(positiv e test | sick)Pr(si ck)/Pr(pos itive test) 
Pr(positiv e test | sick)Pr(si ck)

Pr(positiv e test | sick)Pr(si ck)  Pr(positiv e test | healthy)Pr (healthy)
100% * 0.1%
 9%
100% * 0.1%  1% * 99.9%
There is thus only a 9% chance that you have the sickness, given that you test
positive! What is happening?
Bayesian statistics –
a graphic medical warm-up
One thousand people before the test, represented by small circles.
= Sick
= Healthy
Bayesian statistics –
a medical warm-up (3)
After the test, among the ones testing positive there
remains one sick and about 10 healthy persons:
= Sick
= Healthy
The probability that you have the sickness has increased
dramatically, but still ten out of eleven will be healthy
even though they have tested positive. Only about 9%
tested positive because they actually have the sickness.
A positive test is thus evidence (info increasing the
probability) for the sickness, but not so strong evidence
that we believe it’s more likely than not that you have the
sickness.
A naive frequentist doing model testing, would say that
the probability of testing positive when healthy (1%) is
less than the usual significance level (5%), And that all
who tested positive thus have the sickness with 95%
confidence. An experienced frequentist will call the state
of the patient a hidden variable rather than a parameter or
model, and then proceed using Bayes formula.
Prior knowledge – prior distribution
 A prior distribution should summarize the knowledge we have
concerning the model before the data arrives.
 Typically, one chooses a parametric distribution family first, from
convenience and from having the right characteristics with respect to the
nature of the parameter. Since these distributions again have
parameters, these are called hyperparameters. If one suspects this
choice can influence the results, one tried several candidate distributions
(robustness analysis).
 One then adapts the hyperparameters to whatever specific information
one has. For instance one can form a 95% credibility interval (an interval
encompassing 95% of the probability), by adjusting the parametric
distribution coding for the prior distribution.
 Common mistake: Looking at the data to determine what a reasonable
prior would be. This is prior-data feedback, and example of circular
reasoning. I can easily give unreasonable indications of uncertainty and
unreasonable model choices.
Prior knowledge – prior distribution (2)
Prior distributions are at first glance purely subjective, but can be made
acceptable to others by:
a. Incorporating common knowledge (including previous data)
concerning the field of interest (intersubjectivity).
b.
Look at the variations that are in nature itself. For instance, for
hydrological stations, what is the typical range of stage-discharge
rating curve parameters? Perhaps one can find “nature’s own prior
distribution”.
c.
Use so-called non-informative prior distributions.
PS: Should “distributions” are often not proper distributions. For
instance, there does not exist a probability distribution that gives
equal probability to all numbers on the real line. Still, improper prior
distributions can give proper posterior distributions.
PSS: Do not use this trick when doing model comparison!
Bayesian statistics – distributions
Bayes formula:
f ( D |  ) f ( )
f ( | D) 
f ( D)
(Only one single model here)
One starts off the analysis with two things:
1. A model that says how the data was produced and which parameters that characterizes
this distribution. This is the likelihood: f(D|).
2. A prior distribution, f(). Summarizes out pre-knowledge concerning the parameters.
From this, one can calculate the following:
• The posterior distribution: f(|D). This summarizes our state of knowledge after the data
has been handled. If you want estimates, you get it from this (means, medians or modus).
• Distributions of derived quantities: f (h( ) | D)   f (h( ) |  ) f ( | D)d
For instance: discharge at a given stage, when Q(h)=C(h-h0)b
• A prior prediction distribution, called the marginal likelihood or the model likelihood. f(D)
gives the probability of getting data outcomes unconditioned on the parameter values
(only conditioned on our pre-knowledge). Used in model comparison.
f ( D)   f ( D |  ) f ( )d
• A posterior prediction distribution, f(Dnew|D), the probability for new data outcomes, given
the old data. (This is an example of a derived quantity). This thus takes into account the
parameter uncertainty after the data has been handled.
f ( Dnew | D)   f ( Dnew |  , D) f ( | D)d
PS: A old posterior distribution will be the prior distribution when we want to handle new data.
The old posterior prediction distribution will be the new prior predictive distribution.
Bayesian statistics – comparison of probabilities
Bayes formula:
f ( | D, M ) 
f ( D |  , M ) f ( | M )
f (D | M )
We can see whether a parameter value increases in probability relative to another parameter value:
f (1 | D, M ) f ( D | 1 , M ) f (1 | M ) f ( D |  2 , M ) f ( 2 | M ) f ( D | 1 , M ) f (1 | M )

/


f ( 2 | D, M )
f (D | M )
f (D | M )
f ( D |  2 , M ) f ( 2 | M )
The parameter value1 increases in probability relative to 2 if f(D| 1,M)>f(D| 2,M), i.e. if the data is
more probable for parameter value 1 than for 2.
The same goes for models:
Pr( M 1 | D) f ( D | M 1 ) Pr( M 1 ) f ( D | M 2 ) Pr( M 2 ) f ( D | M 1 ) Pr( M 1 )

/


Pr( M 2 | D)
f ( D)
f ( D)
f ( D | M 2 ) Pr( M 2 )
A model increases in probability relative to another if the data is more probable (irrespective of the
parameter values) for that model than for the other, Pr(D|M1)>Pr(D|M2).
Most importantly: One do not gain anything from absolute probabilities. It’s only by comparing
probabilities that you learn something!
Bayesian statistics – model comparison
Technically, we do model comparison by using Bayes formula:
Pr(M | D) 
f (D | M) Pr(M)
f (D)
The engine in this inference is the marginal likelihood (prior predictive distribution)
f(D|M). When we compare these, we can get evidence for one model or the other.
Since prediction strength is the key, overcomplicated models (having larger
parameter uncertainties) are naturally penalized without having to do any extra
work!
Ex: Extrasensory perception:
Using answers of whether the experimenter had his hand
over the right or left hand of the subject, gave 18 correct
answer out of 30 questions. Assuming independence of
answers, we get the binomial distribution with either p=0.5
(no), or unknown (yes) uniformly distributed success rate.
Can show that the prior predictive distribution is uniform
also, giving equal probability to all outcomes.
Any outcome between 11 and 19 will be evidence for p=0.5
(see plot), 18 correct answers are thus more likely with
random guessing than with extrasensory perception.
Prior predictive distribution for p=0.5
(red ) and p unknown (blue)
Bayesian model average
One can make distributions of any derived quantity,
unconditioned on the parameters (prior and in this case
posterior prediction distributions):
f (h( ) | D, M )   f (h( ) |  , M ) f ( | D, M )d
(From the law of total probability)
Example: Stage-discharge rating curve conditioned only on
the data and the number of segments, not the rating curve
parameters.
In the same fashion, can find the distribution of a derived
quantity even unconditioned on the model:
f (h | D)   f (h( ) | D, M ) Pr( M | D)
Example: The stage-discharge rating curve given the data
(but not conditioned on the parameters nor the number of
segments).
Bayesian vs frequentist –
the pragmatic aspect
When the model complexity is below a certain threshold, frequentist
methods are typically easier. Above that threshold, Bayesian
methods become easier.
Work
Frequentist
Bayesian
Complexity
Simulation and the law of large numbers
Assume you are interested in the properties of a stochastic variable
(probabilities, mean, quantiles, standard deviation etc). Assume further that
you can calculate these things analytically. What you however can do is to
sample from that variable.
With enough samples (an ensemble), you can estimate probabilities,
means, quantiles and standard deviations. Ex:
 Calculate the probability of getting yatzi from an algorithm for handling
dice throws and the rules of yatzi.
 Estimate the probability of an error situation in a production system,
given the error rates of each component of that system.
 Calculate the expected discharge from an ensemble of equally probable
weather forecasts.
 Find the number of data necessary to decrease the uncertainty of a
parameter below a given value with a given probability.
 Find the properties of the posterior distribution given samples from it (via
MCMC sampling).
Bayesian statistics – numerical methods:
MCMC
Reminder, Bayes formula (for only one model):
f ( D |  ) f ( )
Marginal distribution: f ( D)   f ( D |  ) f ( )d
f ( | D) 
This rascal is problematic. Not all integrals can be
f ( D)
calculated analytically.
A normalization constant is a number in a distribution which do not depend
on whatever you are taking the distribution over (in this case the parameter
set, ). In this case, f(D) is an unknown normalization constant.
A Markov chain (more about that later) is a time series where the values
”now” depend only on the previous value. Some such time series stabilize
to some distribution when running for enough time
It is possible to make a Markov chain that has the stationary distribution
equal to the distribution you’re after, without knowing the normalization
constant. This is called MCMC (Markov chain Monte Carlo).
WinBUGS is a system which automatically runs MCMC sampling given a
model, a prior distribution and the data (Alt: Make your own MCMC module
in R).
Bayesian statistics – more MCMC
Generally, an MCMC routine goes like this:
1. Make a starting parameter set, old.
2. Find a way (a proposal distribution*) to sample a new parameter set
given the old: new~g(new| old)
f ( new | D) g ( new |  old )
/
3. Accept the new parameter set with probability
f ( old | D) g ( old |  new )
use the old set if not.
PS: Normalization
4. Go back to 2 as many times as you want
disappears
spacing
Burn-in
* The proposal distribution determines how efficient the algorithm is.
Important concepts:
Burn-in: Number of
samples needed before
the time series converges
towards the stationary
distribution.
Spacing: Number of
samples needed before
you can keep one as an
approximately
independent.
Regression
 Regression is when one stochastic variable (the
response) depends on other variables (covariates /
explanation variables).
 A part of the variation in the response variable is thus
explained by the variation in the other variables.
weight
Example: Body weight
(response) versus height
(covariate)
height
Linear regression
 A linear regression examines the linear relationship
between the response and one or more covariates:
Y=0+1x1+2x2+…+pxp
• Note that the model is linear in the regression
parameters, 0,…,p, but not necessarily in the covariates.
So the model Y= 0+1x+2x2 is a linear model.
• The statistical model behind this is the following:
Yi   0  1 x1,i   2 x2,i  ...   p x p ,i   i where  i ~ N (0,  )
is independent noise.
Linear regression –
example with only one covariate
weighti  a  b * height i   i
weight
The regression parameters, a and b,
can be fitted to the data using for
instance ML-estimation.
The graph shows the adapted
regression.
The model is weird though, since it
allows for negative expected weights
and weight measurements (because
of the assumption of normality).
height
weight
One can save the situation by doing
a log-transform on both response
and covariate. This means a powerlaw on the original scale:
weight i  A * height b i * Ei where Ei ~ log N (0,  )
height
Linear regression with
only one covariate
A regression with only one covariate is easy to represent both
mathematically, Y=+x, and graphically. Some terminology:
weight
• Fitted responses are when you use the
regression line for the actual data.
• A residual is the difference between
actual and fitted response.
Actual
response
Fitted
response
residual
x
height
For a single covariate, the correlation between actual and fitted response is
equal to the correlation between (actual) response and covariate.
The regression coefficient is related to the correlation in a simple manner:
sd (Y )
   XY
sd ( X )
This also goes for the data estimates (estimated regression parameter vs
empirical correlation and empirical standard deviations).
Multivariate linear regression –
an example
If you have more than one covariate, this is not a problem for linear
regression (though presenting the results in a graph is then problematic).
Example: Tree volume as a function of tree height and
diameter. If we log-transform everything and do a linear
regression, log(Vi)=0+1log(Hi)+2log(Di)+i, that’s the
same as searching for the following expression on the
original scale: V i CH i1 Di 2 * Ei
Diameter
Height
In R we get the following output:
0
1
2
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.64580 0.81473 -8.157
9.23e-09 ***
ld
1.98982 0.08026 24.793 < 2e-16 ***
lh
1.11597 0.20791 5.368
1.14e-05 ***
--Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.08275 on 27 degrees of freedom
Multiple R-squared: 0.975,
Adjusted R-squared: 0.9731
F-statistic: 526.4 on 2 and 27 DF, p-value: < 2.2e-16
This tells us that the estimated relationship is:
log(V)=-6.64+1.99*log(H)+1.12*log(D) or
V(D,H)D1.99*H1.12.
Multivariate linear regression –
more on the R output
Covariates
(really
parameters)
Standard errors (standard deviation of estimator)
ML estimates
0
1
2
t-value=estimate/standard error =
how many standard deviations away
from 0 is the estimate
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.64580 0.81473 -8.157
9.23e-09 ***
logdiameter 1.98982 0.08026 24.793 < 2e-16 ***
logheight
1.11597 0.20791 5.368
1.14e-05 ***
--Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.08275 on 27 degrees of freedom
Multiple R-squared: 0.975,
Adjusted R-squared: 0.9731
F-statistic: 526.4 on 2 and 27 DF, p-value: < 2.2e-16
R-squared is often called ”goodness of fit”. It is
the squared correlation between fitted response
and actual response. It is also the amount of
variance in the response explained by the
regression. The closer to 1 this is, the better the
fit.
P-value for hypothesis, =0
Test of variance (ANOVA) of
whether there is anything
significant in the relationship
between response and covariates.
Conclusion:
1=1=0 can be ruled out here. But 1 is
very near 2 and 2 is less than a
standard error away also. Thus the
geometrical relationship VD2H can
not be ruled out.
ANOVA
Analysis of variance is performed in order to
a) Check whether a continuous outcome is different for
different categories (one-way ANOVA), when you
have a single set of categories.
b) If it depends linearly or non-linearly (interaction) on
two sets of categories (two-way ANOVA).
Technically it is a sub-method of linear regression,
where all covariates are discrete. The tests are
performed by comparing various ways to estimate
the variance from residuals and from category
differences.
Linear regression – What happens
when one run amok in covariates
With the possibilities in linear regression, one can be
tempted to just add more and more covariates. The fit will
always improve.
weight
As an example, let’s put some
higher order polynomial terms in the
weight-to-height regression:
vi  0  1hi   2 hi   2 hi   4 hi   i
2
3
4
The fit will improve, but the ability of
the regression to predict new data
can easily decline. The relationship
becomes more and more chaotic,
because the parameter
uncertainties are increasing.
height
How to avoid running amok?
There are three strategies to avoid running amok in covariates:
1. Thing about the nature of the data (are the quantities strictly positive) and
what you want to do with your regression.
2. Use hypothesis testing or other model choice techniques to limit the
complexity. (PS: R reports p values for all regression parameters).
Point 2 can be done by…
a)
b)
c)
d)
starting with a simple model and add the most significant covariates until no
significant covariates remain.
starting with a sufficiently complex model and remove the most insignificant
covariates until only significant covariates are left.
running through all regression models and calculate information criteria. (Not
recommended when the number of possible covariates is large.)
using Bayesian methodology (similar to point a, b, c).
Uncertainty
The estimators in the regression comes with a certain
uncertainty (standard error in frequentist theory or
posterior distribution in Bayesian). This is reported by R.
weight
Prediction uncertainty
Estimation
uncertainty
When the confidence interval of a parameter
encompasses zero, one cannot reject the hypothesis that
the corresponding covariate has no effect.
Uncertainty in regression parameters affect the uncertainty
of the expected response as a function of the covariate(s).
Yˆ  ˆ0  ˆ1 x1  ˆ2 x2  ...  ˆ p x p
height
Prediction uncertainty
Predictions of new measurements have in addition an
uncertainty in the measurement noise also:
Yˆpred  ˆ0  ˆ1 x1  ˆ2 x2  ...  ˆ p x p  
It is therefore important to separate between estimation
uncertainty and prediction uncertainty in regression!
Simulated
dataset
Residuals
A residual is the difference between actual response and the fitted
response. It’s an estimate of the noise term. The residuals can give a hint
about whether the model assumptions are valid or not.
1.
2.
3.
4.
A clear trend in the residuals against
any covariate show that the function
itself is wrong.
A clear trend of the residuals in time
suggest that one is dealing with time
series or that there are gradual
changes in important unmeasured
covariates.
If the residuals do not appear to be
normally distributed, a transformation
might be in order or a completely
different type of regression may be
needed.
If the variation in the residual has a
trend (heteroscedasticity), the noise
terms are wrongly. Remodelling or
data transformation might be
necessary.
Data+
regression
Data+
regression
residuals
residuals
QQ plot
Data+
regression
residuals
Non-normal regression –
Generalized linear models
Sometimes the nature of the response is such that the normal distribution just isn’t
appropriate. The prime example of this is with counting data.
• If your response is of the type “k outcomes of a particular type out of n trials for
covariate x”, then a binomial model for the response is appropriate.
• If your response is of the type “k outcomes for covariate x” (no upper limit for k), then
a Poisson model can be appropriate.
GLM models are made first by assigning a distribution (normal, binomial, Poisson).
The you transform the salient parameter of that distribution (expectancy, success rate,
rate) to something that can take values on the real line (this is called the link). That
transformed parameter is then given as a linear model, 0+1x1+2x2+…+pxp.
Since this type of analysis is so common, it has a name (GLM) and ready-made
methods in R (called ‘glm’).
GLM with a binomial model is often called “logistic regression” (due to the standard
transformation type), while GLM with the Poisson model is called “Poisson
regression”.
Non-linear regression
Sometimes it’s simply not reasonable to have a linear relationship
between response and covariates. The nature of the data might suggest
a different form.
An example is stage-discharge rating curves with unknown zero plane
Q=C(h-h0)b
If h0 was known, a log-transform would make this into a linear
relationship. But when you don’t have h0, then this equation will also be
non-linear:
q=a+b*log(h-h0)
ML optimization is still possible, but only with numerical methods. In
rating curve analysis, you can actually solve for a and b analytically, so
that only h0 is optimized numerically.
For more complicated models, sophisticated optimization methods or
MCMC may be necessary. One danger with non-linear regressions is that
the likelihood can have multiple peaks (multimodality). This is the case
for multi-segmented rating curves.
Rating curve estimation at Gryta
Let’s look at the station Gryta, without assuming h0=0.
We can use ”brute force”, by looking at an interval of possible h0 values
going, hm, to hm-100m in steps of 1cm.
Looks like we can optimize the loglikelihood (and thus the likelihood) with
a value for h0 close to zero.
A closer looks reveals
that the optimal h0 is
+8cm.
Please note the previously mentioned phenomena that some likelihoods
get better the lower values you have for h0.
Bayesian regression
Let’s take another look at the station Gryta. Under Bayesian regression, a preknowledge is assumed to exist. This can be retrieved from the collection of
previously made rating curves (”nature’s prior). But for Gryta, we know that the
datum is set so that h00 and since it’s a weir with a V notch, we know that b  2.5
ought to be approximately true (from hydraulic theory).
In VFKURVE3, one sets the
prior distribution (or the
hyperparameters) in a separate
window.
Note that in Bayesian statistics,
there are fewer problems
concerning the handling of
multimodality. Simulation from the
posterior distribution becomes
slightly more difficult, but there are
efficient ways of dealing with the
problem.
Bayesian regression (2)
When one performs the analysis, the result is
a lot of samples form the posterior
distribution. In addition to estimates, you also
get a notion of the parameter uncertainty.
For parameters where we have assigned a
sharp pre-knowledge with most of the
probability mass within a small interval, the
posterior distribution will typically be inside
that interval also. (If not, we have prior-data
conflict).
Since the parameters has a distribution then
so also does the rating curve.
With lots of data and/or good prior knowledge, the
curve uncertainty can get quite small.
Generalized additive models
Generalized additive models are models
where the response is explained by the
added effect of functions of each
covariate:
y   0  g1 ( x1 )    g k ( xk )
The functions are not known but can be
arbitrarily complicated splines. A
penalty term for spline complexity is
added to the likelihood.
This makes this a borderline Bayesian
inference, since a penalty terms
function functions in all respects like a
prior distribution.
In R this is implemented as ’gam’ in the ’mgcv’ library. gam(y~x1+s(x2))
Says that covariate x1 will be included linearly while covariate x2 will be given a
generalized additive treatment.
When the number of covariates is high
compared to the number of data
Sometimes we know that there ought to be a relationship between response y
and covariates x1,…,xk. But if the number of measurements is low, it can be
difficult to get reliable estimates for the regression. When n<k, it’s even
impossible!
There are several ways out of this:
• Principal component regression (PCR). Do a principal component analysis
on the covariates (which decomposes the variance into the most important
direction, the second most important etc). Use these components as
covariates, adding one and one component at a time.
• Partial Least Squares (PLS): Similar to PCR but decomposes the covariates in
components that are also correlated to the response.
• Ridge regression. Perform the regression with a penalty term proportional to
the sum of the square regression parameters. Equivalent to having a Bayesian
normally distributed prior.
• Lasso regression. Perform the regression with a penalty term proportional to
the sum of the absolute values of the regression parameters. Equivalent to
having a Bayesian exponentially distributed prior.
• Bayesian regression with informative priors.
Regression between time series –
a bad idea
If we wish to do regression of (for instance) a discharge time series, with another such
series as covariate, we run into difficulties. The model assumptions behind regression
(more specifically that of independent noise in the response) is typically no longer
available. The estimates will be unbiased, but uncertainty and model choice criterions
can be extremely erroneous. Typically, the uncertainty will be severely underestimated,
because we are pretending to have much more independent information that we’ve
actually got.
Here are two independently
simulated time series. If we
plot one against the other,
they might look like they are
dependent. A linear
regression ”confirms” this.
But this is caused by both
series being dependent in
time.
Result from R: summary(lm(x2~x1)): x1
-0.47232 0.04747 -9.95 < 2e-16 ***
But we know the series are not correlated! Ways of dealing with this: Take averages over
large enough time span that the time correlation disappears or do time series analysis.
Time series analysis
Statistical time series are data in time, where there is some kind of
dependency between what happens at a point in time and what happens
next.
Examples: discharge, reservoir volume, sediment transport,
precipitation…
PS: If you look at such data with coarse enough time resolution, the
dependencies might become negligible, but then you might be left with
very little data!
If time dependency is not handled,
uncertainties tend to be underestimated
and model choice methods can’t be
trusted.
Important concept: Stationarity.
A time series model is stationary if all
marginal and joint probabilities are the
same independent of time.
When the model clashes with reality
– independent noise vs time series
Here I have simulated “water temperature” with
expectation =10.
Assume known standard deviation, =2. Wish to
estimate  and test =10.
• Model 1, independence: Ti=+i, i~N(0,1) i.i.f.
 The graph disagrees with this assumption…
 Estimate: ˆ  x  11.4, sd ( ˆ )  s / n  0.2
 95% conf. int. for : (11.02,11.80). =10 rejected
with 95% confidence!
• Model 2, autoregressive model with expectation , standard deviation 
and auto-correlation a.
 Linear dependency between temperature one day and the next.
1 a
 Estimate: ˆ  x  11.4, sd ( ˆ )  s
 1.4
n 1 a
 95% conf. int. for : (8.7,14.10). =10 not rejected.
Time series – diagnostic plots
1.
2.
3.
Auto-correlation. This is a plot that
shows the correlation between the
value at one time step and the next, the
second next, the third next etc (this is
called the lag). Usually this will
decrease with the lag, but seasonality
can cause problems. Winter value are
typically negatively correlated to
summer values and positively
correlated to values the previous winter.
Cross-correlation plots. When you want
to see the linear dependency of one
time series on with another.
Fourier analysis. This decomposes a
time series into sine/cosine-functions
with different periodicity. Time series
with seasonality will have a strong top
for the year period.
Diagnostics and seasonality
Many hydrological time series has seasonality. One should however
be able to ask what the nature of the time series is after one has
taken this into account.
In the START system, there is an option called ”avvik fra normal
årsvariasjon”, which subtracts the yearly mean and divides by the
yearly standard deviation. Thus one can look at (and model) the autocorrelation after this deterministic trend has been removed.
Without such an operation, and analysis of temperature data will
typically give a characteristic correlation time (the time for the
correlation to drop by a factor of 1/2) of several years. After the
operations, this characteristic time will typically be in a manner of
days or weeks instead. What this means is that the information that it
was unusually hot for the season a couple of weeks ago gives little
information about what to expect today.
The standard time series tool box:
ARIMA models
There exists an arsenal of statistical time series models called the ARIMA
models. These models are made by combining auto-regression (AR), integration
(I) and moving averages (MA).
AR models: These are models where the next value depends linearly on a set of
previous values. For instance, in the AR(1) model, one value depends on it’s past
only through the previous value (this is also known as a Markov chain):
xt  xt 1  (1   )    t where  t ~ N (0,1) is independen t noise
MA models: Models based on moving averages of noise:
xt   t  1 t 1  ...   p t  p where  t ~ N (0,  ) is independen t noise
Integrated models: Instead of modelling the original time series, it is the
difference from one time step to the next that is modelled: yt  xt  xt 1 . This is
done for time series that are not stationary, in the hope that this will render the
model stationary.
Season dependency: There is also a seasonal ARIMA, where the usual ARIMA
terms looks at values one or more years back in time rather than one or more
time steps.
More diagnostics
An MA model will have auto-correlation plots that suddenly dies out
for lags beyond the size of the moving average window. So, if the
auto-correlation completely disappears after k time steps, one have
a MA(k) model.
An AR model can be examined by a similar plot where the autocorrelation for one lag is removed before looking at the next.
This is a called a partial
auto-correlation plot
(pacf). It will suddenly
drop to zero after k lags
for a AR(k) series.
Here is an example,
using an AR(1) model:
Stochastic processes
 Processes are collections of stochastic variables that has dependency
structure among themselves and that are ordered chronologically in
time. Processes forms statistical models for time series.
 Ex: Water temperature, discharge, precipitation, flood events, a series
of dice throws, the number of wolves in Norway, the evolution of the
size of an organism, the organization structure of NVE.
 Some processes can be natural to model in discrete time (annual
discharge maxima, dice throws). Other processes might be more
natural to model in continuous time (discharge, the number of
wolves in Norway, the evolution of the size of an organism). Some
times you can choose whatever you feel is most convenient.
 Just as distributions can have free parameters, so can processes
(think back to the examples of the Bernoulli and Poisson process).
Most of the usual suspects of the distribution families are associated
with various processes.
Time series modelling –
Markov chains
A Markov chain is a process where the state at one time depends on
previous history only through the most recent past state:
x1
x2
x3
x4
x5
x6
…..
xn
I would argue that if you do not have a Markov chain model, you have not
sufficiently described your state space. (For instance, if you want to model
the position of a particle as a function of time, you don’t want to model just
the position but also the velocity.)
If you start with the general expression for the likelihood of dependent stuff,
this simplifies considerably with Markov chains:
f ( x1 , x2 , x3 , x4 ,, xn ) 
f ( x1 ) f ( x2 | x1 ) f ( x3 | x1 , x2 ) f ( x4 | x1 , x2 , x3 )  f ( xn | x1 , x2 , x3 , x4 ,, xn 1 ) 
f ( x1 ) f ( x2 | x1 ) f ( x3 | x2 ) f ( x4 | x3 )  f ( xn | xn 1 )
If the transition probabilities, f(xt|xt-1), are the same for all time points, then
the likelihood simplifies even more. If also the marginal distribution f(xt)
stays the same, the process is stationary.
Hidden Markov chains

Hydrological and meteorological states in nature have an element
of stochasticity (non-predictability). Hopefully, they can be
modelled as Markov chains (with enough relevant state variables).
L
D
However, the data we receive are not directly the state of nature,
but fallible measurements on these states. The state itself is thus
a hidden (latent) set of variables.
Assuming independent measurement noise, the dependency
structure looks like this:
time
State:
Observations:
x1
x2
x3
xn
y1
y2
y3
yn
Example of discrete time Markov chains
• Random walk: xt=xt-1+t where t is independent
noise(t~N(,) typically). Note that this
process is not stationary, since we are all the
time adding noise. The variance increases
linearly with time. Since t does not need to
have zero expectancy, one can also have a linear
trend in the expectancy of the process.
• Autoregressive model, AR(1); xt=(1-a)+axt-1+t
where t typically is standard normally
distributed and -1<a<1. if one starts off
with x1 ~ N ( , / 1  a 2 ) then the distribution at any
later time will be the same. The marginal
distribution will in any case converge towards
this
• Autoregressive model, AR(k), k>1:
xt=(1-a1-a2-…-ak)+a1xt-1+ a2xt-2+…+akxt-k+t
With some restrictions, this can also be a
stationary process. It is a Markov chain, since
(xt,…,xt-k+1) is expressed through(xt-1,…,xt-k). It is
also an example of a vector process.
Example of discrete time Markov chains (2)
• Correlated autoregressive processes: One can
expand AR(1) to a vector process of two or more
different processes having correlated noise:
x t    A( x t 1   )   t der  t ~ N (0, )
where A is a diagonal matrix with individual autocorrelations and  is a covariance matrix.
Her e x(black) and y(blue)
has noise correlation 0.8.
• Regressive (causal) cross terms:
xt   x  a x ( xt   x   ( yt   y ))   x t( x )
yt   y  a y ( yt   y )   x t( y )
• Both these and AR(k) can be generalized to:
x t    A( x t 1   )   t der  t ~ N (0, )
where A is now a general matrix.
x=black, y=blue. Note that y must spend
some time above it’s expectation before
x reacts by climbing up.
Continuous time Markov chains – stochastic
differential equations
A differential equation gives you a function in
time. A stochastic differential equation is
similar but has some elements of stochasticity
(thus making a stochastic process in
continuous time).
From this you can make continuous time
expansions of what was seen on the previous
slides (see figures).
Mathematically, the continuous time parent of
the AR(1) model (called the OrnsteinUhlenbeck process) looks like this:
dx(t )  ( x(t )   )dt / t   2 / t dBt
Wiener process (random walk)
1.96 

t
Ornstein-Uhlenbeck process
-1.96 
Correlated OU.
Causal model (black
reacts to red)
Hidden Markov chains (2)
time
State:
Observations:
x1
x2
x3
xn
y1
y2
y3
yn
Hidden Markov chains have two ingredients:
a) The System equation (SE), telling you how the hidden Markov chain works;
f(xk|xk-1)  k.
b) The observational equation (OE), which tells how the observations are related to
the state f(yk|xk).
Starting at the start and progressively using the SE, the OE, the law of total probability
and Bayes formula, you can get inference for the state given the observations so far
and also the likelihood. This is called filtering.
Once you have that, you can also work backwards and get inference for the state
given all the observations.
The Kalman filter
A Kalman filter is a way of analytically do all the filtering work analytically, but only if you
have normal observational noise and normal SE with linear updates:
a) SL: x k  Fk x k 1  m k 1   k where  k ~ N (0, Qk )
where  k ~ N (0, Rk )
b) OL: yk  H k x k   k
Note that all the models I have outlined so far, is on this form!
All the steps in the filtering can be done analytically. Keep in mind that the normal
distribution is specified by it’s expectancy and variance, so only that is needed.
1. You can calculate the mean and variance of xk|yk-1,…,y1,
The first
given the filtering from the previous time step, k-1.
application of the
2. You can then find the mean and variance of the next
Kalman filter was
observation, given the previous ones: yk|yk-1,…,y1. This
the Apollo project!
gives you the update likelihood.
3. You can then find the mean and variance of the state
given all the observations including this one: xk|yk,…,y1.
You then step back and repeat for observations k+1. The
expressions for these means and a variances can be found in
many text books, on Wikipedia and in my previous NVE
course.
Example of using the Kalman filter and
Kalman smoother for interpolation
In this example a set of very close temperature series has
been fetched. The correlated noise model is used for the
hidden Markov chain.
Some data was artificially removed (so I have the true
values), in order for this method to interpolate the values. The
Kalman filter was used for calculating likelihoods and thus
finding the best parameter set(s).
The plots show the Kalman smoothing, both the expectancy and the
credibility intervals (thus the variance). Since the model allows
correlations, one station gives information about the other. it can deal
with different stations falling out at different times. Two or even all
three stations can be out at any given time. Where all stations have
fallen out, the uncertainty ”bubbles out”.
Non-linear models and partikkel
filters
A better model for discharge than a
linear model can be made by a (simple)
hydrological model. Assume are
humidity to be a linear OU process +
seasonaily and area precipitation a
thresholded version of that. Put that as
input into a lake and use a rating curve
+ mass balance to get discharge out.
This model can be reduced to two
component, a linear stochastic
differential equation and a non-linear
ordinary differential equation.
Non-linear models can’t be solved by
the Kalman filter. Alternatives: particle
filter (general method based on
simulation), extended Kalman filter
(linearization).
When using particle filtering on
Farstad, the first artificial gap (see extra
exercise 7) is filled out in a much more
reasonable manner.
This is heavy machinery, though!
Spatial models and time-space fields
Interpolation and extrapolation is something
that is wanted in space as well as time.
If you have a model for spatial dependencies,
you can adapt it to data using statistics,
estimate unmeasured places and assess the
uncertainty of such estimates.
Such models can be discrete or continuous.
”Kriging” is an often used method, where you
assume a function form of the
dependency-structure (by use of so-called
semivariograms) and bin and count your
data according to comparison between
model and data. You then do regression
between the derived data and your model.
Alternative: ML or Bayesian analysis on the
dependency structure itself. (INLA)
Expansion: space-time fields, where you are
looking at dependencies both in space and
time. Thus you can interpolate
unmeasured combination of space and
time. (Ex: precipitation, temperature).
Download