How to get data and model to fit together?

Experimental planning
Statistical methods can be used to plan one's experiments better:
• You might want to be able to detect (with 95% confidence) a difference larger than a given value between two sets of mean precipitation, and to do so with a given probability (the power). If so, plot the power as a function of the data size and see where the rejection probability (power) rises above your wanted level.
• If you want to make enough stage-discharge measurements that the uncertainty of the rating curve exponent ($b$) is less than 0.1 with 90% probability, you need either an analytical expression for the uncertainty as a function of the data size and the parameter values, or a simulation.

When model + methodology clashes with reality
We wish to have the relationship between stage ($h$) and discharge ($Q$) in the following form:
$$Q = C\,(h-h_0)^b$$
where $h_0$ is the zero plane, $b$ gives the shape of the river profile and $C$ has to do with the width of the river. This is fitted using a set of stage-discharge measurements.
With maximum likelihood (ML) estimation you get infinite estimates for some datasets! If you stop the ML optimization at any given time, the fit is good but the parameter values are unreasonable!
[Figure: sketch with the labels $Q$, $b$, $h_0$ and the datum ($h=0$).]
Frequentist estimation does not have a way to encode what constitutes reasonable and unreasonable parameter values. Bayesian statistics, on the other hand…

Schools of statistics – Bayesian statistics
Everything about our knowledge concerning unknown quantities (parameters and models) is handled using probability theory.
$$f(\theta\mid D) = \frac{f(D\mid\theta)\,f(\theta)}{f(D)}$$
Here $f(\theta)$ is the prior distribution, $f(D\mid\theta)$ is the data likelihood and $f(\theta\mid D)$ is the posterior distribution.
Central in this is Bayes' formula. When used for parametric inference, Bayes' formula allows you to switch between the distribution of the data given the parameters (the likelihood) and the distribution of the parameters given the data (the posterior distribution).
Estimates (single values) are no longer the focus; the posterior distribution is! If you want estimates, you can take the mean, median or mode (peak) of the posterior distribution.

Bayesian statistics – a medical warm-up
Imagine a sickness with a medical test that always gives a positive indication if you have that sickness. It is quite accurate, giving false positives in only 1% of the cases where the patient doesn't have the sickness. The sickness is however rare; only one in a thousand has it. If you test positive, what is the probability that you have the sickness?
$$\Pr(\text{sick}\mid\text{positive test}) = \frac{\Pr(\text{positive test}\mid\text{sick})\Pr(\text{sick})}{\Pr(\text{positive test})} = \frac{\Pr(\text{positive test}\mid\text{sick})\Pr(\text{sick})}{\Pr(\text{positive test}\mid\text{sick})\Pr(\text{sick}) + \Pr(\text{positive test}\mid\text{healthy})\Pr(\text{healthy})} = \frac{100\%\cdot 0.1\%}{100\%\cdot 0.1\% + 1\%\cdot 99.9\%} \approx 9\%$$
There is thus only a 9% chance that you have the sickness, given that you test positive! What is happening?

Bayesian statistics – a graphic medical warm-up
One thousand people before the test, represented by small circles.
[Figure: one thousand circles, marked as sick or healthy.]

Bayesian statistics – a medical warm-up (3)
After the test, among the ones testing positive there remain one sick and about 10 healthy persons:
[Figure: the circles that tested positive, marked as sick or healthy.]
The probability that you have the sickness has increased dramatically, but still ten out of eleven will be healthy even though they have tested positive. Only about 9% of those who tested positive actually have the sickness.
A positive test is thus evidence (information increasing the probability) for the sickness, but not such strong evidence that we believe it's more likely than not that you have the sickness.
A naive frequentist doing model testing would say that the probability of testing positive when healthy (1%) is less than the usual significance level (5%), and that all who tested positive thus have the sickness with 95% confidence.
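The arithmetic of the warm-up can be checked with a few lines of code. The sketch below is in plain Python, and the function name `posterior_sick` and its argument names are invented for this illustration. It applies Bayes' formula with the two hypotheses "sick" and "healthy", using the numbers stated above.

```python
def posterior_sick(prior_sick, p_pos_given_sick, p_pos_given_healthy):
    """Pr(sick | positive test) via Bayes' formula with two hypotheses."""
    joint_sick = p_pos_given_sick * prior_sick
    joint_healthy = p_pos_given_healthy * (1.0 - prior_sick)
    # The denominator is Pr(positive test), by the law of total probability.
    return joint_sick / (joint_sick + joint_healthy)


# Numbers from the warm-up: prevalence 0.1%, no false negatives, 1% false positives.
p_sick = posterior_sick(prior_sick=0.001,
                        p_pos_given_sick=1.0,
                        p_pos_given_healthy=0.01)
print(f"Pr(sick | positive test) = {p_sick:.3f}")  # prints 0.091
```

Raising the prevalence (`prior_sick`) or lowering the false positive rate shows how quickly the posterior probability changes with either input.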
An experienced frequentist will call the state of the patient a hidden variable rather than a parameter or model, and then proceed using Bayes' formula.

Prior knowledge – prior distribution
A prior distribution should summarize the knowledge we have concerning the model before the data arrives.
Typically, one first chooses a parametric distribution family, for convenience and because it has the right characteristics with respect to the nature of the parameter. Since these distributions again have parameters, these are called hyperparameters. If one suspects that this choice can influence the results, one tries several candidate distributions (robustness analysis).
One then adapts the hyperparameters to whatever specific information one has. For instance, one can decide on a 95% credibility interval (an interval encompassing 95% of the probability) for the parameter and adjust the hyperparameters until the prior distribution reproduces it.
Common mistake: Looking at the data to determine what a reasonable prior would be. This is prior-data feedback, an example of circular reasoning. It can easily give unreasonable indications of uncertainty and unreasonable model choices.

Prior knowledge – prior distribution (2)
Prior distributions are at first glance purely subjective, but can be made acceptable to others by:
a. Incorporating common knowledge (including previous data) concerning the field of interest (intersubjectivity).
b. Looking at the variations that are in nature itself. For instance, for hydrological stations, what is the typical range of stage-discharge rating curve parameters? Perhaps one can find “nature's own prior distribution”.
c. Using so-called non-informative prior distributions.
PS: Such “distributions” are often not proper distributions. For instance, there does not exist a probability distribution that gives equal probability to all numbers on the real line. Still, improper prior distributions can give proper posterior distributions.
PSS: Do not use this trick when doing model comparison!

Bayesian statistics – distributions
Bayes' formula (only one single model here):
$$f(\theta\mid D) = \frac{f(D\mid\theta)\,f(\theta)}{f(D)}$$
One starts off the analysis with two things:
1. A model that says how the data was produced and which parameters characterize this distribution. This is the likelihood: $f(D\mid\theta)$.
2. A prior distribution, $f(\theta)$. It summarizes our pre-knowledge concerning the parameters.
From this, one can calculate the following:
• The posterior distribution: $f(\theta\mid D)$. This summarizes our state of knowledge after the data has been handled. If you want estimates, you get them from this (means, medians or modes).
• Distributions of derived quantities:
$$f(h(\theta)\mid D) = \int f(h(\theta)\mid\theta)\,f(\theta\mid D)\,d\theta$$
For instance: discharge at a given stage, when $Q(h)=C(h-h_0)^b$.
• A prior prediction distribution, called the marginal likelihood or the model likelihood. $f(D)$ gives the probability of getting data outcomes unconditioned on the parameter values (only conditioned on our pre-knowledge). Used in model comparison.
$$f(D) = \int f(D\mid\theta)\,f(\theta)\,d\theta$$
• A posterior prediction distribution, $f(D_{new}\mid D)$, the probability for new data outcomes, given the old data. (This is an example of a derived quantity.) This thus takes into account the parameter uncertainty after the data has been handled.
$$f(D_{new}\mid D) = \int f(D_{new}\mid\theta, D)\,f(\theta\mid D)\,d\theta$$
PS: An old posterior distribution will be the prior distribution when we want to handle new data. The old posterior prediction distribution will be the new prior predictive distribution.

Bayesian statistics – comparison of probabilities
Bayes' formula:
$$f(\theta\mid D, M) = \frac{f(D\mid\theta, M)\,f(\theta\mid M)}{f(D\mid M)}$$
We can see whether a parameter value increases in probability relative to another parameter value:
$$\frac{f(\theta_1\mid D, M)}{f(\theta_2\mid D, M)} = \frac{f(D\mid\theta_1, M)\,f(\theta_1\mid M)/f(D\mid M)}{f(D\mid\theta_2, M)\,f(\theta_2\mid M)/f(D\mid M)} = \frac{f(D\mid\theta_1, M)\,f(\theta_1\mid M)}{f(D\mid\theta_2, M)\,f(\theta_2\mid M)}$$
The parameter value $\theta_1$ increases in probability relative to $\theta_2$ if $f(D\mid\theta_1,M) > f(D\mid\theta_2,M)$, i.e.
if the data is more probable for parameter value $\theta_1$ than for $\theta_2$.
The same goes for models:
$$\frac{\Pr(M_1\mid D)}{\Pr(M_2\mid D)} = \frac{f(D\mid M_1)\Pr(M_1)/f(D)}{f(D\mid M_2)\Pr(M_2)/f(D)} = \frac{f(D\mid M_1)\Pr(M_1)}{f(D\mid M_2)\Pr(M_2)}$$
A model increases in probability relative to another if the data is more probable (irrespective of the parameter values) for that model than for the other, $f(D\mid M_1) > f(D\mid M_2)$.
Most importantly: One does not gain anything from absolute probabilities. It's only by comparing probabilities that you learn something!

Bayesian statistics – model comparison
Technically, we do model comparison by using Bayes' formula:
$$\Pr(M\mid D) = \frac{f(D\mid M)\Pr(M)}{f(D)}$$
The engine in this inference is the marginal likelihood (prior predictive distribution) $f(D\mid M)$. When we compare these, we can get evidence for one model or the other. Since prediction strength is the key, overcomplicated models (having larger parameter uncertainties) are naturally penalized without having to do any extra work!
Ex: Extrasensory perception. A subject was asked whether the experimenter held his hand over the subject's right or left hand, and gave 18 correct answers out of 30 questions. Assuming independence of answers, we get the binomial distribution with either $p=0.5$ (no extrasensory perception) or an unknown, uniformly distributed success rate (extrasensory perception). One can show that the prior predictive distribution under the latter model is also uniform, giving equal probability to all outcomes. Any outcome between 11 and 19 will be evidence for $p=0.5$ (see plot). 18 correct answers are thus more likely with random guessing than with extrasensory perception.
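The two marginal likelihoods in the extrasensory perception example can be computed directly. The following is a minimal sketch in plain Python, and the variable names are invented for this illustration. It relies on the fact stated above that a uniform prior on the success rate gives a uniform prior predictive distribution, so every outcome from 0 to 30 has probability 1/31.

```python
from math import comb

n_questions, n_correct = 30, 18


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Model "random guessing": p = 0.5, no free parameters.
ml_guessing = binom_pmf(n_correct, n_questions, 0.5)

# Model "extrasensory perception": p ~ Uniform(0, 1). Integrating the binomial
# over p gives probability 1/(n+1) for every outcome k = 0, ..., n.
ml_esp = 1.0 / (n_questions + 1)

bayes_factor = ml_guessing / ml_esp

# Outcomes that are more probable under guessing than under the uniform model.
favours_guessing = [k for k in range(n_questions + 1)
                    if binom_pmf(k, n_questions, 0.5) > ml_esp]

print(f"f(D | guessing) = {ml_guessing:.4f}")
print(f"f(D | ESP)      = {ml_esp:.4f}")
print(f"ratio           = {bayes_factor:.2f}")
print(f"outcomes favouring guessing: {favours_guessing[0]} to {favours_guessing[-1]}")
```

With equal prior probabilities for the two models, the printed ratio is also the posterior odds in favour of random guessing.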
[Figure: prior predictive distribution for $p=0.5$ (red) and $p$ unknown (blue).]

Bayesian model average
One can make distributions of any derived quantity, unconditioned on the parameters (prior and, in this case, posterior prediction distributions):
$$f(h(\theta)\mid D, M) = \int f(h(\theta)\mid\theta, M)\,f(\theta\mid D, M)\,d\theta$$
(from the law of total probability).
Example: Stage-discharge rating curve conditioned only on the data and the number of segments, not the rating curve parameters.
In the same fashion, one can find the distribution of a derived quantity even unconditioned on the model:
$$f(h\mid D) = \sum_{M} f(h(\theta)\mid D, M)\,\Pr(M\mid D)$$
Example: The stage-discharge rating curve given the data (but conditioned neither on the parameters nor on the number of segments).

Bayesian vs frequentist – the pragmatic aspect
When the model complexity is below a certain threshold, frequentist methods are typically easier. Above that threshold, Bayesian methods become easier.
[Figure: work as a function of model complexity, for frequentist and Bayesian methods.]

Simulation and the law of large numbers
Assume you are interested in the properties of a stochastic variable (probabilities, mean, quantiles, standard deviation etc.). Assume further that you cannot calculate these things analytically. What you can do, however, is sample from that variable. With enough samples (an ensemble), you can estimate probabilities, means, quantiles and standard deviations.
Ex:
• Calculate the probability of getting Yatzy from an algorithm for handling dice throws and the rules of Yatzy.
• Estimate the probability of an error situation in a production system, given the error rates of each component of that system.
• Calculate the expected discharge from an ensemble of equally probable weather forecasts.
• Find the number of data necessary to decrease the uncertainty of a parameter below a given value with a given probability.
• Find the properties of the posterior distribution given samples from it (via MCMC sampling).
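As an illustration of the first example, the sketch below estimates the probability of five of a kind by simulation, in plain Python. The rules used are an assumption made for this illustration: five dice, at most three throws, and a player who always keeps the most common face and rethrows the rest. The function and variable names are hypothetical.

```python
import random
from collections import Counter


def yatzy_in_three_throws(rng):
    """Play one turn; return True if all five dice end up showing the same face."""
    dice = [rng.randint(1, 6) for _ in range(5)]
    for _ in range(2):  # at most two rethrows after the first throw
        face, count = Counter(dice).most_common(1)[0]
        if count == 5:
            return True
        # Keep the dice showing the most common face, rethrow the others.
        dice = [face] * count + [rng.randint(1, 6) for _ in range(5 - count)]
    return len(set(dice)) == 1


rng = random.Random(1)  # fixed seed so that the run is reproducible
n_sim = 50_000
hits = sum(yatzy_in_three_throws(rng) for _ in range(n_sim))
estimate = hits / n_sim
# Standard error of a proportion estimated from independent simulations.
std_error = (estimate * (1 - estimate) / n_sim) ** 0.5
print(f"Estimated probability: {estimate:.4f} (standard error {std_error:.4f})")
```

The standard error shrinks as one over the square root of the number of simulations, which is the law of large numbers at work.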
Bayesian statistics – numerical methods: MCMC
Reminder, Bayes' formula (for only one model):
$$f(\theta\mid D) = \frac{f(D\mid\theta)\,f(\theta)}{f(D)}$$
Marginal distribution:
$$f(D) = \int f(D\mid\theta)\,f(\theta)\,d\theta$$
This rascal is problematic. Not all integrals can be calculated analytically.
A normalization constant is a number in a distribution which does not depend on whatever you are taking the distribution over (in this case the parameter set, $\theta$). In this case, $f(D)$ is an unknown normalization constant.
A Markov chain (more about that later) is a time series where the values “now” depend only on the previous value. Some such time series stabilize to some distribution when run for long enough.
It is possible to make a Markov chain whose stationary distribution equals the distribution you're after, without knowing the normalization constant. This is called MCMC (Markov chain Monte Carlo).
WinBUGS is a system which automatically runs MCMC sampling given a model, a prior distribution and the data (alternative: make your own MCMC module in R).

Bayesian statistics – more MCMC
Generally, an MCMC routine goes like this:
1. Make a starting parameter set, $\theta_{old}$.
2. Find a way (a proposal distribution*) to sample a new parameter set given the old: $\theta_{new} \sim g(\theta_{new}\mid\theta_{old})$.
3. Accept the new parameter set with probability
$$\min\left\{1,\ \frac{f(\theta_{new}\mid D)}{f(\theta_{old}\mid D)} \Big/ \frac{g(\theta_{new}\mid\theta_{old})}{g(\theta_{old}\mid\theta_{new})}\right\}$$
and use the old set if not. PS: The normalization constant cancels in this ratio.
4. Go back to 2 as many times as you want.
* The proposal distribution determines how efficient the algorithm is.
[Figure: MCMC chain illustrating burn-in and spacing.]
Important concepts:
Burn-in: Number of samples needed before the time series converges towards the stationary distribution.
Spacing: Number of samples needed between those you keep, so that the kept samples are approximately independent.

Regression
Regression is when one stochastic variable (the response) depends on other variables (covariates / explanatory variables). A part of the variation in the response variable is thus explained by the variation in the other variables.
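Returning to the four-step MCMC routine listed above, here is a minimal sketch of it in plain Python (the slides suggest WinBUGS or R, so this is an alternative illustration, not the author's implementation). As an assumed target it uses the posterior of a binomial success rate with a uniform prior and 18 successes in 30 trials. The proposal is a symmetric Gaussian random walk, so the $g$ terms cancel and only the ratio of unnormalized posteriors is needed. The step size, burn-in and spacing values are arbitrary choices.

```python
import math
import random


def log_unnorm_posterior(p, k=18, n=30):
    """Log of likelihood times uniform prior, without the normalization constant."""
    if not 0.0 < p < 1.0:
        return -math.inf  # outside the support of the prior
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def metropolis(n_samples, burn_in=1000, spacing=5, step=0.1, seed=1):
    rng = random.Random(seed)
    p_old = 0.5                                   # step 1: starting value
    kept = []
    for i in range(burn_in + n_samples * spacing):
        p_new = p_old + rng.gauss(0.0, step)      # step 2: symmetric proposal
        log_ratio = log_unnorm_posterior(p_new) - log_unnorm_posterior(p_old)
        # Step 3: accept with probability min(1, ratio); otherwise keep the old value.
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            p_old = p_new
        # Discard the burn-in, then keep every `spacing`-th value.
        if i >= burn_in and (i - burn_in) % spacing == 0:
            kept.append(p_old)
    return kept                                   # step 4 is the loop itself


samples = metropolis(4000)
post_mean = sum(samples) / len(samples)
ordered = sorted(samples)
lower = ordered[int(0.025 * len(ordered))]
upper = ordered[int(0.975 * len(ordered))]
print(f"Posterior mean {post_mean:.3f}, 95% credibility interval ({lower:.3f}, {upper:.3f})")
```

For this particular target the exact posterior is a Beta(19, 13) distribution with mean 19/32, which gives a check on the sampler that is normally not available.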
Example: Body weight (response) versus height (covariate).
[Figure: weight plotted against height.]

Linear regression
A linear regression examines the linear relationship between the response and one or more covariates:
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$
• Note that the model is linear in the regression parameters, $\beta_0,\dots,\beta_p$, but not necessarily in the covariates. So the model $Y = \beta_0 + \beta_1 x + \beta_2 x^2$ is a linear model.
• The statistical model behind this is the following:
$$Y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \dots + \beta_p x_{p,i} + \varepsilon_i$$
where $\varepsilon_i \sim N(0,\sigma)$ is independent noise.

Linear regression – example with only one covariate
$$\text{weight}_i = a + b\cdot\text{height}_i + \varepsilon_i$$
The regression parameters, $a$ and $b$, can be fitted to the data using for instance ML estimation. The graph shows the fitted regression.
[Figure: weight against height with the fitted regression line.]
The model is weird though, since it allows for negative expected weights and weight measurements (because of the assumption of normality).
One can save the situation by doing a log-transform on both response and covariate. This means a power law on the original scale:
$$\text{weight}_i = A\cdot\text{height}_i^{\,b}\cdot E_i \quad\text{where } E_i \sim \log N(0,\sigma)$$
[Figure: weight against height with the fitted power law.]

Linear regression with only one covariate
A regression with only one covariate is easy to represent both mathematically, $Y=\alpha+\beta x$, and graphically. Some terminology:
• Fitted responses are the values of the regression line at the covariate values of the actual data.
• A residual is the difference between actual and fitted response.
[Figure: weight against height, showing an actual response, the fitted response and the residual at a covariate value $x$.]
For a single covariate, the correlation between actual and fitted response is equal to the correlation between (actual) response and covariate. The regression coefficient is related to the correlation in a simple manner:
$$\beta = \rho_{XY}\,\frac{\mathrm{sd}(Y)}{\mathrm{sd}(X)}$$
This also goes for the data estimates (estimated regression parameter vs empirical correlation and empirical standard deviations).

Multivariate linear regression – an example
If you have more than one covariate, this is not a problem for linear regression (though presenting the results in a graph is then problematic).
Example: Tree volume as a function of tree height and diameter. If we log-transform everything and do a linear regression, $\log(V_i)=\beta_0+\beta_1\log(D_i)+\beta_2\log(H_i)+\varepsilon_i$, that's the same as searching for the following expression on the original scale:
$$V_i = C\,D_i^{\beta_1} H_i^{\beta_2}\cdot E_i$$
[Figure: tree with diameter and height marked.]
In R we get the following output (the three rows are $\beta_0$, $\beta_1$ and $\beta_2$):

                Estimate Std. Error t value Pr(>|t|)
    (Intercept) -6.64580    0.81473  -8.157 9.23e-09 ***
    ld           1.98982    0.08026  24.793  < 2e-16 ***
    lh           1.11597    0.20791   5.368 1.14e-05 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.08275 on 27 degrees of freedom
    Multiple R-squared: 0.975, Adjusted R-squared: 0.9731
    F-statistic: 526.4 on 2 and 27 DF, p-value: < 2.2e-16

This tells us that the estimated relationship is $\log(V) = -6.65 + 1.99\log(D) + 1.12\log(H)$, or $V(D,H) \propto D^{1.99} H^{1.12}$.

Multivariate linear regression – more on the R output

                Estimate Std. Error t value Pr(>|t|)
    (Intercept) -6.64580    0.81473  -8.157 9.23e-09 ***
    logdiameter  1.98982    0.08026  24.793  < 2e-16 ***
    logheight    1.11597    0.20791   5.368 1.14e-05 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.08275 on 27 degrees of freedom
    Multiple R-squared: 0.975, Adjusted R-squared: 0.9731
    F-statistic: 526.4 on 2 and 27 DF, p-value: < 2.2e-16

• First column: the covariates (really the parameters $\beta_0$, $\beta_1$, $\beta_2$).
• Estimate: the ML estimates.
• Std. Error: the standard errors (standard deviation of the estimator).
• t value: estimate / standard error, i.e. how many standard deviations away from 0 the estimate is.
• Pr(>|t|): the p-value for the hypothesis $\beta=0$.
• R-squared is often called “goodness of fit”. It is the squared correlation between fitted response and actual response. It is also the fraction of the variance in the response explained by the regression. The closer to 1 this is, the better the fit.
• F-statistic: test of variance (ANOVA) of whether there is anything significant in the relationship between response and covariates.
Conclusion: $\beta_1=\beta_2=0$ can be ruled out here. But $\beta_1$ is very near 2 and $\beta_2$ is less than a standard error away from 1 also.
Thus the geometrical relationship $V \propto D^2 H$ cannot be ruled out.

ANOVA
Analysis of variance is performed in order to
a) check whether a continuous outcome is different for different categories (one-way ANOVA), when you have a single set of categories.
b) check whether it depends linearly or non-linearly (interaction) on two sets of categories (two-way ANOVA).
Technically it is a sub-method of linear regression, where all covariates are discrete. The tests are performed by comparing various ways to estimate the variance, from residuals and from category differences.

Linear regression – What happens when one runs amok in covariates
With the possibilities in linear regression, one can be tempted to just add more and more covariates. The fit will always improve.
As an example, let's put some higher order polynomial terms in the weight-to-height regression (weight $v$, height $h$):
$$v_i = \beta_0 + \beta_1 h_i + \beta_2 h_i^2 + \beta_3 h_i^3 + \beta_4 h_i^4 + \varepsilon_i$$
The fit will improve, but the ability of the regression to predict new data can easily decline. The relationship becomes more and more chaotic, because the parameter uncertainties are increasing.
[Figure: weight against height with a higher order polynomial fit.]

How to avoid running amok?
Strategies to avoid running amok in covariates:
1. Think about the nature of the data (are the quantities strictly positive?) and what you want to do with your regression.
2. Use hypothesis testing or other model choice techniques to limit the complexity. (PS: R reports p-values for all regression parameters.)
Point 2 can be done by…
a) starting with a simple model and adding the most significant covariates until no significant covariates remain.
b) starting with a sufficiently complex model and removing the most insignificant covariates until only significant covariates are left.
c) running through all regression models and calculating information criteria. (Not recommended when the number of possible covariates is large.)
d) using Bayesian methodology (similar to points a, b, c).
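The claim that the fit always improves while prediction of new data can decline is easy to try out on simulated data. The sketch below is plain Python with a small hand-written least-squares solver, and every name and number in it is an assumption made for this illustration: the true relationship is taken to be linear, 15 points are used for fitting and 1000 new points for checking predictions.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest order first) via the normal equations."""
    p = degree + 1
    xtx = [[sum(x ** (i + j) for x in xs) for j in range(p)] for i in range(p)]
    xty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(p)]
    return solve(xtx, xty)


def rss(coef, xs, ys):
    """Residual sum of squares of the polynomial with coefficients coef."""
    return sum((y - sum(c * x ** i for i, c in enumerate(coef))) ** 2
               for x, y in zip(xs, ys))


rng = random.Random(0)


def simulate(n):
    """Assumed truth: a straight line plus independent normal noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys


x_fit, y_fit = simulate(15)
x_new, y_new = simulate(1000)

results = []
for degree in range(1, 7):
    coef = polyfit(x_fit, y_fit, degree)
    fit_rss = rss(coef, x_fit, y_fit)
    new_mse = rss(coef, x_new, y_new) / len(x_new)
    results.append((degree, fit_rss, new_mse))
    print(f"degree {degree}: RSS on fitted data {fit_rss:.3f}, "
          f"mean squared error on new data {new_mse:.3f}")
```

The residual sum of squares on the fitted data can only go down as terms are added, because each model contains the previous one. The error on new data has no such guarantee, which is the reason for the model choice strategies listed above.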
Uncertainty
The estimators in the regression come with a certain uncertainty (standard error in frequentist theory or posterior distribution in Bayesian). This is reported by R.
When the confidence interval of a parameter encompasses zero, one cannot reject the hypothesis that the corresponding covariate has no effect.
Uncertainty in the regression parameters affects the uncertainty of the expected response as a function of the covariate(s):
$$\hat{Y} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \dots + \hat\beta_p x_p$$
[Figure: simulated dataset, weight against height, with bands for estimation uncertainty and prediction uncertainty.]

Prediction uncertainty
Predictions of new measurements have, in addition, the uncertainty of the measurement noise:
$$\hat{Y}_{pred} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \dots + \hat\beta_p x_p + \varepsilon$$
It is therefore important to distinguish between estimation uncertainty and prediction uncertainty in regression!

Residuals
A residual is the difference between the actual response and the fitted response. It's an estimate of the noise term. The residuals can give a hint about whether the model assumptions are valid or not.
1. A clear trend in the residuals against any covariate shows that the function itself is wrong.
2. A clear trend of the residuals in time suggests that one is dealing with time series or that there are gradual changes in important unmeasured covariates.
3. If the residuals do not appear to be normally distributed, a transformation might be in order or a completely different type of regression may be needed.
4. If the variation in the residuals has a trend (heteroscedasticity), the noise term is wrongly modelled. Remodelling or data transformation might be necessary.
[Figures: data with fitted regression, the corresponding residual plots, and a QQ plot of the residuals.]

Non-normal regression – Generalized linear models
Sometimes the nature of the response is such that the normal distribution just isn't appropriate. The prime example of this is with counting data.
• If your response is of the type “k outcomes of a particular type out of n trials for covariate x”, then a binomial model for the response is appropriate.
• If your response is of the type “k outcomes for covariate x” (no upper limit for k), then a Poisson model can be appropriate.
GLM models are made first by assigning a distribution (normal, binomial, Poisson). Then you transform the salient parameter of that distribution (expectancy, success rate, rate) to something that can take values on the whole real line (this transformation is called the link). That transformed parameter is then given as a linear model, $\beta_0+\beta_1 x_1+\beta_2 x_2+\dots+\beta_p x_p$.
Since this type of analysis is so common, it has a name (GLM) and ready-made methods in R (called ‘glm’). GLM with a binomial model is often called “logistic regression” (due to the standard transformation type), while GLM with the Poisson model is called “Poisson regression”.

Non-linear regression
Sometimes it's simply not reasonable to have a linear relationship between response and covariates. The nature of the data might suggest a different form. An example is stage-discharge rating curves with unknown zero plane:
$$Q = C\,(h-h_0)^b$$
If $h_0$ were known, a log-transform would make this into a linear relationship. But when you don't know $h_0$, the log-transformed equation is still non-linear:
$$q = a + b\,\log(h-h_0)$$
where $q=\log Q$ and $a=\log C$.
ML optimization is still possible, but only with numerical methods. In rating curve analysis, you can actually solve for $a$ and $b$ analytically, so that only $h_0$ is optimized numerically. For more complicated models, sophisticated optimization methods or MCMC may be necessary.
One danger with non-linear regressions is that the likelihood can have multiple peaks (multimodality). This is the case for multi-segmented rating curves.

Rating curve estimation at Gryta
Let's look at the station Gryta, without assuming $h_0=0$. We can use “brute force”, by looking at an interval of possible $h_0$ values going from $h_m$ to $h_m-100$ m in steps of 1 cm.
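The brute-force search can be sketched as follows, in plain Python. For each candidate $h_0$ the parameters $a$ and $b$ are found analytically by ordinary least squares on the log scale, and the resulting log-likelihood is recorded. The data are simulated with made-up parameter values (they are not the Gryta measurements), $h_m$ is taken to be the lowest measured stage, and the search range is shortened to 2 m. All of these, and all names in the code, are assumptions made for this illustration.

```python
import math
import random


def fit_for_fixed_h0(stages, log_q, h0):
    """Analytic least-squares (= ML) fit of q = a + b*log(h - h0) for a fixed h0."""
    x = [math.log(h - h0) for h in stages]
    n = len(x)
    mean_x, mean_q = sum(x) / n, sum(log_q) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxq = sum((xi - mean_x) * (qi - mean_q) for xi, qi in zip(x, log_q))
    b = sxq / sxx
    a = mean_q - b * mean_x
    sigma2 = sum((qi - a - b * xi) ** 2 for xi, qi in zip(x, log_q)) / n
    # Log-likelihood with a, b and the noise variance set to their ML values.
    loglik = -0.5 * n * (math.log(2.0 * math.pi * sigma2) + 1.0)
    return a, b, loglik


# Simulated stage-discharge measurements (hypothetical values).
rng = random.Random(42)
true_a, true_b, true_h0, noise_sd = math.log(10.0), 2.0, 0.30, 0.05
stages = [rng.uniform(0.5, 3.0) for _ in range(40)]
log_q = [true_a + true_b * math.log(h - true_h0) + rng.gauss(0.0, noise_sd)
         for h in stages]

# Brute force: step h0 downwards from just below the lowest stage, 1 cm at a time.
h_min = min(stages)
grid = [h_min - 0.01 * j for j in range(1, 201)]
best_h0 = max(grid, key=lambda h0: fit_for_fixed_h0(stages, log_q, h0)[2])
best_a, best_b, best_loglik = fit_for_fixed_h0(stages, log_q, best_h0)

print(f"h0 = {best_h0:.2f} m, C = {math.exp(best_a):.2f}, b = {best_b:.2f}, "
      f"log-likelihood = {best_loglik:.2f}")
```

Plotting the log-likelihood over the whole grid, rather than only reporting the maximum, also reveals whether the likelihood keeps improving as $h_0$ decreases.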
It looks like the log-likelihood (and thus the likelihood) is maximized for a value of $h_0$ close to zero. A closer look reveals that the optimal $h_0$ is +8 cm.
Please note the previously mentioned phenomenon that some likelihoods get better the lower the value of $h_0$.

Bayesian regression
Let's take another look at the station Gryta. Under Bayesian regression, pre-knowledge is assumed to exist. This can be retrieved from the collection of previously made rating curves (“nature's prior”). But for Gryta, we know that the datum is set so that $h_0 \approx 0$, and since it's a weir with a V notch, we know that $b \approx 2.5$ ought to be approximately true (from hydraulic theory).
In VFKURVE3, one sets the prior distribution (or the hyperparameters) in a separate window.
Note that in Bayesian statistics, there are fewer problems concerning the handling of multimodality. Simulation from the posterior distribution becomes slightly more difficult, but there are efficient ways of dealing with the problem.

Bayesian regression (2)
When one performs the analysis, the result is a lot of samples from the posterior distribution. In addition to estimates, you also get a notion of the parameter uncertainty.
For parameters where we have assigned sharp pre-knowledge, with most of the probability mass within a small interval, the posterior distribution will typically be inside that interval also. (If not, we have prior-data conflict.)
Since the parameters have a distribution, so does the rating curve. With lots of data and/or good prior knowledge, the curve uncertainty can get quite small.

Generalized additive models
Generalized additive models are models where the response is explained by the added effect of functions of each covariate:
$$y = \beta_0 + g_1(x_1) + \dots + g_k(x_k) + \varepsilon$$
The functions are not known but can be arbitrarily complicated splines. A penalty term for spline complexity is added to the likelihood.
This makes it a borderline Bayesian inference, since a penalty term functions in all respects like a prior distribution.
In R this is implemented as ‘gam’ in the ‘mgcv’ library.
    gam(y~x1+s(x2))
says that covariate x1 will be included linearly while covariate x2 will be given a generalized additive treatment.

When the number of covariates is high compared to the number of data
Sometimes we know that there ought to be a relationship between the response $y$ and the covariates $x_1,\dots,x_k$. But if the number of measurements, $n$, is low, it can be difficult to get reliable estimates for the regression. When $n<k$, it's even impossible! There are several ways out of this:
• Principal component regression (PCR). Do a principal component analysis on the covariates (which decomposes the variance into the most important direction, the second most important etc.). Use these components as covariates, adding one component at a time.
• Partial least squares (PLS). Similar to PCR, but decomposes the covariates into components that are also correlated to the response.
• Ridge regression. Perform the regression with a penalty term proportional to the sum of the squared regression parameters. Equivalent to having a Bayesian normally distributed prior.
• Lasso regression. Perform the regression with a penalty term proportional to the sum of the absolute values of the regression parameters. Equivalent to having a Bayesian double-exponential (Laplace) prior.
• Bayesian regression with informative priors.

Regression between time series – a bad idea
If we wish to do regression of (for instance) a discharge time series, with another such series as covariate, we run into difficulties. The model assumptions behind regression (more specifically that of independent noise in the response) typically no longer hold.
The estimates will be unbiased, but uncertainty and model choice criteria can be extremely erroneous.
Typically, the uncertainty will be severely underestimated, because we are pretending to have much more independent information than we've actually got.
Here are two independently simulated time series. If we plot one against the other, they might look like they are dependent. A linear regression “confirms” this. But this is caused by both series being dependent in time.
Result from R, summary(lm(x2~x1)):
    x1  -0.47232  0.04747  -9.95  < 2e-16 ***
But we know the series are not correlated!
Ways of dealing with this: Take averages over a large enough time span that the time correlation disappears, or do time series analysis.

Time series analysis
Statistical time series are data in time, where there is some kind of dependency between what happens at a point in time and what happens next.
Examples: discharge, reservoir volume, sediment transport, precipitation…
PS: If you look at such data with coarse enough time resolution, the dependencies might become negligible, but then you might be left with very little data!
If time dependency is not handled, uncertainties tend to be underestimated and model choice methods can't be trusted.
Important concept: Stationarity. A time series model is stationary if all marginal and joint probabilities are the same independent of time.

When the model clashes with reality – independent noise vs time series
Here I have simulated “water temperature” with expectation $\mu=10$. Assume known standard deviation, $\sigma=2$. We wish to estimate $\mu$ and test $\mu=10$.
• Model 1, independence: $T_i=\mu+\sigma\varepsilon_i$, $\varepsilon_i\sim N(0,1)$ i.i.d. The graph disagrees with this assumption…
Estimate: $\hat\mu = \bar{x} = 11.4$, $\mathrm{sd}(\hat\mu) = s/\sqrt{n} = 0.2$.
95% conf. int. for $\mu$: (11.02, 11.80). $\mu=10$ rejected with 95% confidence!
• Model 2, autoregressive model with expectation $\mu$, standard deviation $\sigma$ and auto-correlation $a$. Linear dependency between the temperature one day and the next.
Estimate: $\hat\mu = \bar{x} = 11.4$, $\mathrm{sd}(\hat\mu) = \dfrac{s}{\sqrt{n}}\sqrt{\dfrac{1+a}{1-a}} = 1.4$.
95% conf. int. for $\mu$: (8.7, 14.1). $\mu=10$ not rejected.

Time series – diagnostic plots
• Auto-correlation.
This is a plot that shows the correlation between the value at one time step and the value at the next, the second next, the third next etc. (the number of steps is called the lag). Usually the correlation will decrease with the lag, but seasonality can cause problems. Winter values are typically negatively correlated to summer values and positively correlated to values the previous winter.
• Cross-correlation plots. Used when you want to see the linear dependency of one time series on another.
• Fourier analysis. This decomposes a time series into sine/cosine functions with different periodicity. Time series with seasonality will have a strong peak at the one-year period.

Diagnostics and seasonality
Many hydrological time series have seasonality. One should however be able to ask what the nature of the time series is after one has taken this into account.
In the START system, there is an option called “avvik fra normal årsvariasjon” (deviation from normal annual variation), which subtracts the yearly mean and divides by the yearly standard deviation. Thus one can look at (and model) the autocorrelation after this deterministic trend has been removed.
Without such an operation, an analysis of temperature data will typically give a characteristic correlation time (the time for the correlation to drop by a factor of 1/2) of several years. After the operation, this characteristic time will typically be a matter of days or weeks instead. What this means is that the information that it was unusually hot for the season a couple of weeks ago gives little information about what to expect today.

The standard time series tool box: ARIMA models
There exists an arsenal of statistical time series models called the ARIMA models. These models are made by combining auto-regression (AR), integration (I) and moving averages (MA).
AR models: These are models where the next value depends linearly on a set of previous values.
For instance, in the AR(1) model, one value depends on its past only through the previous value (this is also known as a Markov chain):
$$x_t = a\,x_{t-1} + (1-a)\,\mu + \sigma\varepsilon_t \quad\text{where } \varepsilon_t \sim N(0,1) \text{ is independent noise}$$
MA models: Models based on moving averages of noise:
$$x_t = \varepsilon_t + \theta_1\varepsilon_{t-1} + \dots + \theta_p\varepsilon_{t-p} \quad\text{where } \varepsilon_t \sim N(0,\sigma) \text{ is independent noise}$$
Integrated models: Instead of modelling the original time series, it is the difference from one time step to the next that is modelled: $y_t = x_t - x_{t-1}$. This is done for time series that are not stationary, in the hope that this will render the model stationary.
Season dependency: There is also a seasonal ARIMA, where the usual ARIMA terms look at values one or more years back in time rather than one or more time steps.

More diagnostics
An MA model will have auto-correlation plots that suddenly die out for lags beyond the size of the moving average window. So, if the auto-correlation completely disappears after k time steps, one has an MA(k) model.
An AR model can be examined by a similar plot where the autocorrelation for one lag is removed before looking at the next. This is called a partial auto-correlation plot (pacf). It will suddenly drop to zero after k lags for an AR(k) series.
Here is an example, using an AR(1) model:
[Figure: diagnostic plots for an AR(1) series.]

Stochastic processes
Processes are collections of stochastic variables that have a dependency structure among themselves and that are ordered chronologically in time. Processes form statistical models for time series.
Ex: Water temperature, discharge, precipitation, flood events, a series of dice throws, the number of wolves in Norway, the evolution of the size of an organism, the organization structure of NVE.
Some processes can be natural to model in discrete time (annual discharge maxima, dice throws). Other processes might be more natural to model in continuous time (discharge, the number of wolves in Norway, the evolution of the size of an organism). Sometimes you can choose whatever you feel is most convenient.
Just as distributions can have free parameters, so can processes (think back to the examples of the Bernoulli and Poisson process). Most of the usual suspects of the distribution families are associated with various processes.

Time series modelling – Markov chains

A Markov chain is a process where the state at one time depends on previous history only through the most recent past state:

$x_1 \to x_2 \to x_3 \to x_4 \to x_5 \to x_6 \to \dots \to x_n$

I would argue that if you do not have a Markov chain model, you have not sufficiently described your state space. (For instance, if you want to model the position of a particle as a function of time, you don't want to model just the position but also the velocity.)

If you start with the general expression for the likelihood of dependent variables, this simplifies considerably with Markov chains:

$$f(x_1, x_2, \dots, x_n) = f(x_1)\,f(x_2 \mid x_1)\,f(x_3 \mid x_1, x_2)\,f(x_4 \mid x_1, x_2, x_3)\cdots f(x_n \mid x_1, \dots, x_{n-1})$$
$$= f(x_1)\,f(x_2 \mid x_1)\,f(x_3 \mid x_2)\,f(x_4 \mid x_3)\cdots f(x_n \mid x_{n-1})$$

If the transition probabilities, $f(x_t \mid x_{t-1})$, are the same for all time points, then the likelihood simplifies even more. If also the marginal distribution $f(x_t)$ stays the same, the process is stationary.

Hidden Markov chains

Hydrological and meteorological states in nature have an element of stochasticity (non-predictability). Hopefully, they can be modelled as Markov chains (with enough relevant state variables).

However, the data we receive are not directly the state of nature, but fallible measurements of these states. The state itself is thus a hidden (latent) set of variables. Assuming independent measurement noise, the dependency structure looks like this:

[Diagram, time running left to right. State: $x_1 \to x_2 \to x_3 \to \dots \to x_n$. Observations: $y_1, y_2, y_3, \dots, y_n$, where each $y_k$ depends only on $x_k$.]

Example of discrete time Markov chains

• Random walk: $x_t = x_{t-1} + \varepsilon_t$ where $\varepsilon_t$ is independent noise ($\varepsilon_t \sim N(\mu, \sigma^2)$ typically). Note that this process is not stationary, since we are adding noise all the time. The variance increases linearly with time.
Since $\varepsilon_t$ does not need to have zero expectancy, one can also have a linear trend in the expectancy of the process.

• Autoregressive model, AR(1): $x_t = \mu(1-a) + a\,x_{t-1} + \sigma\varepsilon_t$ where $\varepsilon_t$ typically is standard normally distributed and $-1 < a < 1$. If one starts off with $x_1$ normally distributed with expectancy $\mu$ and variance $\sigma^2/(1-a^2)$, then the distribution at any later time will be the same. The marginal distribution will in any case converge towards this.

• Autoregressive model, AR(k), k>1: $x_t = \mu(1 - a_1 - a_2 - \dots - a_k) + a_1 x_{t-1} + a_2 x_{t-2} + \dots + a_k x_{t-k} + \sigma\varepsilon_t$. With some restrictions, this can also be a stationary process. It is a Markov chain, since $(x_t, \dots, x_{t-k+1})$ is expressed through $(x_{t-1}, \dots, x_{t-k})$. It is also an example of a vector process.

Example of discrete time Markov chains (2)

• Correlated autoregressive processes: One can expand AR(1) to a vector process of two or more different processes having correlated noise:

$$\mathbf{x}_t = \boldsymbol{\mu} + A(\mathbf{x}_{t-1} - \boldsymbol{\mu}) + \boldsymbol{\varepsilon}_t, \quad \text{where } \boldsymbol{\varepsilon}_t \sim N(\mathbf{0}, \Sigma),$$

where $A$ is a diagonal matrix with the individual autocorrelations and $\Sigma$ is a covariance matrix. [Figure: here x (black) and y (blue) have noise correlation 0.8.]

• Regressive (causal) cross terms (writing $\beta$ for the strength of the link from y to x):

$$x_t - \mu_x = a_x\big(x_{t-1} - \mu_x + \beta\,(y_{t-1} - \mu_y)\big) + \sigma_x \varepsilon_t^{(x)}$$
$$y_t - \mu_y = a_y\,(y_{t-1} - \mu_y) + \sigma_y \varepsilon_t^{(y)}$$

• Both these and AR(k) can be generalized to:

$$\mathbf{x}_t = \boldsymbol{\mu} + A(\mathbf{x}_{t-1} - \boldsymbol{\mu}) + \boldsymbol{\varepsilon}_t, \quad \text{where } \boldsymbol{\varepsilon}_t \sim N(\mathbf{0}, \Sigma),$$

where $A$ is now a general matrix. [Figure: x = black, y = blue.] Note that y must spend some time above its expectation before x reacts by climbing up.

Continuous time Markov chains – stochastic differential equations

A differential equation gives you a function in time. A stochastic differential equation is similar but has some elements of stochasticity (thus making a stochastic process in continuous time). From this you can make continuous time expansions of what was seen on the previous slides (see figures).

Mathematically, the continuous time parent of the AR(1) model (called the Ornstein-Uhlenbeck process) looks like this:

$$dx(t) = -\big(x(t) - \mu\big)\,\frac{dt}{\Delta t} + \sigma\sqrt{\frac{2}{\Delta t}}\;dB_t$$

[Figure panels, with bands at ±1.96 standard deviations: Wiener process (random walk); Ornstein-Uhlenbeck process; correlated OU.]
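The link between the Ornstein-Uhlenbeck (OU) process and the AR(1) model can be illustrated by simulation. The sketch below is an illustration only and rests on an assumed parametrisation: expectancy `mu`, stationary standard deviation `sigma` and time scale `dt_char`, so that the correlation between values a time h apart is exp(-h/dt_char). Under that assumption the OU process sampled on any time grid is exactly an AR(1) process with coefficient a = exp(-h/dt_char). The function name `simulate_ou` and the use of NumPy are hypothetical choices, not something from the slides.

```python
import numpy as np

def simulate_ou(times, mu, sigma, dt_char, seed=0):
    """Simulate an OU process at the given (increasing) time points.

    Assumed parametrisation: stationary mean mu, stationary standard
    deviation sigma, correlation exp(-h / dt_char) at time distance h.
    """
    rng = np.random.default_rng(seed)
    times = np.asarray(times, dtype=float)
    x = np.empty(len(times))
    # Draw the first value from the stationary distribution N(mu, sigma^2).
    x[0] = mu + sigma * rng.standard_normal()
    for i in range(1, len(times)):
        h = times[i] - times[i - 1]
        a = np.exp(-h / dt_char)  # AR(1) coefficient for this time step
        # Exact transition: the noise variance sigma^2 * (1 - a^2) keeps the
        # marginal variance at sigma^2.
        x[i] = mu + a * (x[i - 1] - mu) + sigma * np.sqrt(1.0 - a**2) * rng.standard_normal()
    return x

# Example: half-unit time steps with time scale 5, so a = exp(-0.1).
t = np.arange(2000) * 0.5
x = simulate_ou(t, mu=3.0, sigma=2.0, dt_char=5.0)
print(x.mean(), x.std())
```

Because the transition is exact rather than an Euler approximation, the time points do not need to be equally spaced, which is one practical reason for formulating the model in continuous time.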
[Figure panel: causal model (black reacts to red).]

Hidden Markov chains (2)

[Diagram, time running left to right. State: $x_1 \to x_2 \to x_3 \to \dots \to x_n$. Observations: $y_1, y_2, y_3, \dots, y_n$.]

Hidden Markov chains have two ingredients:
a) The system equation (SE), telling you how the hidden Markov chain works: $f(x_k \mid x_{k-1})$ for all k.
b) The observational equation (OE), which tells how the observations are related to the state: $f(y_k \mid x_k)$.

Starting at the beginning and progressively using the SE, the OE, the law of total probability and Bayes formula, you can get inference for the state given the observations so far, and also the likelihood. This is called filtering. Once you have that, you can also work backwards and get inference for the state given all the observations.

The Kalman filter

A Kalman filter is a way of doing all the filtering work analytically, but only if you have normal observational noise and a normal SE with linear updates:
a) SE: $\mathbf{x}_k = F_k \mathbf{x}_{k-1} + \mathbf{m}_{k-1} + \boldsymbol{\varepsilon}_k$, where $\boldsymbol{\varepsilon}_k \sim N(\mathbf{0}, Q_k)$
b) OE: $\mathbf{y}_k = H_k \mathbf{x}_k + \boldsymbol{\delta}_k$, where $\boldsymbol{\delta}_k \sim N(\mathbf{0}, R_k)$

Note that all the models I have outlined so far are of this form! All the steps in the filtering can be done analytically. Keep in mind that the normal distribution is specified by its expectancy and variance, so only those are needed.

1. You can calculate the mean and variance of $x_k \mid y_{k-1}, \dots, y_1$, given the filtering from the previous time step, k-1.
2. You can then find the mean and variance of the next observation, given the previous ones: $y_k \mid y_{k-1}, \dots, y_1$. This gives you the update likelihood.
3. You can then find the mean and variance of the state given all the observations including this one: $x_k \mid y_k, \dots, y_1$.

You then go back to step 1 and repeat for observation k+1.

The first application of the Kalman filter was the Apollo project!

The expressions for these means and variances can be found in many text books, on Wikipedia and in my previous NVE course.

Example of using the Kalman filter and Kalman smoother for interpolation

In this example a set of temperature series from stations very close to each other has been fetched.
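The mechanics behind this kind of gap filling can be sketched with a one-dimensional Kalman filter that follows the three filtering steps listed above. This is a simplified illustration, not the multi-station model used in the example: it assumes a scalar state, constant coefficients `F`, `m`, `Q`, `H`, `R`, a prior `mean0`/`var0` for the state before the first observation, and missing observations coded as NaN. All names are hypothetical, and only the filter is shown, not the smoother.

```python
import numpy as np

def kalman_filter_1d(y, F, m, Q, H, R, mean0, var0):
    """Scalar Kalman filter with x_k = F*x_{k-1} + m + eps_k, y_k = H*x_k + delta_k.

    eps_k ~ N(0, Q), delta_k ~ N(0, R). Missing observations are NaN.
    Returns filtered means, filtered variances and the log-likelihood.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    filt_mean = np.empty(n)
    filt_var = np.empty(n)
    loglik = 0.0
    mean, var = mean0, var0
    for k in range(n):
        # Step 1: predict the state from the previous filtering result.
        pred_mean = F * mean + m
        pred_var = F * F * var + Q
        if np.isnan(y[k]):
            # No observation: the prediction is all we know, so the
            # uncertainty is not reduced at this time step.
            mean, var = pred_mean, pred_var
        else:
            # Step 2: predictive distribution of the observation, which
            # gives this time step's contribution to the likelihood.
            obs_mean = H * pred_mean
            obs_var = H * H * pred_var + R
            loglik += -0.5 * (np.log(2.0 * np.pi * obs_var)
                              + (y[k] - obs_mean) ** 2 / obs_var)
            # Step 3: update the state with the new observation.
            gain = pred_var * H / obs_var
            mean = pred_mean + gain * (y[k] - obs_mean)
            var = (1.0 - gain * H) * pred_var
        filt_mean[k] = mean
        filt_var[k] = var
    return filt_mean, filt_var, loglik

# A short series with a gap; the filtered variance grows inside the gap.
y = [1.0, 1.2, np.nan, np.nan, np.nan, 0.4]
means, variances, ll = kalman_filter_1d(y, F=0.9, m=0.0, Q=0.19, H=1.0, R=0.1,
                                        mean0=0.0, var0=1.0)
print(variances)
```

The returned log-likelihood is what one would maximise (or feed into a Bayesian analysis) to estimate the parameters. The growing variance inside a gap is the one-dimensional version of the uncertainty "bubbling out".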
The correlated noise model is used for the hidden Markov chain. Some data were artificially removed (so I have the true values), in order to let this method interpolate the values. The Kalman filter was used for calculating likelihoods and thus finding the best parameter set(s). The plots show the Kalman smoothing, both the expectancy and the credibility intervals (thus the variance).

Since the model allows correlations, one station gives information about the other. It can deal with different stations falling out at different times. Two or even all three stations can be out at any given time. Where all stations have fallen out, the uncertainty "bubbles out".

Non-linear models and particle filters

A better model for discharge than a linear model can be made from a (simple) hydrological model. Assume area humidity to be a linear OU process + seasonality, and area precipitation a thresholded version of that. Put that as input into a lake and use a rating curve + mass balance to get discharge out. This model can be reduced to two components, a linear stochastic differential equation and a non-linear ordinary differential equation.

Non-linear models can't be handled by the Kalman filter. Alternatives: particle filter (a general method based on simulation), extended Kalman filter (linearization).

When using particle filtering on Farstad, the first artificial gap (see extra exercise 7) is filled in a much more reasonable manner. This is heavy machinery, though!

Spatial models and time-space fields

Interpolation and extrapolation are wanted in space as well as in time. If you have a model for spatial dependencies, you can fit it to data using statistics, estimate values at unmeasured places and assess the uncertainty of such estimates. Such models can be discrete or continuous.
"Kriging" is an often used method, where you assume a functional form for the dependency structure (by use of so-called semivariograms) and bin and count your data by distance so that they can be compared with the model. You then do regression between the derived data and your model.

Alternative: ML or Bayesian analysis on the dependency structure itself (INLA).

Expansion: space-time fields, where you look at dependencies in both space and time. Thus you can interpolate to unmeasured combinations of space and time. (Ex: precipitation, temperature.)
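The "bin and count" step in kriging can be made concrete with an empirical semivariogram. The sketch below is an illustration under assumptions, not a full kriging implementation: it uses the standard estimator (half the mean squared difference between all pairs of measurements whose distance falls in a bin). The function name `empirical_semivariogram` and the binning by Euclidean distance are choices made here. Fitting a functional form to these binned values would be the subsequent regression step, which is not shown.

```python
import numpy as np

def empirical_semivariogram(coords, values, bin_edges):
    """Binned semivariogram: half the mean squared difference per distance bin.

    coords: (n, d) array of measurement locations.
    values: length-n array of measurements.
    bin_edges: increasing distances; bin b covers [bin_edges[b], bin_edges[b+1]).
    Returns (gamma, counts); gamma is NaN for bins without any pairs.
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    i, j = np.triu_indices(len(values), k=1)        # all pairs with i < j
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    half_sq_diff = 0.5 * (values[i] - values[j]) ** 2
    which_bin = np.digitize(dist, bin_edges) - 1    # -1 or n_bins means outside
    n_bins = len(bin_edges) - 1
    gamma = np.full(n_bins, np.nan)
    counts = np.zeros(n_bins, dtype=int)
    for b in range(n_bins):
        in_bin = which_bin == b
        counts[b] = in_bin.sum()
        if counts[b] > 0:
            gamma[b] = half_sq_diff[in_bin].mean()
    return gamma, counts

# Three stations on a line with values 0, 1 and 3.
gamma, counts = empirical_semivariogram([[0, 0], [1, 0], [2, 0]],
                                        [0.0, 1.0, 3.0],
                                        bin_edges=[0.5, 1.5, 2.5])
print(gamma, counts)  # bin 0: 1.25 from 2 pairs, bin 1: 4.5 from 1 pair
```

A semivariogram that increases with distance, as in this toy example, means that nearby measurements are more alike than distant ones, which is what makes spatial interpolation worthwhile.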