Lecture 8 – Time Series Analysis: Part I Basics
Dr. Qing He
qinghe@buffalo.edu
University at Buffalo
1

Introduction to Time Series Analysis
Background
* The analysis of experimental data that have been observed at different points in time leads to new and unique problems in statistical modeling and inference.
* The obvious correlation introduced by the sampling of adjacent points in time can severely restrict the applicability of many conventional statistical methods that traditionally depend on the assumption that these adjacent observations are independent and identically distributed.
* The systematic approach by which one goes about answering the mathematical and statistical questions posed by these time correlations is commonly referred to as time series analysis.
Time series in the transportation area:
* Traffic data (speed, volume, occupancy) reported by a fixed detector
* Gas price over time
* Passenger volumes at airport terminals
* ……
2

Classification of Time-Series Patterns
3

Classification of time series
‘Dimension’ of T
* Time, space-time
Nature of T
* Discrete
  • Equally spaced
  • Not equally spaced
* Continuous
  • Observed continuously
Dimension of X
* Univariate
* Multivariate
Memory types
* Stationary
* Nonstationary
Linearity
* Linear
* Nonlinear
4

Approaches to Time Series Analysis
The time domain approach
* motivated by the presumption that correlation between adjacent points in time is best explained in terms of a dependence of the current value on past values.
* Methods: ARIMA model, GARCH model, the state-space model, Kalman filter (short-term traffic forecasting)
The frequency domain approach
* assumes the primary characteristics of interest in time series analyses relate to periodic or systematic sinusoidal variations found naturally in most data.
* Methods: spectral analysis, wavelet analysis (real-time incident detection)
In many cases, the two approaches may produce similar answers for long series, but the comparative performance over short samples is better done in the time domain.
5

Time Series Statistical Models
We assume a time series can be defined as a collection of random variables indexed according to the order they are obtained in time. For example, we may consider a time series as a sequence of random variables, x1, x2, . . . , where the random variable x1 denotes the value taken by the series at the first time point, the variable x2 denotes the value for the second time period, and so on.
In general, a collection of random variables, {xt}, indexed by t, is referred to as a stochastic process. In this text, t will typically be discrete and vary over the integers t = 0, ±1, ±2, ..., or some subset of the integers.
We use the term time series whether we are referring generically to the process or to a particular realization.
6

Example 10.1 White Noise
A simple kind of generated series might be a collection of uncorrelated random variables, wt, with mean 0 and finite variance σ²w. The time series generated from uncorrelated variables is used as a model for noise in engineering applications, where it is called white noise; we shall sometimes denote this process as wt ∼ wn(0, σ²w).
The designation white originates from the analogy with white light and indicates that all possible periodic oscillations are present with equal strength.
White independent noise: wt ∼ iid(0, σ²w).
Gaussian white noise: wt ∼ iid N(0, σ²w).
7
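A quick sanity check (not from the slides): the following minimal R sketch simulates Gaussian white noise and verifies that the sample mean, variance, and lag-1 correlation are close to their nominal values of 0, σ²w, and 0; the sample size n = 500 and σw = 1 are arbitrary choices.
set.seed(1)                    # arbitrary seed, for reproducibility
n <- 500; sigma_w <- 1         # assumed sample size and noise standard deviation
w <- rnorm(n, 0, sigma_w)      # Gaussian white noise w_t ~ iid N(0, sigma_w^2)
mean(w)                        # should be near 0
var(w)                         # should be near sigma_w^2 = 1
cor(w[-1], w[-n])              # lag-1 sample correlation, should be near 0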
Example 10.2 Moving Averages
We might replace the white noise series wt by a moving average that smooths the series. For example, consider replacing wt in Example 10.1 by an average of its current value and its immediate neighbors in the past and future. That is, let
v_t = \frac{1}{3}(w_{t-1} + w_t + w_{t+1}),
which leads to a smoother version of the series, reflecting the fact that the slower oscillations are more apparent and some of the faster oscillations are taken out.
8

Gaussian white noise series (top) and three-point moving average of the Gaussian white noise series (bottom).
w = rnorm(500,0,1)                 # 500 N(0,1) variates
v = filter(w, sides=2, rep(1,3)/3) # moving average
par(mfrow=c(2,1))
plot.ts(w)
plot.ts(v)
9

Example 10.3 Autoregressions
Suppose we consider the second-order equation
x_t = x_{t-1} - 0.9\,x_{t-2} + w_t,
which represents a regression or prediction of the current value xt of a time series as a function of the past two values of the series; hence, the term autoregression is suggested for this model. The autoregressive model above and its generalizations can be used as an underlying model for many observed series and will be studied in detail later.
10

Example 10.4 Random Walk
A model for analyzing trend is the random walk with drift model given by
x_t = \delta + x_{t-1} + w_t
for t = 1, 2, . . ., with initial condition x0 = 0, and where wt is white noise. The constant δ is called the drift, and when δ = 0, the equation is called simply a random walk. The term random walk comes from the fact that, when δ = 0, the value of the time series at time t is the value of the series at time t − 1 plus a completely random movement determined by wt.
Alternative form: x_t = \delta t + \sum_{j=1}^{t} w_j
11

Simulate a Random Walk by R
set.seed(154)
w = rnorm(200,0,1); x = cumsum(w)  # random walk without drift
wd = w + .2; xd = cumsum(wd)       # random walk with drift delta = .2
plot.ts(xd, ylim=c(-5,55))
lines(x)
lines(.2*(1:200), lty="dashed")    # the deterministic trend delta*t
12

Definitions of Time Series
Suppose we have a collection of n random variables at arbitrary integer time points t1, t2, . . . , tn.
Joint distribution function: F(c_1, c_2, \ldots, c_n) = P(x_{t_1} \le c_1, x_{t_2} \le c_2, \ldots, x_{t_n} \le c_n)
If i.i.d. N(0,1): F(c_1, \ldots, c_n) = \prod_{i=1}^{n} \Phi(c_i), where \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp(-z^2/2)\, dz
The one-dimensional distribution functions: F_t(x) = P(x_t \le x)
The corresponding one-dimensional density functions: f_t(x) = \partial F_t(x) / \partial x
13

Mean Function of Time Series
The mean function: \mu_{xt} = E(x_t) = \int_{-\infty}^{\infty} x\, f_t(x)\, dx
Mean function of a moving average series of white noise: \mu_{vt} = E(v_t) = \frac{1}{3}[E(w_{t-1}) + E(w_t) + E(w_{t+1})] = 0
Mean function of a random walk with drift: \mu_{xt} = E(x_t) = \delta t + \sum_{j=1}^{t} E(w_j) = \delta t
14

Autocovariance
Autocovariance function: \gamma_x(s, t) = \mathrm{cov}(x_s, x_t) = E[(x_s - \mu_s)(x_t - \mu_t)]
Note that γx(s, t) = γx(t, s) for all time points s and t. The autocovariance measures the linear dependence between two points on the same series observed at different times.
Note: if two variables are independent, then they are uncorrelated, but not all uncorrelated variables are independent. If, however, xs and xt are bivariate normal (e.g., Gaussian white noise), γx(s, t) = 0 ensures their independence.
It is clear that, for s = t, the autocovariance reduces to the (assumed finite) variance, because \gamma_x(t, t) = E[(x_t - \mu_t)^2] = \mathrm{var}(x_t).
15

Examples of Autocovariance
Autocovariance of white noise: \gamma_w(s, t) = \mathrm{cov}(w_s, w_t) = \sigma_w^2 if s = t, and 0 otherwise.
Autocovariance of a 3-point moving average (σ²w = 1): it is convenient to calculate it as a function of the separation, s − t = h, say, for h = 0, ±1, ±2, . . .. For example, with h = 0,
\gamma_v(t, t) = \mathrm{cov}(v_t, v_t) = \mathrm{cov}\{\tfrac{1}{3}(w_{t-1} + w_t + w_{t+1}),\ \tfrac{1}{3}(w_{t-1} + w_t + w_{t+1})\} = \tfrac{3}{9}\sigma_w^2
16

Autocovariance of a 3-point Moving Average (σ²w = 1) cont’d
When h = 1,
\gamma_v(t+1, t) = \mathrm{cov}\{\tfrac{1}{3}(w_t + w_{t+1} + w_{t+2}),\ \tfrac{1}{3}(w_{t-1} + w_t + w_{t+1})\} = \tfrac{2}{9}\sigma_w^2
When h = 2, autocovariance??? What about h = 3??
17

Autocovariance of a Random Walk
\gamma_x(s, t) = \mathrm{cov}(x_s, x_t) = \mathrm{cov}\Big(\sum_{j=1}^{s} w_j,\ \sum_{k=1}^{t} w_k\Big) = \min\{s, t\}\,\sigma_w^2
min{s, t} is the number of product pairs E(wj wk) with j = k (each of which equals σ²w).
18
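As a rough illustration (not part of the original slides), the autocovariance of a driftless random walk, γx(s, t) = min{s, t} σ²w, can be checked by Monte Carlo simulation in R; the choices s = 30, t = 50, σw = 1, and 10,000 replications are arbitrary.
set.seed(1)
nrep <- 10000; s <- 30; t <- 50
xs <- numeric(nrep); xt <- numeric(nrep)
for (r in 1:nrep) {
  x <- cumsum(rnorm(t, 0, 1))  # one random walk path x_1, ..., x_t with sigma_w = 1
  xs[r] <- x[s]; xt[r] <- x[t]
}
cov(xs, xt)                    # should be close to min(s, t)*sigma_w^2 = 30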
Autocorrelation Function (ACF) and Cross-correlation Function (CCF)
The autocorrelation function (ACF) is defined as
\rho(s, t) = \frac{\gamma(s, t)}{\sqrt{\gamma(s, s)\,\gamma(t, t)}}
The ACF measures the linear predictability of the series at time t, say, xt, using only the value xs. If we can predict xt perfectly from xs through a linear relationship, xt = β0 + β1 xs, then the correlation will be 1 when β1 > 0, and −1 when β1 < 0.
Often, we would like to measure the predictability of another series yt from the series xs. Assuming both series have finite variances, we define the cross-covariance function
\gamma_{xy}(s, t) = \mathrm{cov}(x_s, y_t) = E[(x_s - \mu_{xs})(y_t - \mu_{yt})]
and the cross-correlation function (CCF)
\rho_{xy}(s, t) = \frac{\gamma_{xy}(s, t)}{\sqrt{\gamma_x(s, s)\,\gamma_y(t, t)}}
19

Relative Separation by Shift h, Instead of Absolute Positions (s, t)
In the definitions so far, the autocovariance and cross-covariance functions may change as one moves along the series because the values depend on both s and t, the locations of the points in time.
Sometimes, the autocovariance function depends only on the separation of xs and xt, say, h = |s − t|, and not on where the points are located in time. As long as the points are separated by h units, the location of the two points does not matter.
This notion, called weak stationarity when the mean is also constant, is fundamental in allowing us to analyze sample time series data when only a single series is available.
20

Stationary Time Series
A strictly (strongly) stationary time series is one for which the probabilistic behavior of every collection of values {x_{t_1}, x_{t_2}, \ldots, x_{t_k}} is identical to that of the time-shifted set {x_{t_1+h}, x_{t_2+h}, \ldots, x_{t_k+h}}.
That is equivalent to requiring the same joint distribution,
P\{x_{t_1} \le c_1, \ldots, x_{t_k} \le c_k\} = P\{x_{t_1+h} \le c_1, \ldots, x_{t_k+h} \le c_k\},
for all k = 1, 2, ..., all time points t1, t2, . . . , tk, all numbers c1, c2, . . . , ck, and all time shifts h = 0, ±1, ±2, ... .
21

Comments on Strict Stationarity
If a time series is strictly stationary, then all of the multivariate distribution functions for subsets of variables must agree with their counterparts in the shifted set for all values of the shift parameter h. When k = 2, we can write
P\{x_s \le c_1, x_t \le c_2\} = P\{x_{s+h} \le c_1, x_{t+h} \le c_2\}
for any time points s and t and shift h. Thus, if the variance function of the process exists, the above equation implies that the autocovariance function of the series xt satisfies
\gamma(s, t) = \gamma(s + h, t + h)
for all s and t and h. We may interpret this result by saying the autocovariance function of the process depends only on the time difference between s and t, and not on the actual times.
22

Too Strong, What is the other option?
A weakly stationary time series, xt, is a finite variance process such that
(i) the mean value function, \mu_t = E(x_t), is constant and does not depend on time t, and
(ii) the autocovariance function, \gamma(s, t) = \mathrm{cov}(x_s, x_t) = E[(x_s - \mu)(x_t - \mu)], depends on s and t only through their difference |s − t| = h.
In that case, \gamma(t + h, t) = \mathrm{cov}(x_{t+h}, x_t) = \mathrm{cov}(x_h, x_0) = \gamma(h, 0), so γ(h, 0) does not depend on the time argument t; we have assumed that var(xt) = γ(0, 0) < ∞.
Henceforth, for convenience, we will drop the second argument of γ(h, 0). We can drop the second parameter!
23

Autocovariance Function of a Stationary Time Series
The autocovariance function of a stationary time series will be written as
\gamma(h) = \mathrm{cov}(x_{t+h}, x_t) = E[(x_{t+h} - \mu)(x_t - \mu)]
The autocorrelation function (ACF) of a stationary time series will be written as
\rho(h) = \frac{\gamma(t + h, t)}{\sqrt{\gamma(t + h, t + h)\,\gamma(t, t)}} = \frac{\gamma(h)}{\gamma(0)}
24

Stationarity of White Noise
The autocovariance function of the white noise series of Example 10.1 is \gamma_w(h) = \mathrm{cov}(w_{t+h}, w_t), which can be easily evaluated as \gamma_w(h) = \sigma_w^2 for h = 0 and \gamma_w(h) = 0 for h \ne 0.
This means that the series is weakly stationary, or stationary. If the white noise variates are also normally distributed, or Gaussian, the series is also strictly stationary, as can be seen from the definition.
25

Stationarity of a 3-point Moving Average
The three-point moving average process used in Example 10.2 is stationary because we may write the autocovariance function obtained earlier as a function of the lag alone:
\gamma_v(h) = \tfrac{3}{9}\sigma_w^2 for h = 0, \tfrac{2}{9}\sigma_w^2 for |h| = 1, \tfrac{1}{9}\sigma_w^2 for |h| = 2, and 0 for |h| > 2.
What about the stationarity of a random walk with drift, x_t = \delta t + \sum_{j=1}^{t} w_j? Not stationary, since µt = δt, which changes with t.
26

Useful Properties of the Autocovariance of a Stationary Time Series
First, the value at h = 0, namely \gamma(0) = E[(x_t - \mu)^2], is the variance of the time series; note that the Cauchy–Schwarz inequality implies |\gamma(h)| \le \gamma(0) for all h.
Another useful property is that the autocovariance function of a stationary series is symmetric around the origin, \gamma(h) = \gamma(-h). This property follows because shifting the series by h means that
\gamma(h) = \mathrm{cov}(x_{t+h}, x_t) = \mathrm{cov}(x_t, x_{t-h}) = \gamma(-h).
27
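To connect the theory with sample quantities, here is a small R sketch (not from the slides) that compares the sample ACF of a simulated three-point moving average, computed with R's acf, against the theoretical values ρ(0) = 1, ρ(1) = 2/3, ρ(2) = 1/3, and ρ(h) = 0 for |h| > 2; the sample size n = 10,000 is an arbitrary choice.
set.seed(1)
w <- rnorm(10000, 0, 1)
v <- filter(w, sides = 2, rep(1, 3)/3)       # three-point moving average of white noise
acf(na.omit(v), lag.max = 5, plot = FALSE)   # compare with 1, 2/3, 1/3, 0, 0, 0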
Joint Stationarity
Two time series, say, xt and yt, are said to be jointly stationary if they are each stationary, and the cross-covariance function
\gamma_{xy}(h) = \mathrm{cov}(x_{t+h}, y_t) = E[(x_{t+h} - \mu_x)(y_t - \mu_y)]
is a function only of lag h.
The cross-correlation function (CCF) of jointly stationary time series xt and yt is defined as
\rho_{xy}(h) = \frac{\gamma_{xy}(h)}{\sqrt{\gamma_x(0)\,\gamma_y(0)}}
28

Example 10.5 Joint Stationarity
Consider the two series, xt and yt, formed from the sum and difference of two successive values of a white noise process, say,
x_t = w_t + w_{t-1} and y_t = w_t - w_{t-1},
where wt are independent random variables with zero means and variance σ²w. It is easy to show that γx(0) = γy(0) = 2σ²w and γx(1) = γx(−1) = σ²w, γy(1) = γy(−1) = −σ²w. Also,
\gamma_{xy}(1) = \mathrm{cov}(x_{t+1}, y_t) = \mathrm{cov}(w_{t+1} + w_t,\ w_t - w_{t-1}) = \sigma_w^2,
because only one product is nonzero. Similarly, γxy(0) = 0, γxy(−1) = −σ²w. Clearly, the autocovariance and cross-covariance functions depend only on the lag separation, h, so the series are jointly stationary.
29

Comments on Weak Stationarity
The concept of weak stationarity forms the basis for much of the analysis performed with time series. The fundamental properties of the mean function and autocovariance function are satisfied by many theoretical models that appear to generate plausible sample realizations.
30

Correlation Estimation from Samples
Although the theoretical autocorrelation and cross-correlation functions are useful for describing the properties of certain hypothesized models, most analyses must be performed using sampled data. This limitation means that only the sampled points x1, x2, . . . , xn are available for estimating the mean, autocovariance, and autocorrelation functions. In the usual situation with only one realization, therefore, the assumption of stationarity becomes critical.
31

Correlation Estimation from Samples cont’d
Sample mean: \bar{x} = \frac{1}{n}\sum_{t=1}^{n} x_t
Sample autocovariance function: \hat{\gamma}(h) = \frac{1}{n}\sum_{t=1}^{n-h} (x_{t+h} - \bar{x})(x_t - \bar{x}), with \hat{\gamma}(-h) = \hat{\gamma}(h) for h = 0, 1, . . . , n − 1.
Sample autocorrelation function: \hat{\rho}(h) = \hat{\gamma}(h) / \hat{\gamma}(0)
32

Correlation Estimation from Samples cont’d
Sample cross-covariance function: \hat{\gamma}_{xy}(h) = \frac{1}{n}\sum_{t=1}^{n-h} (x_{t+h} - \bar{x})(y_t - \bar{y})
Sample cross-correlation function: \hat{\rho}_{xy}(h) = \frac{\hat{\gamma}_{xy}(h)}{\sqrt{\hat{\gamma}_x(0)\,\hat{\gamma}_y(0)}}
33

Example 10.5 A Simulated Time Series
Consider a contrived set of data generated by tossing a fair coin, letting xt = 1 when a head is obtained and xt = −1 when a tail is obtained. Construct yt as
y_t = 5 + x_t - .7\,x_{t-1}.
Table 1.1 shows sample realizations of the appropriate processes with x0 = −1 and n = 10.
34

Example 10.5 A Simulated Time Series cont’d
The sample autocorrelation for the series yt can be calculated using the equations on page 32 for h = 0, 1, 2, . . .. It is not necessary to calculate it for negative values because of the symmetry. For example, for h = 3, the autocorrelation becomes the ratio
\hat{\rho}_y(3) = \frac{\hat{\gamma}_y(3)}{\hat{\gamma}_y(0)} = \frac{\sum_{t=1}^{7} (y_{t+3} - \bar{y})(y_t - \bar{y})}{\sum_{t=1}^{10} (y_t - \bar{y})^2},
with the sample mean \bar{y} = 5.14.
35

Example 10.5 A Simulated Time Series cont’d
The theoretical ACF can be obtained from the model on page 34 using the fact that the mean of xt is zero and the variance of xt is one. It can be shown that
\rho_y(1) = \frac{-.7}{1 + .7^2} = -.47
and ρy(h) = 0 for |h| > 1 (try it after class).
Table 1.2 compares the theoretical ACF with sample ACFs for a realization where n = 10 and another realization where n = 100; we note the increased variability in the smaller sample.
36

Example 10.6 ACF of Speech Signal
The figure below shows the ACF of the speech series. The original series appears to contain a sequence of repeating short signals. The ACF confirms this behavior, showing repeating peaks spaced at about 106-109 points. Autocorrelation functions of the short signals appear, spaced at the intervals mentioned above.
The distance between the repeating signals is known as the pitch period and is a fundamental parameter of interest in systems that encode and decipher speech. Because the series is sampled at 10,000 points per second, the pitch period appears to be between .0106 and .0109 seconds.
vs_file <- 'C:/Docs_Qing/Courses/Transportation Analytics/incoming/Book_R_TimeSeries_Shumway/chapter1/speech.dat'
speech = scan(vs_file)
acf(speech, 250)
37
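As a companion to the joint stationarity example and the sample CCF formula on page 33, the following R sketch (not from the slides) simulates the sum and difference series and computes their sample cross-correlations; n = 1000 is an arbitrary choice.
set.seed(1)
n <- 1000
w <- rnorm(n + 1, 0, 1)              # w_0, w_1, ..., w_n
x <- w[-1] + w[-(n + 1)]             # x_t = w_t + w_{t-1}
y <- w[-1] - w[-(n + 1)]             # y_t = w_t - w_{t-1}
ccf(x, y, lag.max = 3, plot = FALSE) # lag k estimates cor(x_{t+k}, y_t)
# theoretical values: rho_xy(0) = 0, rho_xy(1) = 1/2, rho_xy(-1) = -1/2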
Exploratory Data Analysis
With time series data, it is the dependence between the values of the series that is important to measure; we must, at least, be able to estimate autocorrelations with precision. It would be difficult to measure that dependence if the dependence structure is not regular or is changing at every time point. Hence, to achieve any meaningful statistical analysis of time series data, it will be crucial that, if nothing else, the mean and the autocovariance functions satisfy the conditions of weak stationarity.
Methods to make a nonstationary time series stationary:
1. Detrending
2. Differencing
3. Other transformations
38

Trend in a Nonstationary Model
Perhaps the easiest form of nonstationarity to work with is the trend stationary model, wherein the process has stationary behavior around a trend. We may write this type of model as
x_t = \mu_t + y_t,
where xt are the observations, μt denotes the trend, and yt is a stationary process.
Quite often, a strong trend, μt, will obscure the behavior of the stationary process, yt. Hence, there is some advantage to removing the trend as a first step in an exploratory analysis of such time series. The steps involved are to obtain a reasonable estimate of the trend component, say \hat{\mu}_t, and then work with the residuals
\hat{y}_t = x_t - \hat{\mu}_t.
39

Example 11.1 Detrending Global Temperature
Here we suppose the model is of the form x_t = \mu_t + y_t, where, given the scatter plot of the temperature data, a straight line might be a reasonable model for the trend, i.e.,
\mu_t = \beta_0 + \beta_1 t (linear regression).
40

Example 11.1 Detrending Global Temperature cont’d
The figure above shows the data with the estimated trend line superimposed. To detrend, we simply subtract \hat{\mu}_t from the observations xt to obtain the detrended series
\hat{y}_t = x_t - \hat{\mu}_t.
41

Differencing
We saw that a random walk might also be a good model for trend. That is, rather than modeling trend as fixed, we might model trend as a stochastic component using the random walk with drift model,
\mu_t = \delta + \mu_{t-1} + w_t,
where wt is white noise and is independent of yt. If this is the appropriate model (see the random walk with drift on p. 11), then differencing the data, xt, yields a stationary process; that is,
x_t - x_{t-1} = (\mu_t + y_t) - (\mu_{t-1} + y_{t-1}) = \delta + w_t + y_t - y_{t-1}.
We leave it as a 5-minute class quiz to show that the above equation is stationary.
42

Differencing cont’d
Advantage of differencing:
* One advantage of differencing over detrending is that no parameters are estimated in the differencing operation.
Disadvantage of differencing:
* Differencing does not yield an estimate of the stationary process yt.
  • If an estimate of yt is essential, then detrending may be more appropriate.
  • If the goal is to coerce the data to stationarity, then differencing may be more appropriate.
43

Differencing Notation
Because differencing plays a central role in time series analysis, it receives its own notation. The first difference is denoted as
\nabla x_t = x_t - x_{t-1}.
As we have seen, the first difference eliminates a linear trend. In addition, a second difference, that is, the difference of the above equation,
\nabla^2 x_t = \nabla(x_t - x_{t-1}) = (x_t - x_{t-1}) - (x_{t-1} - x_{t-2}) = x_t - 2x_{t-1} + x_{t-2},
can eliminate a quadratic trend. Suppose x_t = \beta_1 t + \beta_2 t^2 + w_t; then
\nabla^2 x_t = 2\beta_2 + w_t - 2w_{t-1} + w_{t-2},
which is apparently stationary!
44
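A small R sketch (not from the slides) illustrating that second differencing removes a quadratic trend; the coefficients β1 = 0.5, β2 = 0.02 and the length n = 200 are arbitrary choices.
set.seed(1)
t <- 1:200
x <- 0.5*t + 0.02*t^2 + rnorm(200)   # quadratic trend plus white noise
d2 <- diff(x, differences = 2)       # second difference x_t - 2*x_{t-1} + x_{t-2}
par(mfrow = c(2, 1))
plot.ts(x)                           # trending, clearly nonstationary
plot.ts(d2)                          # fluctuates around 2*beta2 = 0.04, looks stationary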
Backshift Operator
We define the backshift operator by
B x_t = x_{t-1}
and extend it to powers, B^2 x_t = B(B x_t) = B x_{t-1} = x_{t-2}, and so on. Thus,
B^k x_t = x_{t-k}.
It is clear that we may then rewrite the first difference as
\nabla x_t = (1 - B) x_t
and the second difference as
\nabla^2 x_t = (1 - B)^2 x_t = (1 - 2B + B^2) x_t = x_t - 2x_{t-1} + x_{t-2},
or \nabla^2 = (1 - B)^2.
45

Backshift Operator cont’d
Differences of order d are defined as
\nabla^d = (1 - B)^d,
where we may expand the operator (1 − B)^d algebraically to evaluate for higher integer values of d. When d = 1, we drop it from the notation.
The differencing technique is an important component of the ARIMA model of Box and Jenkins (1970), to be discussed in Lec 12.
46

Other Transformations
Obvious aberrations are sometimes present that can contribute nonstationary as well as nonlinear behavior in observed time series. In such cases, transformations may be useful to equalize the variability over the length of a single series. A particularly useful transformation is
y_t = \ln x_t,
which tends to suppress larger fluctuations that occur over portions of the series where the underlying values are larger. (Question: how to handle non-positive xt?)
Other possibilities are power transformations in the Box–Cox family of the form
y_t = (x_t^{\lambda} - 1)/\lambda for λ ≠ 0, and y_t = \ln x_t for λ = 0.
47

Example of Box-Cox Transformation in R
b0 <- 10
b1 <- 0.5
t <- 1:100
wt <- rnorm(100)
lambda <- runif(1, 0.3, 0.9)
# construct a nonlinear time series xt
xt <- ((b0 + b1*t + wt)*lambda + 1)^(1/lambda)
plot(t, xt)
# assume we don't know lambda
# use the boxcox function to find the best lambda
library(MASS)
boxcox(xt ~ t, lambda = seq(-2, 2, 1/10))
Q: How to obtain the best lambda?
48

Smoothing in the Time Series Context - Moving Average Smoother
We discussed using a moving average to smooth white noise. This method is useful for discovering certain traits in a time series, such as long-term trend and seasonal components. In particular, if xt represents the observations, then
m_t = \sum_{j=-k}^{k} a_j x_{t-j},
where a_j = a_{-j} \ge 0 and \sum_{j=-k}^{k} a_j = 1, is a symmetric moving average of the data.
49

Smoothing in the Time Series Context - Polynomial Regression Smoother
The general setup for a time plot is
x_t = f_t + y_t,
where ft is some smooth function of time and yt is a stationary process. We may think of the moving average smoother mt as an estimator of ft. An obvious choice for ft is polynomial regression,
f_t = \beta_0 + \beta_1 t + \cdots + \beta_p t^p.
Difference from the moving average smoother:
* A problem with the polynomial regression smoother is that it assumes ft is the same function over the whole range of time, t; we might say that the technique is global.
* Moving average smoothers fit the data better because the technique is local; that is, moving average smoothers allow for the possibility that ft is a different function over time.
50

Smoothing in the Time Series Context - Kernel Smoothing
Kernel smoothing is a moving average smoother that uses a weight function, or kernel, to average the observations:
m_t = \sum_{i=1}^{n} w_i(t)\, x_i, where w_i(t) = K\left(\frac{t - i}{b}\right) \Big/ \sum_{j=1}^{n} K\left(\frac{t - j}{b}\right).
• This estimator is called the Nadaraya–Watson estimator (Watson, 1966).
• K(·) is a kernel function; typically, the normal kernel, K(z) = \frac{1}{\sqrt{2\pi}} \exp(-z^2/2).
• The wider the bandwidth, b, the smoother the result. Why??
• b/2 is the inter-quartile range of the kernel (when b = 104, the kernel smoother is roughly a 52-point moving average).
51

Different smoothing methods could produce vastly different results! (Figure: a 5-week vs. a 53-week moving average; a cubic trend (p = 3) vs. a cubic trend plus periodic regression; kernel bandwidths b = 10 vs. b = 104.)
52
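To see the bandwidth effect mentioned on page 51, the following R sketch (not from the slides) smooths a simulated series with two different bandwidths in ksmooth(); the series and the bandwidths 2 and 20 are arbitrary choices.
set.seed(1)
t <- 1:200
x <- sin(2*pi*t/50) + rnorm(200, 0, 0.5)
plot(t, x)
lines(ksmooth(t, x, kernel = "normal", bandwidth = 2), lty = 2)   # narrow bandwidth: wiggly
lines(ksmooth(t, x, kernel = "normal", bandwidth = 20), lwd = 2)  # wide bandwidth: smoother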
Common Smoothing Methods and R Functions

Smoothing method                      Scope    R function
Moving average                        Local    filter(…)
Polynomial regression smoother        Global   lm(…)
Kernel smoothing                      Global   ksmooth(…)
Nearest neighbor (Friedman, 1984)     Local    supsmu(…)
LOWESS                                Local    lowess(…)
Smoothing splines                     Global   smooth.spline(…)

LOWESS: locally weighted scatterplot smoothing, a technique similar to a moving average, but each smoothed value is given by a weighted linear least squares regression over the span.
Smoothing splines: minimize a compromise between the fit and the degree of smoothness.
53
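As a closing illustration (not from the slides), several of the R functions in the table can be applied to one simulated series for comparison; the series and the tuning parameters are arbitrary choices.
set.seed(1)
t <- 1:200
x <- 0.02*t + sin(2*pi*t/40) + rnorm(200, 0, 0.5)
plot(t, x)
lines(t, filter(x, rep(1, 5)/5), col = 2)         # 5-point moving average
lines(t, fitted(lm(x ~ poly(t, 3))), col = 3)     # cubic polynomial regression smoother
lines(lowess(t, x, f = 0.2), col = 4)             # LOWESS
lines(smooth.spline(t, x), col = 5)               # smoothing spline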