Lecture 8 – Time Series Analysis: Part I Basics
Dr. Qing He
qinghe@buffalo.edu
University at Buffalo
1
Introduction to Time Series Analysis
 Background
* The analysis of experimental data that have been observed at different
points in time leads to new and unique problems in statistical modeling
and inference.
* The obvious correlation introduced by the sampling of adjacent points in
time can severely restrict the applicability of many conventional
statistical methods, which traditionally depend on the assumption that
adjacent observations are independent and identically distributed.
* The systematic approach by which one goes about answering the
mathematical and statistical questions posed by these time correlations
is commonly referred to as time series analysis.
 Time series in the transportation area:
* Traffic data (speed, volume, occupancy) reported by a fixed detector
* Gas price over time
* Passenger volumes at airport terminals
* ……
2
Classification of Time-Series Patterns
3
Classification of time-series
 ‘Dimension’ of T
* Time, space-time
 Nature of T
* Discrete
• Equally spaced
• Not equally spaced
* Continuous
• Observed continuously
 Dimension of X
* Univariate
* Multivariate
 Memory types
* Stationary
* Nonstationary
 Linearity
* Linear
* Nonlinear
4
Approaches to Time Series Analysis
 The time domain approach
* motivated by the presumption that correlation between adjacent points
in time is best explained in terms of a dependence of the current value
on past values.
* Methods: ARIMA model, GARCH model, the state-space model, Kalman
filter (short-term traffic forecasting)
 The frequency domain approach
* assumes the primary characteristics of interest in time series analyses
relate to periodic or systematic sinusoidal variations found naturally in
most data.
* Methods: Spectral analysis, Wavelet analysis (real-time incident
detection)
 In many cases, the two approaches may produce similar answers
for long series, but the comparative performance over short
samples is better done in the time domain.
5
Time Series Statistical Models
 We assume a time series can be defined as a collection of
random variables indexed according to the order they are
obtained in time.
 For example, we may consider a time series as a sequence of
random variables, x1, x2, . . . , where the random variable x1
denotes the value taken by the series at the first time point, the
variable x2 denotes the value for the second time period…
 In general, a collection of random variables, {xt}, indexed by t, is
referred to as a stochastic process. Here, t will typically be discrete and
vary over the integers t = 0, ±1, ±2, ..., or some subset of the integers.
We use the term time series whether we are referring generically to the
process or to a particular realization.
6
Example 10.1 White Noise
 A simple kind of generated series might be a collection of
uncorrelated random variables, wt, with mean 0 and finite
variance σ_w². The time series generated from uncorrelated
variables is used as a model for noise in engineering
applications, where it is called white noise.
 We shall sometimes denote this process as wt ∼ wn(0, σ_w²). The
designation white originates from the analogy with white light
and indicates that all possible periodic oscillations are present
with equal strength.
 White independent noise: wt ∼ iid(0, σ_w²).
 Gaussian white noise: wt ∼ iid N(0, σ_w²).
7
Example 10.2 Moving Averages
 We might replace the white noise series wt by a moving average
that smooths the series. For example, consider replacing wt in
Example 10.1 by an average of its current value and its
immediate neighbors in the past and future. That is, let
v_t = (1/3)(w_{t−1} + w_t + w_{t+1}),
 which leads to a smoother version of the series, reflecting the fact
that the slower oscillations are emphasized and the faster
oscillations are attenuated.
8
Gaussian white noise series (top) and three-point moving
average of the Gaussian white noise series (bottom).
w = rnorm(500, 0, 1)                    # 500 N(0,1) variates
v = filter(w, sides = 2, rep(1, 3)/3)   # 3-point moving average
par(mfrow = c(2, 1))
plot.ts(w)
plot.ts(v)
9
Example 10.3 Autoregressions
 Suppose we consider the second-order equation
x_t = x_{t−1} − 0.9 x_{t−2} + w_t
 which represents a regression, or prediction, of the current value x_t of a
time series as a function of the past two values of the series; hence, the
term autoregression is suggested for this model.
 The autoregressive model above and its generalizations can be used as an
underlying model for many observed series and will be studied in detail later.
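A minimal R sketch (an addition, not from the original slides) that simulates this AR(2) model with filter() in recursive mode; the burn-in length and seed are arbitrary:
# Simulate x_t = x_{t-1} - 0.9 x_{t-2} + w_t
set.seed(1)
w = rnorm(550, 0, 1)                                   # white noise, with extra values for burn-in
x = filter(w, filter = c(1, -0.9), method = "recursive")[-(1:50)]  # recursive filter, drop burn-in
plot.ts(x)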
10
Example 10.4 Random Walk
 A model for analyzing trend is the random walk with drift model
given by
x_t = δ + x_{t−1} + w_t
 for t = 1, 2, . . ., with initial condition x0 = 0, and where wt is
white noise. The constant δ is called the drift, and when δ = 0,
above equation is called simply a random walk.
 The term random walk comes from the fact that, when δ = 0, the
value of the time series at time t is the value of the series at
time t − 1 plus a completely random movement determined by
wt.
 Alternative form:
x_t = δ t + Σ_{j=1}^{t} w_j
11
Simulate a Random Walk by R
set.seed(154)
w = rnorm(200, 0, 1); x = cumsum(w)   # random walk without drift
wd = w + .2; xd = cumsum(wd)          # random walk with drift delta = 0.2
plot.ts(xd, ylim = c(-5, 55))
lines(x)                              # overlay the driftless walk
lines(.2*(1:200), lty = "dashed")     # the deterministic trend delta*t
12
Definitions of Time Series
 Suppose we have a collection of n random variables at arbitrary integer
time points t_1, t_2, . . . , t_n.
 Joint distribution function:
F(c_1, c_2, . . . , c_n) = P(x_{t_1} ≤ c_1, x_{t_2} ≤ c_2, . . . , x_{t_n} ≤ c_n)
 If the variables are i.i.d. N(0,1), the joint distribution factors as
F(c_1, . . . , c_n) = Φ(c_1) Φ(c_2) · · · Φ(c_n),
where Φ(·) is the standard normal cumulative distribution function.
 The one-dimensional distribution functions: F_t(x) = P(x_t ≤ x)
 The corresponding one-dimensional density functions: f_t(x) = ∂F_t(x)/∂x
13
Mean Function of Time Series
 The mean function: μ_{xt} = E(x_t) = ∫ x f_t(x) dx, provided it exists.
 Mean function of a moving average of white noise:
μ_{vt} = E(v_t) = (1/3)[E(w_{t−1}) + E(w_t) + E(w_{t+1})] = 0
 Mean function of a random walk with drift:
μ_{xt} = E(x_t) = δ t + Σ_{j=1}^{t} E(w_j) = δ t
14
Autocovariance
 Autocovariance function:
γ_x(s, t) = cov(x_s, x_t) = E[(x_s − μ_s)(x_t − μ_t)]
Note that γ_x(s, t) = γ_x(t, s) for all time points s and t. The autocovariance
measures the linear dependence between two points on the same series
observed at different times.
Note: independence implies zero correlation, but uncorrelated variables are
not necessarily independent. If, however, x_s and x_t are bivariate normal
(e.g., Gaussian white noise), then γ_x(s, t) = 0 does ensure their independence.
 For s = t, the autocovariance reduces to the (assumed finite) variance, because
γ_x(t, t) = E[(x_t − μ_t)²] = var(x_t)
15
Examples of Autocovariance
 Autocovariance of white noise:
γ_w(s, t) = cov(w_s, w_t) = σ_w² if s = t, and 0 if s ≠ t
 Autocovariance of a 3-point moving average (σ_w² = 1):
It is convenient to calculate it as a function of the separation h = s − t,
for h = 0, ±1, ±2, . . ..
For example, with h = 0,
γ_v(t, t) = (1/9) cov(w_{t−1} + w_t + w_{t+1}, w_{t−1} + w_t + w_{t+1}) = (3/9) σ_w²
16
Autocovariance of a 3-point Moving Average (σ_w² = 1), cont'd
When h = 1,
γ_v(t + 1, t) = (1/9) cov(w_t + w_{t+1} + w_{t+2}, w_{t−1} + w_t + w_{t+1}) = (2/9) σ_w²
When h = 2, what is the autocovariance? What about h = 3?
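Before answering, these values can also be checked by simulation; an added sketch (assuming σ_w² = 1):
set.seed(2)
w = rnorm(100000, 0, 1)
v = filter(w, filter = rep(1, 3)/3, sides = 2)   # 3-point moving average
acf(v, lag.max = 4, type = "covariance", na.action = na.pass, plot = FALSE)
# compare the printed sample autocovariances at h = 0, 1, 2, 3 with the theoretical values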
17
Autocovariance of a Random Walk
γ_x(s, t) = cov(x_s, x_t) = cov( Σ_{j=1}^{s} w_j, Σ_{k=1}^{t} w_k ) = min{s, t} σ_w²
Because the white noise terms are uncorrelated, only the min{s, t} product
pairs E(w_j w_k) with j = k contribute, and each of those equals σ_w².
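A small Monte Carlo sketch (added; the replication count and time points are arbitrary) checking cov(x_s, x_t) ≈ min{s, t} σ_w² with σ_w² = 1:
set.seed(3)
nrep <- 5000; n <- 50
X <- t(apply(matrix(rnorm(nrep * n), nrep, n), 1, cumsum))  # each row is one random walk path
s0 <- 10; t0 <- 30
cov(X[, s0], X[, t0])   # should be close to min(s0, t0) = 10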
18
Autocorrelation Function (ACF) and Cross-correlation
Function (CCF)
 The autocorrelation function (ACF) is defined as
ρ(s, t) = γ(s, t) / sqrt( γ(s, s) γ(t, t) )
The ACF measures the linear predictability of the series at time t, say x_t,
using only the value x_s. If we can predict x_t perfectly from x_s through a
linear relationship, x_t = β_0 + β_1 x_s, then the correlation is 1 when β_1 > 0
and −1 when β_1 < 0.
 Often, we would like to measure the predictability of another series y_t from
the series x_s. Assuming both series have finite variances, we define the
cross-covariance function
γ_xy(s, t) = cov(x_s, y_t) = E[(x_s − μ_{xs})(y_t − μ_{yt})]
 and the cross-correlation function (CCF)
ρ_xy(s, t) = γ_xy(s, t) / sqrt( γ_x(s, s) γ_y(t, t) )
19
Relative Separation by Shift h, Instead of Absolute
Positions (s,t)
 In the definitions so far, the autocovariance and cross-covariance functions
may change as one moves along the series, because the values depend on both
s and t, the locations of the points in time.
 Sometimes, the autocovariance function only depends on the
separation of xs and xt, say, h = |s − t|, and not on where the
points are located in time. As long as the points are separated by
h units, the location of the two points does not matter.
 This notion, called weak stationarity, when the mean is
constant, is fundamental in allowing us to analyze sample time
series data when only a single series is available.
20
Stationary Time Series
 A strictly (strongly) stationary time series is one for which the
probabilistic behavior of every collection of values
{x_{t_1}, x_{t_2}, . . . , x_{t_k}}
 is identical to that of the time-shifted set
{x_{t_1 + h}, x_{t_2 + h}, . . . , x_{t_k + h}}
 That is, the joint distributions are the same:
P(x_{t_1} ≤ c_1, . . . , x_{t_k} ≤ c_k) = P(x_{t_1 + h} ≤ c_1, . . . , x_{t_k + h} ≤ c_k)
 for all k = 1, 2, ..., all time points t_1, t_2, . . . , t_k, all numbers c_1, c_2, . . . , c_k,
and all time shifts h = 0, ±1, ±2, ... .
21
Comments on Strictly Stationary
 If a time series is strictly stationary, then all of the multivariate
distribution functions for subsets of variables must agree with
their counterparts in the shifted set for all values of the shift
parameter h.
 When k = 2, we can write
P(x_s ≤ c_1, x_t ≤ c_2) = P(x_{s+h} ≤ c_1, x_{t+h} ≤ c_2)
 for any time points s and t and shift h. Thus, if the variance function of the
process exists, the above equation implies that the autocovariance function of
the series x_t satisfies
γ(s, t) = γ(s + h, t + h)
 for all s, t, and h. We may interpret this result by saying the
autocovariance function of the process depends only on the
time difference between s and t, and not on the actual times.
22
Too Strong, What is the other option?
 A weakly stationary time series, x_t, is a finite-variance process such that
 (i) the mean value function, μ_t = E(x_t), is constant and does not depend on
time t, and
 (ii) the autocovariance function, γ(s, t) = cov(x_s, x_t), depends on s and t
only through their difference |s − t| = h.
For such a series, γ(t + h, t) = γ(h, 0) does not depend on the time argument t;
we have also assumed that var(x_t) = γ(0, 0) < ∞. Henceforth, for convenience, we
drop the second argument and simply write γ(h).
23
Autocovariance Function of a Stationary Time Series
 The autocovariance function of a stationary time series will be written as
γ(h) = cov(x_{t+h}, x_t) = E[(x_{t+h} − μ)(x_t − μ)]
 The autocorrelation function (ACF) of a stationary time series will be written as
ρ(h) = γ(h) / γ(0)
24
Stationarity of White Noise
 The autocovariance function of the white noise series of Example 10.1 is
γ_w(h) = cov(w_{t+h}, w_t) = σ_w² if h = 0, and 0 if h ≠ 0.
 Since the mean is constant (zero) and the autocovariance depends only on h,
the series is weakly stationary, or simply stationary. If the white noise
variates are also normally distributed (Gaussian), the series is strictly
stationary as well, as can be seen from the definition.
25
Stationarity of a 3-point Moving Average
 The three-point moving average process of Example 10.2 is stationary because
we may write its autocovariance function as
γ_v(h) = (3/9) σ_w² for h = 0, (2/9) σ_w² for h = ±1, (1/9) σ_w² for h = ±2, and 0 for |h| > 2,
which depends only on the lag h, and its mean μ_{vt} = 0 is constant.

What about the stationarity of a random walk with drift,
x_t = δ t + Σ_{j=1}^{t} w_j ?
Not stationary, since μ_t = δ t keeps changing with t (and, in addition,
γ_x(s, t) = min{s, t} σ_w² depends on s and t themselves, not just on their separation).
26
Useful Properties of Autocovariance of a Stationary
Time Series
 First, the value at h = 0, namely
γ(0) = E[(x_t − μ)²],
 is the variance of the time series; note that the Cauchy–Schwarz inequality implies
|γ(h)| ≤ γ(0)
 for all h. Finally, γ(h) = γ(−h). This property follows because shifting the
series by h means that
γ(h) = cov(x_{t+h}, x_t) = cov(x_t, x_{t+h}) = γ(−h).
27
Joint Stationarity
 Two time series, say x_t and y_t, are said to be jointly stationary if
they are each stationary and the cross-covariance function
γ_xy(h) = cov(x_{t+h}, y_t) = E[(x_{t+h} − μ_x)(y_t − μ_y)]
 is a function only of the lag h.
 The cross-correlation function (CCF) of jointly stationary time series
x_t and y_t is defined as
ρ_xy(h) = γ_xy(h) / sqrt( γ_x(0) γ_y(0) )
28
Example 10.5 Joint Stationarity
 Consider the two series, x_t and y_t, formed from the sum and difference of
two successive values of a white noise process, say,
x_t = w_t + w_{t−1}   and   y_t = w_t − w_{t−1},
 where the w_t are independent random variables with zero means and variance σ_w².
It is easy to show that γ_x(0) = γ_y(0) = 2σ_w², γ_x(1) = γ_x(−1) = σ_w², and
γ_y(1) = γ_y(−1) = −σ_w². Also,
γ_xy(1) = cov(x_{t+1}, y_t) = cov(w_{t+1} + w_t, w_t − w_{t−1}) = σ_w²,
 because only one product has nonzero expectation. Similarly, γ_xy(0) = 0 and
γ_xy(−1) = −σ_w².
 Clearly, the autocovariance and cross-covariance functions depend only on the
lag separation, h, so the series are jointly stationary.
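An added simulation sketch (not from the slides) that checks these cross-covariances with ccf(), assuming σ_w² = 1:
set.seed(4)
w <- rnorm(10001)
x <- w[-1] + w[-length(w)]   # x_t = w_t + w_{t-1}
y <- w[-1] - w[-length(w)]   # y_t = w_t - w_{t-1}
ccf(x, y, lag.max = 3, type = "covariance", plot = FALSE)
# expect roughly sigma_w^2 at lag +1, 0 at lag 0, -sigma_w^2 at lag -1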
29
Comments on Weak Stationarity
 The concept of weak stationarity forms the basis for much of the
analysis performed with time series.
 The fundamental properties of the mean function, μ_t = μ (constant),
 and the autocovariance function, γ(h) depending only on the lag h,
 are satisfied by many theoretical models that appear to generate
plausible sample realizations.
30
Correlation Estimation from Samples
 Although the theoretical autocorrelation and cross-correlation
functions are useful for describing the properties of certain
hypothesized models, most of the analyses must be performed
using sampled data.
 This means that only the sampled points x1, x2, . . . , xn are
available for estimating the mean, autocovariance, and
autocorrelation functions.
 In the usual situation with only one realization, therefore, the
assumption of stationarity becomes critical.
31
Correlation Estimation from Samples cont’d
 Sample mean: x̄ = (1/n) Σ_{t=1}^{n} x_t
 Sample autocovariance function:
γ̂(h) = (1/n) Σ_{t=1}^{n−h} (x_{t+h} − x̄)(x_t − x̄),
with γ̂(−h) = γ̂(h) for h = 0, 1, . . . , n − 1.
 Sample autocorrelation function: ρ̂(h) = γ̂(h) / γ̂(0)
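An added sketch comparing a hand-rolled version of these formulas with R's acf() (which uses the same divisor n):
gamma_hat <- function(x, h) {
  n <- length(x); xbar <- mean(x)
  sum((x[(1 + h):n] - xbar) * (x[1:(n - h)] - xbar)) / n   # divisor n, as in the definition
}
set.seed(5)
x <- rnorm(200)
gamma_hat(x, 3) / gamma_hat(x, 0)          # hand-rolled rho_hat(3)
acf(x, lag.max = 3, plot = FALSE)$acf[4]   # should agree with the value above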
32
Correlation Estimation from Samples cont’d
 Sample cross-covariance function:
γ̂_xy(h) = (1/n) Σ_{t=1}^{n−h} (x_{t+h} − x̄)(y_t − ȳ), with γ̂_xy(−h) = γ̂_yx(h)
 Sample cross-correlation function:
ρ̂_xy(h) = γ̂_xy(h) / sqrt( γ̂_x(0) γ̂_y(0) )
33
Example 10.5 A Simulated Time Series
 Consider a contrived set of data generated by tossing a fair coin,
letting xt = 1 when a head is obtained and xt = −1 when a tail is
obtained. Construct yt as
y_t = 5 + x_t − 0.7 x_{t−1}
 Table 1.1 shows sample realizations of these processes with x_0 = −1 and n = 10.
34
Example 10.5 A Simulated Time Series Cont’d
 The sample autocorrelation for the series yt can be calculated using the
equations on page 32 for h = 0, 1, 2, . . .. It is not necessary to calculate it
for negative values because of the symmetry. For example, for h = 3, the sample
autocorrelation is the ratio ρ̂_y(3) = γ̂_y(3) / γ̂_y(0), where both quantities
are computed with the sample mean ȳ = 5.14 of this realization.
35
Example 10.5 A Simulated Time Series Cont’d
 The theoretical ACF can be obtained from the model on page 34 using the fact
that the mean of xt is zero and the variance of xt is one. It can be shown that
ρ_y(1) = −0.7 / (1 + 0.7²) ≈ −0.47,
and ρ_y(h) = 0 for |h| > 1 (try it after
class). Table 1.2 compares the
theoretical ACF with sample ACFs for
a realization where n = 10 and
another realization where n = 100;
we note the increased variability in
the smaller sample.
36
Example 10.6 ACF of Speech Signal
 The figure below shows the ACF of the speech series. The original series appears to contain a
sequence of repeating short signals. The ACF confirms this behavior, showing repeating peaks
spaced at about 106–109 points. The autocorrelation of the short signals themselves also
appears, spaced at the intervals mentioned above.
 The distance between the repeating signals is known as the pitch period and is a fundamental
parameter of interest in systems that encode and decipher speech. Because the series is sampled
at 10,000 points per second, the pitch period appears to be between 0.0106 and 0.0109 seconds.
vs_file <- 'C:/Docs_Qing/Courses/Transportation Analytics/incoming/Book_R_TimeSeries_Shumway/chapter1/speech.dat'
speech = scan(vs_file)
acf(speech, 250)
37
Exploratory Data Analysis
 With time series data, it is the dependence between the values
of the series that is important to measure; we must, at least, be
able to estimate autocorrelations with precision. It would be
difficult to measure that dependence if the dependence
structure is not regular or is changing at every time point.
 Hence, to achieve any meaningful statistical analysis of time
series data, it will be crucial that, if nothing else, the mean and
the autocovariance functions satisfy the conditions of weak
stationarity.
Methods to make a nonstationary time series
stationary
1. Detrending
2. Differencing
3. Other transformations
38
Trend in a Nonstationary Model
 Perhaps the easiest form of nonstationarity to work with is the
trend stationary model, in which the process has stationary
behavior around a trend. We may write this type of model as
x_t = μ_t + y_t,
 where x_t are the observations, μ_t denotes the trend, and y_t is a
stationary process. Quite often, a strong trend μ_t will obscure the
behavior of the stationary process y_t.
 Hence, there is some advantage to removing the trend as a first
step in an exploratory analysis of such time series. The steps
involved are to obtain a reasonable estimate of the trend
component, say μ̂_t, and then work with the residuals
ŷ_t = x_t − μ̂_t
39
Example 11.1 Detrending Global Temperature
 Here we suppose the model is of the form x_t = μ_t + y_t,
 where, given the scatter plot of the temperature data, a straight line
might be a reasonable model for the trend, i.e.,
μ_t = β_0 + β_1 t   (fit by linear regression)
40
Example 11.1 Detrending Global Temperature cont’d
 The figure above shows the data with the estimated trend line
superimposed. To obtain the detrended series we simply subtract
μ̂_t from the observations x_t:
ŷ_t = x_t − μ̂_t
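The temperature data are not reproduced here, so the following is an added sketch of the same detrending steps on simulated data (the trend coefficients and AR(1) noise are made up):
set.seed(6)
t <- 1:150
x <- -0.5 + 0.01 * t + arima.sim(list(ar = 0.6), n = 150)  # linear trend + stationary AR(1) noise
fit <- lm(x ~ t)            # straight-line trend estimate mu_hat_t
y_detrended <- resid(fit)   # detrended series x_t - mu_hat_t
par(mfrow = c(2, 1)); plot.ts(x); plot.ts(y_detrended)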
41
Differencing
 We saw that a random walk might also be a good model for trend. That is,
rather than modeling trend as fixed, we might model it as a stochastic
component using the random walk with drift model,
μ_t = δ + μ_{t−1} + w_t,
 where w_t is white noise and is independent of y_t.
 If the trend follows the random walk with drift model of page 11, then
differencing the data x_t = μ_t + y_t yields a stationary process; that is,
∇x_t = x_t − x_{t−1} = δ + w_t + y_t − y_{t−1}
We leave it as a 5-minute class quiz to show that this differenced series is stationary.
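A quick numerical sketch (added; it uses the pure random walk with drift of page 11, with no y_t component):
set.seed(7)
delta <- 0.2
x <- cumsum(delta + rnorm(500))   # random walk with drift, x_0 = 0
dx <- diff(x)                      # first difference: delta + w_t
mean(dx)                           # close to delta = 0.2
acf(dx, plot = FALSE)              # autocorrelations near zero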
42
Differencing cont’d
 Advantage of differencing:
* One advantage of differencing over detrending is that no parameters are
estimated in the differencing operation.
 Disadvantage of differencing:
* Differencing does not yield an estimate of the stationary process yt
• If an estimate of yt is essential, then detrending may be more
appropriate.
• If the goal is to coerce the data to stationarity, then differencing
may be more appropriate.
43
Differencing Notation
 Because differencing plays a central role in time series analysis,
it receives its own notation. The first difference is denoted as
∇x_t = x_t − x_{t−1}
 As we have seen, the first difference eliminates a linear trend.
 In addition, a second difference, that is, the difference of the equation above,
∇²x_t = ∇(x_t − x_{t−1}) = x_t − 2x_{t−1} + x_{t−2}
 can eliminate a quadratic trend. Suppose
x_t = β_0 + β_1 t + β_2 t² + w_t.
Then
∇²x_t = x_t − 2x_{t−1} + x_{t−2} = 2β_2 + w_t − 2w_{t−1} + w_{t−2},
which is apparently stationary!
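An added R sketch (coefficients made up) showing that second differencing removes a quadratic trend:
set.seed(8)
t <- 1:300
x <- 1 + 0.05 * t + 0.01 * t^2 + rnorm(300)   # quadratic trend plus noise
d2 <- diff(x, differences = 2)                 # x_t - 2 x_{t-1} + x_{t-2}
plot.ts(d2)                                    # fluctuates around 2*beta_2 = 0.02, no remaining trend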
44
Backshift Operator
 We define the backshift operator by
B x_t = x_{t−1}
 and extend it to powers, B² x_t = B(B x_t) = B x_{t−1} = x_{t−2}, and so on. Thus,
B^k x_t = x_{t−k}
 It is clear that we may then rewrite the first difference as
∇x_t = (1 − B) x_t
 and the second difference as
∇²x_t = (1 − 2B + B²) x_t
 or
∇²x_t = (1 − B)² x_t
45
Backshift Operator cont’d
 Differences of order d are defined as
∇^d x_t = (1 − B)^d x_t,
 where we may expand the operator (1 − B)^d algebraically to evaluate it for
higher integer values of d. When d = 1, we drop it from the notation.
 The differencing technique is an important component of the ARIMA model of
Box and Jenkins (1970), to be discussed in Lec 12.
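A quick sanity check (an added sketch, not from the slides) that R's diff(…, differences = d) implements (1 − B)^d, here for d = 2:
set.seed(9)
x <- cumsum(cumsum(rnorm(20)))                 # a short series integrated twice
n <- length(x)
d_builtin  <- diff(x, differences = 2)         # (1 - B)^2 x_t
d_expanded <- x[3:n] - 2 * x[2:(n - 1)] + x[1:(n - 2)]
all.equal(as.numeric(d_builtin), d_expanded)   # TRUE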
46
Other Transformations
 Obvious aberrations are sometimes present that can contribute
nonstationary as well as nonlinear behavior to an observed time
series. In such cases, transformations may be useful to equalize
the variability over the length of a single series. A particularly
useful transformation is
y_t = log x_t
(Question: how do we handle non-positive x_t?)
 which tends to suppress larger fluctuations that occur over
portions of the series where the underlying values are larger.
 Other possibilities are power transformations in the Box–Cox
family of the form
y_t = (x_t^λ − 1) / λ for λ ≠ 0, and y_t = log x_t for λ = 0.
47
Example of Box-Cox Transformation in R
b0 <- 10
b1 <- 0.5
t <- 1:100
wt <- rnorm(100)
lambda <- runif(1, 0.3, 0.9)
# construct a nonlinear time series xt whose Box-Cox transform is linear in t
xt <- ((b0 + b1*t + wt)*lambda + 1)^(1/lambda)
plot(t, xt)
# assume we don't know lambda; use boxcox() to find the best lambda
library(MASS)
boxcox(xt ~ t, lambda = seq(-2, 2, 1/10))
The Box–Cox family:
y_t = (x_t^λ − 1) / λ   if λ ≠ 0
y_t = ln x_t            if λ = 0
Q: How to obtain the best lambda?
48
Smoothing in the Time Series Context
- Moving Average Smoother
 We discussed using a moving average to smooth white noise. This
method is also useful for discovering certain traits in a time
series, such as long-term trend and seasonal components.
 In particular, if x_t represents the observations, then
m_t = Σ_{j=−k}^{k} a_j x_{t−j},
 where a_j = a_{−j} ≥ 0 and Σ_j a_j = 1, is a symmetric moving average
of the data.
49
Smoothing in the Time Series Context
- Polynomial Regression Smoother
 The general setup for a time plot is
x_t = f_t + y_t,
 where f_t is some smooth function of time and y_t is a stationary
process. We may think of the moving average smoother m_t as an
estimator of f_t. An obvious choice for f_t is polynomial regression,
f_t = β_0 + β_1 t + · · · + β_p t^p.
 Difference from the moving average smoother:
* A problem with the polynomial regression smoother is that it assumes f_t is
the same function over the entire range of time t; we might say the technique
is global.
* Moving average smoothers often fit the data better because the technique is
local; that is, they allow f_t to behave like a different function in different
parts of the series (see the lm(…) sketch after this list).
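An added lm() sketch of a global cubic fit on simulated data (the signal and noise level are made up):
set.seed(10)
t <- 1:200
x <- sin(2 * pi * t / 100) + rnorm(200, sd = 0.3)   # smooth signal plus noise
fit <- lm(x ~ poly(t, 3))                            # one cubic fit over the whole range (global)
plot(t, x)
lines(t, fitted(fit), lwd = 2)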
50
Smoothing in the Time Series Context
- Kernel Smoothing
 Kernel smoothing is a moving average smoother that uses a
weight function, or kernel, to average the observations:
f̂_t = Σ_{i=1}^{n} w_i(t) x_i,
 where
w_i(t) = K((t − i)/b) / Σ_{j=1}^{n} K((t − j)/b).
• This estimator is called the Nadaraya–Watson estimator (Watson, 1966).
• K(·) is a kernel function; typically the normal kernel, K(z) = (1/√(2π)) exp(−z²/2).
• The wider the bandwidth b, the smoother the result. Why?
• b/2 is the interquartile range of the kernel (when b = 104, the kernel smoother is
roughly a 52-point moving average). A ksmooth() sketch follows below.
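An added ksmooth() sketch on simulated data (the signal and bandwidths are illustrative):
set.seed(11)
t <- 1:200
x <- sin(2 * pi * t / 100) + rnorm(200, sd = 0.3)
plot(t, x)
lines(ksmooth(t, x, kernel = "normal", bandwidth = 10), lwd = 2)           # narrower bandwidth
lines(ksmooth(t, x, kernel = "normal", bandwidth = 50), lwd = 2, lty = 2)  # wider bandwidth: smoother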
51
Different smoothing methods can produce vastly different results! (Figure panels:)
* Moving average: 5-week MA vs. 53-week MA
* Polynomial regression: cubic trend (p = 3) vs. cubic trend plus periodic regression
* Kernel smoothing: bandwidth b = 10 vs. b = 104
52
Common Smoothing Methods and R Functions

Smoothing method                      Scope    R function
Moving average                        Local    filter(…)
Polynomial regression smoother        Global   lm(…)
Kernel smoothing                      Global   ksmooth(…)
Nearest neighbor (Friedman, 1984)     Local    supsmu(…)
LOWESS                                Local    lowess(…)
  (locally weighted scatterplot smoothing; similar to a moving average,
   but each smoothed value is given by a weighted linear least squares
   regression over the span)
Smoothing splines                     Global   smooth.spline(…)
  (minimize a compromise between the fit and the degree of smoothness)
53