BUS 41910 Time Series Analysis
Linear State Space Models
Dacheng Xiu
University of Chicago Booth School of Business

References
- Forecasting, Structural Time Series Models and the Kalman Filter, by A. C. Harvey.
- Time Series Analysis, by J. Hamilton.
- Time Series Analysis by State Space Methods, 2nd Edition, by J. Durbin and S. J. Koopman.

General Form of Linear State Space Models
The state space model is given by:
  Observation equation:  y_t = Z_t α_t + ε_t,        ε_t ~ i.i.d. N(0, H_t),
  State equation:        α_{t+1} = T_t α_t + R_t η_t,  η_t ~ i.i.d. N(0, Q_t),   t = 1, 2, ..., n.
- y_t: p × 1 observation vector.
- α_t: m × 1 (unobservable) state vector.
- Z_t, T_t, R_t, H_t, Q_t are given (up to some unknown parameters).
- Z_t and T_{t-1} can depend on y_1, y_2, ..., y_{t-1}.
- R_t is a selection matrix, i.e., its columns are columns of I_m.
- α_1 ~ N(a_1, P_1), with a_1 and P_1 given.

Example: Local Level Model
A particular example is the local level model, which can be used for modeling transaction prices:
  y_t = α_t + ε_t,        ε_t ~ N(0, σ_ε²),
  α_{t+1} = α_t + η_t,    η_t ~ N(0, σ_η²),
where y_t is the observed price and α_t is the unobserved efficient price, modeled as a random walk.
- σ_ε² is the variance of the market microstructure noise, i.e., bid-ask bounces.
- σ_η² is the variance of the efficient price increments, i.e., the volatility of asset returns.

Example: VAR(1) Model
Consider the following VAR(1) model:
  Y_t = A Y_{t-1} + U_t.
It can be written (trivially) in the state space form
  y_t = α_t + ε_t,   α_t = T_t α_{t-1} + η_t,
where y_t = Y_t, α_t = Y_t, ε_t = 0, T_t = A, and η_t = U_t.

Example: ARMA Models
Suppose y_t follows an ARMA(2,1) model without a constant term:
  y_t = φ_1 y_{t-1} + φ_2 y_{t-2} + ζ_t + θ_1 ζ_{t-1},   ζ_t ~ i.i.d. N(0, σ_ζ²).
It can be written in state space form with state equation
  α_{t+1} = [ y_{t+1} ; φ_2 y_t + θ_1 ζ_{t+1} ]
          = [ φ_1  1 ; φ_2  0 ] [ y_t ; φ_2 y_{t-1} + θ_1 ζ_t ] + [ 1 ; θ_1 ] ζ_{t+1}.
- ε_t = 0, H_t = 0. Obviously, the state space form also accommodates observation errors.
- α_t = (y_t, φ_2 y_{t-1} + θ_1 ζ_t)'.
- Z_t = (1, 0).
- The state-space representation is not unique.
Homework: How to incorporate the constant µ?

Example: Regression Models
The regression model for a univariate y_t is given by
  y_t = x_t' β + ε_t,   ε_t ~ i.i.d. N(0, H_t),
where β is a k × 1 vector.
- Z_t = x_t', T_t = I_k.
- R_t = Q_t = 0.
- α_t = α_1 = β.
- Obviously, the state-space form also allows a time-varying β_t.
- It can also accommodate regression models with ARMA errors.

Filtering
- The object of filtering is to update our knowledge of the system each time a new observation y_t is brought in.
- The classical filtering method is the Kalman filter, which works under the normality assumption.
- We first develop the theory of filtering for the local level model.

Kalman Filter
- Let Y_{t-1} be the vector (y_1, ..., y_{t-1})', for t = 2, 3, ....
- Suppose α_t | Y_{t-1} ~ N(a_t, P_t) and α_t | Y_t ~ N(a_{t|t}, P_{t|t}).
- The goal is to calculate a_{t|t}, P_{t|t}, a_{t+1}, and P_{t+1} when y_t is brought in, so that we obtain the distribution of α_{t+1} | Y_t.
- Terminology: a_{t|t} is the filtered estimator of the state α_t, and a_{t+1} is the one-step-ahead predictor of α_{t+1}.

A Useful Result
Suppose that x and y are jointly normal with
  E[x; y] = [µ_x; µ_y],   Var[x; y] = [ Σ_xx  Σ_xy ; Σ_xy'  Σ_yy ].
Then the conditional distribution of x given y is normal with mean vector
  E(x|y) = µ_x + Σ_xy Σ_yy^{-1} (y − µ_y),    (1)
and variance matrix
  Var(x|y) = Σ_xx − Σ_xy Σ_yy^{-1} Σ_xy'.    (2)
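To make formulas (1) and (2) concrete, here is a minimal NumPy sketch, not part of the original slides, that computes the conditional mean and variance of one block of a jointly Gaussian vector. The function name and the numbers are illustrative assumptions; the scalar example anticipates the filtering update derived in the next section.

```python
import numpy as np

def conditional_gaussian(mu_x, mu_y, S_xx, S_xy, S_yy, y_obs):
    """E(x | y) and Var(x | y) for jointly Gaussian (x, y); see (1)-(2)."""
    S_yy_inv = np.linalg.inv(S_yy)
    cond_mean = mu_x + S_xy @ S_yy_inv @ (y_obs - mu_y)   # formula (1)
    cond_var = S_xx - S_xy @ S_yy_inv @ S_xy.T            # formula (2)
    return cond_mean, cond_var

# Scalar illustration in the spirit of the local level model: x = alpha_t,
# y = y_t, with Var(alpha_t) = P_t and Var(y_t) = P_t + sigma_eps^2.
P_t, sig_eps2, a_t, y_t = 1.5, 0.5, 0.0, 1.0
mean, var = conditional_gaussian(
    np.array([a_t]), np.array([a_t]),
    np.array([[P_t]]), np.array([[P_t]]), np.array([[P_t + sig_eps2]]),
    np.array([y_t]),
)
print(mean, var)  # [0.75] [[0.375]]: a_t + P_t/(P_t+sig_eps2)*v_t, P_t*sig_eps2/(P_t+sig_eps2)
```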
Kalman Filter
- Let v_t = y_t − a_t, for t = 1, 2, ..., n.
- Using (1) and (2), we have
  a_{t|t} = E(α_t | y_t, Y_{t-1})
          = E(α_t | Y_{t-1}) + [Cov(α_t, y_t | Y_{t-1}) / Var(y_t | Y_{t-1})] (y_t − E(y_t | Y_{t-1}))
          = a_t + P_t v_t / (P_t + σ_ε²),
  P_{t|t} = Var(α_t | y_t, Y_{t-1})
          = Var(α_t | Y_{t-1}) − Cov(α_t, y_t | Y_{t-1})² / Var(y_t | Y_{t-1})
          = P_t − P_t² / (P_t + σ_ε²) = P_t σ_ε² / (P_t + σ_ε²).
- Finally, using the state equation, we have
  a_{t+1} = E(α_{t+1} | Y_t) = E(α_t | Y_t) = a_{t|t},
  P_{t+1} = Var(α_{t+1} | Y_t) = Var(α_t | Y_t) + σ_η² = P_{t|t} + σ_η².
- To make these results consistent with the general results later, we introduce
  F_t = Var(v_t | Y_{t-1}) = P_t + σ_ε²,   K_t = P_t / F_t,
where F_t is the variance of the prediction error and K_t is the Kalman gain.

Kalman Filter: Updating Equations
To summarize:
  v_t = y_t − a_t,            F_t = P_t + σ_ε²,
  a_{t+1} = a_t + K_t v_t,    P_{t+1} = P_t (1 − K_t) + σ_η²,
for t = 1, 2, ..., n, where K_t = P_t / F_t.

Example: River Nile Data
Annual flow volume at Aswan from 1871 to 1970.
- R package 'KFAS': SSModel, fitSSM, KFS, fitted.
[Figure: filtered annual flow (left) and variance of the filtered values (right) over time.]

Forecast Errors
- Since v_t = y_t − a_t and a_t depends only on y_1, ..., y_{t-1}, we have p(v_1) = p(y_1) and p(v_t) = p(y_t | Y_{t-1}) (the Jacobian term is 1). Therefore the likelihood is given by
  p(y_1, y_2, ..., y_n) = ∏_{t=1}^{n} p(y_t | Y_{t-1}) = ∏_{t=1}^{n} p(v_t),
so v_1, v_2, ..., v_n are mutually independent.
- v_t ~ N(0, F_t), independently across t = 1, 2, ..., n.
- The Kalman filter can therefore be used for maximum likelihood estimation.

Error Recursions
- The state estimation error is x_t = α_t − a_t, with Var(x_t) = P_t. Therefore,
  v_t = y_t − a_t = α_t + ε_t − a_t = x_t + ε_t.
- Writing L_t = 1 − K_t, the error satisfies the recursion
  x_{t+1} = α_{t+1} − a_{t+1} = α_t + η_t − a_t − K_t v_t = x_t + η_t − K_t (x_t + ε_t) = L_t x_t + η_t − K_t ε_t.
- x_t is a linear combination of past x's, η's and ε's, so x_t ⊥ ε_t.
- These derivations will be useful for state smoothing.

State Smoothing
We now consider the estimation of α_1, ..., α_n given the entire sample path Y_n, i.e., state smoothing.
- The conditional density is α_t | Y_n ~ N(α̂_t, V_t).
- α̂_t is the smoothed state and V_t is the smoothed state variance.
By (1) again, we have
  α̂_t = E(α_t | Y_{t-1}) + Cov(α_t, Y_{t:n} | Y_{t-1}) Var(Y_{t:n} | Y_{t-1})^{-1} (Y_{t:n} − E(Y_{t:n} | Y_{t-1}))    (3)
      = a_t + ∑_{j=t}^{n} Cov(α_t, v_j) F_j^{-1} v_j.    (4)

Smoothed State
Since Cov(α_t, v_j) = Cov(x_t, v_j), and
  Cov(x_t, v_t) = E(x_t (x_t + ε_t)) = Var(x_t) = P_t,
  Cov(x_t, v_{t+1}) = E(x_t (x_{t+1} + ε_{t+1})) = E(x_t (L_t x_t + η_t − K_t ε_t + ε_{t+1})) = P_t L_t,
  ...
  Cov(x_t, v_n) = P_t L_t L_{t+1} ⋯ L_{n-1}.
Plugging this into (4), we obtain the backward state smoothing recursion:
  α̂_t = a_t + P_t r_{t-1},   r_{t-1} = v_t / F_t + L_t r_t,
with r_n = 0, for t = n, n−1, ..., 1.

Smoothed State Variance
By (2), we have
  V_t = Var(α_t | Y_n) = P_t − ∑_{j=t}^{n} Cov(α_t, v_j)² F_j^{-1}.
Using the same trick as before, we obtain the state variance smoothing recursion:
  V_t = P_t − P_t² N_{t-1},   N_{t-1} = F_t^{-1} + L_t² N_t,
for t = n, ..., 1, with N_n = 0.

Example: River Nile Data
[Figure: state smoothing — smoothed annual flow (left) and variance of the smoothed values (right) over time.]

Disturbance Smoothing
We now turn to the calculation of the smoothed disturbances, together with their variances:
  ε̂_t = E(ε_t | Y_n) = y_t − α̂_t,   η̂_t = E(η_t | Y_n) = α̂_{t+1} − α̂_t.
It is easier, however, to calculate them using r_t and N_t. The results are given here without proof:
  ε̂_t = σ_ε² u_t,   u_t = F_t^{-1} v_t − K_t r_t,
  Var(ε_t | Y_n) = σ_ε² − σ_ε⁴ D_t,   D_t = F_t^{-1} + K_t² N_t,
and
  η̂_t = σ_η² r_t,   Var(η_t | Y_n) = σ_η² − σ_η⁴ N_t,
for t = n, ..., 1.

Example: River Nile Data
[Figure: disturbance smoothing — smoothed ε (left) and smoothed η (right), 1871 to 1970.]
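The recursions above are straightforward to implement. Below is a minimal NumPy sketch for the local level model; the function names, the simulated data, and the large P1 used to approximate a diffuse prior (see the Initialization section later) are illustrative assumptions, not part of the slides or of the KFAS package.

```python
import numpy as np

def local_level_filter(y, sig_eps2, sig_eta2, a1, P1):
    """Forward Kalman filter recursion for the local level model."""
    n = len(y)
    a = np.empty(n + 1); P = np.empty(n + 1)   # predicted state mean and variance
    v = np.empty(n); F = np.empty(n); K = np.empty(n)
    a[0], P[0] = a1, P1
    for t in range(n):
        v[t] = y[t] - a[t]                     # prediction error v_t
        F[t] = P[t] + sig_eps2                 # F_t = Var(v_t | Y_{t-1})
        K[t] = P[t] / F[t]                     # Kalman gain K_t
        a[t + 1] = a[t] + K[t] * v[t]          # a_{t+1}
        P[t + 1] = P[t] * (1 - K[t]) + sig_eta2
    return a, P, v, F, K

def local_level_smoother(v, F, K, a, P, sig_eps2, sig_eta2):
    """Backward state and disturbance smoothing recursions."""
    n = len(v)
    r, N = 0.0, 0.0                            # r_n = 0, N_n = 0
    alpha_hat = np.empty(n); V = np.empty(n)
    eps_hat = np.empty(n); eta_hat = np.empty(n)
    for t in range(n - 1, -1, -1):             # t = n, ..., 1
        L = 1.0 - K[t]
        eps_hat[t] = sig_eps2 * (v[t] / F[t] - K[t] * r)   # smoothed epsilon_t
        eta_hat[t] = sig_eta2 * r                          # smoothed eta_t
        r = v[t] / F[t] + L * r                            # r_{t-1}
        N = 1.0 / F[t] + L**2 * N                          # N_{t-1}
        alpha_hat[t] = a[t] + P[t] * r                     # smoothed state
        V[t] = P[t] - P[t]**2 * N                          # smoothed state variance
    return alpha_hat, V, eps_hat, eta_hat

# Example usage on simulated data (true sigma_eps^2 = 1.0, sigma_eta^2 = 0.1).
rng = np.random.default_rng(0)
alpha = np.cumsum(rng.normal(0, np.sqrt(0.1), 100))
y = alpha + rng.normal(0, 1.0, 100)
a, P, v, F, K = local_level_filter(y, 1.0, 0.1, a1=y[0], P1=1e7)
alpha_hat, V, eps_hat, eta_hat = local_level_smoother(v, F, K, a, P, 1.0, 0.1)
```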
Missing Observations
- A distinct advantage of state space models is the ease with which missing observations can be dealt with. This is an important matter in practice.
- For example, transaction prices arrive irregularly and asynchronously, so that at any point in time certain prices are missing from the vector of all assets.

Dealing with Missing Observations
Suppose that for some 1 < τ < τ* ≤ n, the observations y_j, j = τ, ..., τ* − 1, are missing. For filtering at times t = τ, ..., τ* − 1, we have
  E(α_t | Y_t) = E(α_t | Y_{τ-1}) = E(α_τ + ∑_{j=τ}^{t-1} η_j | Y_{τ-1}) = a_τ,
  E(α_{t+1} | Y_t) = E(α_{t+1} | Y_{τ-1}) = a_τ,
  Var(α_t | Y_t) = Var(α_t | Y_{τ-1}) = Var(α_τ + ∑_{j=τ}^{t-1} η_j | Y_{τ-1}) = P_τ + (t − τ) σ_η²,
  Var(α_{t+1} | Y_t) = P_τ + (t − τ + 1) σ_η².

Dealing with Missing Observations
This leads to, for t = τ, ..., τ* − 1,
  a_{t|t} = a_t,   a_{t+1} = a_t,   P_{t|t} = P_t,   P_{t+1} = P_t + σ_η².
This amounts to replacing K_t = P_t / F_t by K_t = 0, i.e., no Kalman gain, at the missing time points, so the same code can be applied! Moreover, this simple twist applies to all formulas in the forecast error recursions, state smoothing, and disturbance smoothing.

Forecasting
Let ȳ_{n+j} be the minimum mean square error forecast of y_{n+j} given Y_n, for j = 1, 2, ..., J. Then we have
  ȳ_{n+j} = E(y_{n+j} | Y_n),   F̄_{n+j} = Var(y_{n+j} | Y_n).
The problem can be regarded as a missing observations problem, i.e., with τ = n + 1 and τ* = n + J + 1 in a filtering problem for y_t with t = 1, 2, ..., n + J.

Example: River Nile Data
[Figure: filtering with missing values — filtered annual flow (left) and variance of the filtered values (right) over time.]

Example: River Nile Data
[Figure: smoothing with missing values — smoothed annual flow (left) and variance of the smoothed values (right) over time.]

Initialization
- We now consider how to start up the filter when nothing about α_1 is known.
- It is reasonable to represent α_1 as having a diffuse prior density, i.e., fix a_1 at an arbitrary value and let P_1 → ∞.
- Plugging this in, we obtain a_2 = y_1 and P_2 = σ_ε² + σ_η², and we can then proceed as usual.
- This amounts to treating y_1 as given, with α_1 ~ N(y_1, σ_ε²).
- It is also equivalent to assuming α_1 is unknown and estimating it using y_1 (which is the MLE for α_1).

Likelihood Estimation
The log-likelihood is given by (assuming a_1 and P_1 are given up to some parameters)
  L = − (n/2) log(2π) − (1/2) ∑_{t=1}^{n} ( log F_t + v_t² / F_t ),
which can be easily computed using the Kalman filter. For the case with diffuse initialization, the likelihood is given by
  L_d = − ((n−1)/2) log(2π) − (1/2) ∑_{t=2}^{n} ( log F_t + v_t² / F_t ).
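As an illustration of maximum likelihood estimation based on the diffuse log-likelihood L_d, the sketch below evaluates L_d for the local level model with the Kalman filter and maximizes it numerically with scipy. The log-variance parameterization and the simulated data are assumptions made for this example, not part of the slides.

```python
import numpy as np
from scipy.optimize import minimize

def neg_diffuse_loglik(log_params, y):
    """Negative diffuse log-likelihood L_d of the local level model."""
    sig_eps2, sig_eta2 = np.exp(log_params)    # variances, kept positive
    n = len(y)
    # Diffuse initialization: a_2 = y_1, P_2 = sig_eps2 + sig_eta2.
    a, P = y[0], sig_eps2 + sig_eta2
    loglik = -0.5 * (n - 1) * np.log(2 * np.pi)
    for t in range(1, n):                      # sum over t = 2, ..., n
        v = y[t] - a
        F = P + sig_eps2
        K = P / F
        loglik -= 0.5 * (np.log(F) + v**2 / F)
        a = a + K * v
        P = P * (1 - K) + sig_eta2
    return -loglik

# Example usage on simulated data (true sigma_eps^2 = 1.0, sigma_eta^2 = 0.1).
rng = np.random.default_rng(1)
alpha = np.cumsum(rng.normal(0, np.sqrt(0.1), 200))
y = alpha + rng.normal(0, 1.0, 200)
res = minimize(neg_diffuse_loglik, x0=np.log([1.0, 1.0]), args=(y,),
               method="L-BFGS-B")
sig_eps2_hat, sig_eta2_hat = np.exp(res.x)
print(sig_eps2_hat, sig_eta2_hat)
```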
EM Algorithm
- The EM algorithm is a well-known tool for iterative maximum likelihood estimation; its application to state space models was developed by Shumway and Stoffer (1982) and Watson and Engle (1983).
- It consists of an E-step (expectation) and an M-step (maximization).
- For the state space model, it has a particularly neat form.

EM Algorithm for Local Level Model
As an illustration, we apply the EM algorithm to estimate the local level model. Note that the log-likelihood knowing α is
  log p(α, Y_n | θ) = const − (1/2) ∑_{t=1}^{n} { log σ_ε² + log σ_η² + ε_t² / σ_ε² + η_{t-1}² / σ_η² }.
The log-likelihood function of the data is given by
  log p(Y_n | θ) = log p(α, Y_n | θ) − log p(α | Y_n, θ).

EM Algorithm
- The E-step takes the conditional expectation, denoted Ẽ, of log p(Y_n | θ) with respect to the density p(α | Y_n, θ):
  log p(Y_n | θ) = Ẽ( log p(α, Y_n | θ) ) − Ẽ( log p(α | Y_n, θ) ).
- The M-step involves maximizing the likelihood with respect to θ:
  ∂ log p(Y_n | θ) / ∂θ = Ẽ( ∂ log p(α, Y_n | θ) / ∂θ ).
This leads to
  Ẽ { ∑_{t=1}^{n} ( 1/σ_ε² + 1/σ_η² − ε_t²/σ_ε⁴ − η_{t-1}²/σ_η⁴ ) } = 0.

EM Algorithm
The closed-form solution to the M-step is
  σ̂_ε² = (1/n) ∑_{t=1}^{n} Ẽ(ε_t²) = (1/n) ∑_{t=1}^{n} { ε̂_t² + Var(ε_t | Y_n) },
  σ̂_η² = (1/(n−1)) ∑_{t=2}^{n} Ẽ(η_{t-1}²) = (1/(n−1)) ∑_{t=2}^{n} { η̂_{t-1}² + Var(η_{t-1} | Y_n) }.
The procedure then repeats itself with the new trial values of (σ̂_ε², σ̂_η²) until convergence has been attained. Given each trial of the parameters, we need disturbance smoothing to update the expectations.

Summary for General Models
We now move to the general model introduced at the beginning of this lecture.
  Vectors:  y_t (p × 1),  α_t (m × 1),  ε_t (p × 1),  η_t (r × 1),  a_1 (m × 1).
  Matrices: Z_t (p × m),  T_t (m × m),  H_t (p × p),  R_t (m × r),  Q_t (r × r),  P_1 (m × m).

Kalman Filter Recursion
There is no difference in the proof compared to what we have done, except that some matrix algebra is needed:
  v_t = y_t − Z_t a_t,                      F_t = Z_t P_t Z_t' + H_t,
  a_{t|t} = a_t + P_t Z_t' F_t^{-1} v_t,    P_{t|t} = P_t − P_t Z_t' F_t^{-1} Z_t P_t,
  a_{t+1} = T_t a_t + K_t v_t,              P_{t+1} = T_t P_t (T_t − K_t Z_t)' + R_t Q_t R_t',
for t = 1, 2, ..., n, where K_t = T_t P_t Z_t' F_t^{-1}, and a_1 and P_1 are given.
  Vectors:  v_t (p × 1),  a_t (m × 1),  a_{t|t} (m × 1).
  Matrices: F_t (p × p),  K_t (m × p),  P_t (m × m),  P_{t|t} (m × m).

State and Disturbance Smoothing Recursion
  r_{t-1} = Z_t' F_t^{-1} v_t + L_t' r_t,             N_{t-1} = Z_t' F_t^{-1} Z_t + L_t' N_t L_t,
  α̂_t = a_t + P_t r_{t-1},                           V_t = P_t − P_t N_{t-1} P_t,
  ε̂_t = H_t (F_t^{-1} v_t − K_t' r_t),                η̂_t = Q_t R_t' r_t,
  Var(ε_t | Y_n) = H_t − H_t (F_t^{-1} + K_t' N_t K_t) H_t,   Var(η_t | Y_n) = Q_t − Q_t R_t' N_t R_t Q_t,
where L_t = T_t − K_t Z_t, for t = n, n−1, ..., 1, initialized with r_n = 0 and N_n = 0.
  Vectors:  r_t (m × 1),  α̂_t (m × 1),  u_t (p × 1),  ε̂_t (p × 1),  η̂_t (r × 1).
  Matrices: N_t (m × m),  V_t (m × m),  D_t (p × p).

Missing Observations
- If the entire vector y_t is missing for t = τ, ..., τ* − 1, e.g., in forecasting, then we should use Z_t = 0 for t = τ, ..., τ* − 1 in the formulas for a_{t|t}, P_{t|t}, a_{t+1}, P_{t+1}, r_{t-1}, N_{t-1}, K_t, and L_t.
- If only certain elements of the vector y_t are missing, e.g., under asynchronous trading, we define y_t* to be the vector of values actually observed. Note that the dimension of y_t* changes with t, and y_t* = W_t y_t, where W_t is the selection matrix whose rows are a subset of the rows of I. This implies that
  y_t* = Z_t* α_t + ε_t*,   ε_t* ~ N(0, H_t*),
where Z_t* = W_t Z_t, ε_t* = W_t ε_t, and H_t* = W_t H_t W_t'.
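To tie the general recursions together, here is a minimal NumPy sketch of the matrix-form filter and state smoother, assuming time-invariant system matrices for simplicity; the NaN convention for a fully missing y_t (which simply sets Z_t = 0, as described above, and requires H to be nonsingular) and all variable names are illustrative assumptions, not a definitive implementation.

```python
import numpy as np

def kalman_filter_smoother(y, Z, T, R, H, Q, a1, P1):
    """General filter and state smoother; y is (n, p), NaN rows are missing."""
    n, p = y.shape
    m = T.shape[0]
    a = np.zeros((n + 1, m)); P = np.zeros((n + 1, m, m))
    v = np.zeros((n, p)); Finv = np.zeros((n, p, p))
    K = np.zeros((n, m, p)); L = np.zeros((n, m, m))
    a[0], P[0] = a1, P1
    for t in range(n):                                       # forward pass
        miss = np.isnan(y[t]).any()
        Zt = np.zeros_like(Z) if miss else Z                 # Z_t = 0 if missing
        yt = np.zeros(p) if miss else y[t]
        v[t] = yt - Zt @ a[t]                                # v_t
        F = Zt @ P[t] @ Zt.T + H                             # F_t
        Finv[t] = np.linalg.inv(F)
        K[t] = T @ P[t] @ Zt.T @ Finv[t]                     # Kalman gain K_t
        L[t] = T - K[t] @ Zt                                 # L_t
        a[t + 1] = T @ a[t] + K[t] @ v[t]                    # a_{t+1}
        P[t + 1] = T @ P[t] @ L[t].T + R @ Q @ R.T           # P_{t+1}
    r = np.zeros(m); N = np.zeros((m, m))                    # r_n = 0, N_n = 0
    alpha_hat = np.zeros((n, m)); V = np.zeros((n, m, m))
    for t in range(n - 1, -1, -1):                           # backward pass
        Zt = np.zeros_like(Z) if np.isnan(y[t]).any() else Z
        r = Zt.T @ Finv[t] @ v[t] + L[t].T @ r               # r_{t-1}
        N = Zt.T @ Finv[t] @ Zt + L[t].T @ N @ L[t]          # N_{t-1}
        alpha_hat[t] = a[t] + P[t] @ r                       # smoothed state
        V[t] = P[t] - P[t] @ N @ P[t]                        # smoothed variance
    return a[:-1], P[:-1], alpha_hat, V
```

For the local level model, this reduces to the scalar recursions derived earlier, with Z = T = R = [[1.0]], H = [[σ_ε²]], and Q = [[σ_η²]].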