14.384 Time Series Analysis, Fall 2007
Professor Anna Mikusheva
Paul Schrimpf, scribe
October 23, 2007; revised October 22, 2009

Lecture 17: Unit Roots

Review from last time

Let $y_t$ be a random walk,
$$ y_t = \rho y_{t-1} + \epsilon_t, \qquad \rho = 1, $$
where $\epsilon_t$ is a martingale difference sequence ($E[\epsilon_t \mid I_{t-1}] = 0$) with $\frac{1}{T}\sum_t E[\epsilon_t^2 \mid I_{t-1}] \to \sigma^2$ a.s. and $E\epsilon_t^4 < K < \infty$. Then $\{\epsilon_t\}$ satisfies a functional central limit theorem:
$$ \xi_T(\tau) = \frac{1}{\sqrt{T}} \sum_{t=1}^{[\tau T]} \epsilon_t \Rightarrow \sigma W(\cdot). $$
We showed last time that
$$ \frac{1}{T}\sum y_{t-1}\epsilon_t \Rightarrow \sigma^2 \int_0^1 W\,dW, \qquad \frac{1}{T^{3/2}}\sum y_{t-1} \Rightarrow \sigma \int_0^1 W(t)\,dt, \qquad \frac{1}{T^2}\sum y_{t-1}^2 \Rightarrow \sigma^2 \int_0^1 W(t)^2\,dt, $$
and that the OLS estimator and t-statistic have non-standard limiting distributions:
$$ T(\hat\rho - \rho) \Rightarrow \frac{\int W\,dW}{\int W^2\,dt}, \qquad t \Rightarrow \frac{\int W\,dW}{\sqrt{\int W^2\,dt}}. $$
This is quite different from the stationary case $|\rho| < 1$. If $x_t = \rho x_{t-1} + \epsilon_t$ with $|\rho| < 1$, then
$$ \frac{1}{\sqrt{T}}\sum x_{t-1}\epsilon_t \Rightarrow N\!\left(0, \frac{\sigma^4}{1-\rho^2}\right), \qquad \frac{1}{T}\sum x_{t-1}^2 \xrightarrow{p} E x_t^2 = \frac{\sigma^2}{1-\rho^2}, $$
using $E x_{t-1}\epsilon_t = 0$, and the OLS estimate and t-statistic converge to normal distributions:
$$ \sqrt{T}(\hat\rho - \rho) \Rightarrow N(0, 1-\rho^2), \qquad t \Rightarrow N(0,1). $$

Cite as: Anna Mikusheva, course materials for 14.384 Time Series Analysis, Fall 2007. MIT OpenCourseWare (http://ocw.mit.edu), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].

Adding a constant

Suppose we add a constant to our OLS regression of $y_t$. This is equivalent to running OLS on the demeaned series. Let $y_t^m = y_t - \bar y$. Then
$$ \hat\rho - 1 = \frac{\sum y_{t-1}^m \epsilon_t}{\sum (y_{t-1}^m)^2}. $$
Consider the numerator:
$$ \frac{1}{T}\sum y_{t-1}^m \epsilon_t = \frac{1}{T}\sum (y_{t-1} - \bar y)\epsilon_t = \frac{1}{T}\sum y_{t-1}\epsilon_t - \left(\frac{1}{T^{3/2}}\sum y_{t-1}\right)\left(\frac{1}{\sqrt{T}}\sum \epsilon_t\right). $$
We know that $\frac{1}{T^{3/2}}\sum y_{t-1} \Rightarrow \sigma\int W\,dt$ and $\frac{1}{\sqrt{T}}\sum \epsilon_t \Rightarrow \sigma W(1)$. Also, from before, $\frac{1}{T}\sum y_{t-1}\epsilon_t \Rightarrow \sigma^2 \int W\,dW$. Combining these,
$$ \frac{1}{T}\sum y_{t-1}^m \epsilon_t \Rightarrow \sigma^2\left(\int W(s)\,dW(s) - W(1)\int W(t)\,dt\right). $$
We can think of this as a stochastic integral against a demeaned Brownian motion:
$$ \int W(s)\,dW(s) - W(1)\int W(t)\,dt = \int \left(W(s) - \int W(t)\,dt\right) dW(s) = \int W^m(s)\,dW(s). $$
The limiting distributions of all statistics ($\hat\rho$, $t$, etc.) change, but in most cases the change amounts only to replacing $W$ by $W^m$.
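The non-standard limit of $T(\hat\rho - 1)$ is easy to see by simulation. Below is a minimal Monte Carlo sketch (Python with numpy; the helper name `rho_ols` and the choices of $T$ and the number of replications are ours, not from the notes) that draws random walks and tabulates the statistic:

```python
import numpy as np

rng = np.random.default_rng(0)

def rho_ols(y):
    """OLS estimate of rho in y_t = rho*y_{t-1} + e_t (no constant)."""
    ylag, ycur = y[:-1], y[1:]
    return (ylag @ ycur) / (ylag @ ylag)

T, reps = 500, 2000
stats = np.empty(reps)
for r in range(reps):
    y = np.cumsum(rng.standard_normal(T))   # random walk (rho = 1)
    stats[r] = T * (rho_ols(y) - 1.0)

# The simulated distribution of T*(rho_hat - 1) is skewed and shifted to the
# left, unlike the symmetric normal limit of the stationary |rho| < 1 case.
print(np.mean(stats), np.quantile(stats, [0.025, 0.975]))
```

The sample mean and lower quantile come out clearly negative, matching the left-skewed Dickey-Fuller distribution discussed next.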
One can also include a linear trend in the regression; the limiting distribution then depends on a detrended Brownian motion.

Limiting Distribution

Let us return to the case without a constant. The limiting distribution of the OLS estimate is
$$ T(\hat\rho - 1) \Rightarrow \frac{\int W\,dW}{\int W^2\,dt}. $$
This distribution is skewed and shifted to the left. If we include a constant, the distribution is shifted further; if we also include a trend, it shifts yet more. For example, the 2.5th percentile is $-10.5$ without a constant, $-16.9$ with a constant, and $-25$ with a trend; the 97.5th percentiles are $1.6$, $0.41$, and $-1.8$, respectively. Thus, estimates of $\rho$ have a bias of order $\frac{1}{T}$, which is quite large in small samples.

The distribution of the t-statistic is also skewed and shifted to the left, but less so than that of the OLS estimate. The 2.5th percentiles are $-2.23$, $-3.12$, and $-3.66$, and the 97.5th percentiles are $1.6$, $0.23$, and $-0.66$, for no constant, with a constant, and with a trend, respectively.

Allowing Auto-correlation

So far we have assumed that the $\epsilon_t$ are not auto-correlated. This assumption is too strong for empirical work. Let
$$ y_t = y_{t-1} + v_t, $$
where $v_t$ is a zero-mean stationary process with MA representation
$$ v_t = \sum_{j=0}^{\infty} c_j \epsilon_{t-j}, $$
where $\epsilon_t$ is white noise. Now we need to look at the limiting distributions of each of the terms we looked at before ($\sum v_t$, $\sum y_{t-1}v_t$, $\sum y_t^2$, etc.). The first is a term we already encountered in HAC estimation:
$$ \frac{1}{\sqrt{T}}\sum_{t=1}^T v_t \Rightarrow N(0, \omega^2), $$
where $\omega^2$ is the long-run variance,
$$ \omega^2 = \sum_{j=-\infty}^{\infty} \gamma_j = \sigma^2 c(1)^2. $$
For the linearly dependent process $v_t$, we have the following functional central limit theorem:
$$ \xi_T(\tau) = \frac{1}{\sqrt{T}}\sum_{t=1}^{[T\tau]} v_t \Rightarrow \omega W(\cdot). $$
This can be proven using the Beveridge-Nelson decomposition.
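The long-run variance $\omega^2 = \sigma^2 c(1)^2$ appearing in this CLT can be checked numerically. A minimal sketch (Python with numpy; the MA(1) specification $v_t = \epsilon_t + 0.5\,\epsilon_{t-1}$ is an arbitrary illustrative choice, so $c(1) = 1.5$ and $\omega^2 = 2.25$):

```python
import numpy as np

rng = np.random.default_rng(1)

c1 = 0.5                               # MA(1): v_t = e_t + c1*e_{t-1}
omega2 = (1.0 + c1) ** 2               # sigma^2 * c(1)^2 with sigma^2 = 1

T, reps = 2000, 3000
sums = np.empty(reps)
for r in range(reps):
    e = rng.standard_normal(T + 1)
    v = e[1:] + c1 * e[:-1]
    sums[r] = v.sum() / np.sqrt(T)     # (1/sqrt(T)) * sum_t v_t

# Across replications, the variance of the scaled sum approximates the
# long-run variance omega^2, not the one-period variance E[v_t^2] = 1.25.
print(omega2, np.var(sums))
```

The simulated variance is close to $2.25$, illustrating why the long-run variance, rather than $E v_t^2$, governs the asymptotics below.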
All the other terms converge to limits analogous to the uncorrelated case, with $\sigma$ replaced by $\omega$. For example,
$$ \frac{y_{[T\tau]}}{\sqrt{T}} = \xi_T(\tau) \Rightarrow \omega W(\tau), \qquad \frac{1}{T^{3/2}}\sum y_{t-1} \Rightarrow \omega \int_0^1 W(s)\,ds. $$
An exception is
$$ \frac{1}{T}\sum y_{t-1} v_t = \frac{1}{2T} y_T^2 - \frac{1}{2T}\sum v_t^2 \Rightarrow \frac{1}{2}\left(\omega^2 W(1)^2 - \sigma_v^2\right) = \omega^2 \int W\,dW + \frac{\omega^2 - \sigma_v^2}{2}, $$
where $\sigma_v^2 = E v_t^2$ and the last equality uses $\int_0^1 W\,dW = \frac{1}{2}(W(1)^2 - 1)$. This leads to an extra constant in the distribution of $\hat\rho$:
$$ T(\hat\rho - 1) \Rightarrow \frac{\int W\,dW + \frac{\omega^2 - \sigma_v^2}{2\omega^2}}{\int W^2\,dt}. $$
This additional constant is a "nuisance" parameter, and thus the statistic is not pivotal, so it is impossible to simply look up critical values in a table. Phillips & Perron (1988) suggested corrected statistics:
$$ T(\hat\rho - 1) - \frac{\frac{1}{2}\left(\hat\omega^2 - \hat\sigma_v^2\right)}{\frac{1}{T^2}\sum y_{t-1}^2} \Rightarrow \frac{\int W\,dW}{\int W^2\,dt}, $$
where $\hat\omega^2$ is an estimate of the long-run variance (say, Newey-West) and $\hat\sigma_v^2$ is the variance of the residuals.

Augmented Dickey Fuller

Another approach is the augmented Dickey-Fuller test. Suppose
$$ y_t = \sum_{j=1}^p a_j y_{t-j} + \epsilon_t, $$
where $z = 1$ is a root of the characteristic polynomial,
$$ 1 - \sum_{j=1}^p a_j z^j = 0 \quad \text{at } z = 1, \qquad \text{i.e.} \qquad \sum_j a_j = 1. $$
Suppose we factor $a(L)$ as
$$ 1 - \sum_{j=1}^p a_j L^j = (1 - L)\, b(L), $$
with $b(L) = 1 - \sum_{j=1}^{p-1} \beta_j L^j$. We can write the model as
$$ \Delta y_t = \sum_{j=1}^{p-1} \beta_j \Delta y_{t-j} + \epsilon_t. $$
This suggests estimating
$$ y_t = \rho y_{t-1} + \sum_{j=1}^{p-1} \beta_j \Delta y_{t-j} + \epsilon_t $$
and testing whether $\rho = 1$ to see if we have a unit root. This is another way of allowing auto-correlation, because we can write the model as
$$ a(L) y_t = \epsilon_t, \qquad b(L)\Delta y_t = \epsilon_t, \qquad \Delta y_t = b(L)^{-1}\epsilon_t, \qquad y_t = y_{t-1} + v_t, $$
where $v_t = b(L)^{-1}\epsilon_t$ is auto-correlated.

The nice thing about the augmented Dickey-Fuller test is that the coefficients on $\Delta y_{t-j}$ each converge to a normal distribution at rate $\sqrt{T}$, while the coefficient on $y_{t-1}$ converges to a non-standard distribution without nuisance parameters. To see this, let $x_t = [y_{t-1}, \Delta y_{t-1}, \ldots, \Delta y_{t-p+1}]'$ and $\theta = [\rho, \beta_1, \ldots, \beta_{p-1}]'$. Consider:
$$ \hat\theta - \theta = (X'X)^{-1}(X'\epsilon). $$
We need to normalize this so that it converges. Our normalizing matrix will be
$$ Q = \begin{pmatrix} T & 0 & \cdots & 0 \\ 0 & \sqrt{T} & & \\ \vdots & & \ddots & \\ 0 & & & \sqrt{T} \end{pmatrix}. $$
Multiplying by $Q$ gives
$$ Q(\hat\theta - \theta) = \left(Q^{-1} X'X Q^{-1}\right)^{-1}\left(Q^{-1} X'\epsilon\right). $$
The first factor is
$$ Q^{-1} X'X Q^{-1} = \begin{pmatrix} \frac{1}{T^2}\sum y_{t-1}^2 & \frac{1}{T^{3/2}}\sum y_{t-1}\Delta y_{t-1} & \cdots \\ \frac{1}{T^{3/2}}\sum y_{t-1}\Delta y_{t-1} & & \\ \vdots & & \frac{1}{T}\tilde X'\tilde X \end{pmatrix}, $$
where $\tilde X$ collects the lagged differences $\Delta y_{t-j}$. Note that $\frac{1}{T^{3/2}}\sum y_{t-1}\Delta y_{t-j} \xrightarrow{p} 0$ (since we proved that $\frac{1}{T}\sum y_{t-1}\Delta y_{t-j} \Rightarrow \frac{1}{2}(\omega^2 W(1)^2 - \sigma_v^2)$). Also, we know that $\frac{1}{T^2}\sum y_{t-1}^2 \Rightarrow \omega^2 \int W^2\,dt$, so
$$ Q^{-1} X'X Q^{-1} \Rightarrow \begin{pmatrix} \omega^2 \int W^2\,dt & 0 \\ 0 & E\tilde X'\tilde X \end{pmatrix}. $$
Similarly,
$$ Q^{-1} X'\epsilon = \begin{pmatrix} \frac{1}{T}\sum y_{t-1}\epsilon_t \\ \frac{1}{\sqrt{T}}\sum \Delta y_{t-j}\epsilon_t \end{pmatrix} \Rightarrow \begin{pmatrix} \sigma\omega \int W\,dW \\ N\!\left(0, E[\tilde X'\tilde X]\sigma^2\right) \end{pmatrix}. $$
Thus,
$$ T(\hat\rho - 1) \Rightarrow \frac{\sigma}{\omega} \cdot \frac{\int W\,dW}{\int W^2\,dt}, $$
or, since $\hat\omega/\hat\sigma = 1/(1 - \sum_j \hat\beta_j)$,
$$ \frac{\hat\omega}{\hat\sigma}\, T(\hat\rho - 1) = \frac{T(\hat\rho - 1)}{1 - \sum_j \hat\beta_j} \Rightarrow \frac{\int W\,dW}{\int W^2\,dt}. $$

Other Tests

Sargan-Bhargava (1983). For a non-stationary process, the variance of $y_t$ grows with time, so we could look at
$$ \frac{\frac{1}{T^2}\sum y_{t-1}^2}{\frac{1}{T}\sum \Delta y_t^2}. $$
When $\epsilon_t$ is an m.d.s., this converges in distribution:
$$ \frac{\frac{1}{T^2}\sum y_{t-1}^2}{\frac{1}{T}\sum \Delta y_t^2} \Rightarrow \int W^2\,dt. $$
For a stationary process, it converges to $0$ in probability.

Range (Mandelbrot and coauthors). One can also look at the range of $y_t$, which should blow up for non-stationary processes. For a random walk, we have
$$ \frac{\frac{1}{\sqrt{T}}\left(\max_t y_t - \min_t y_t\right)}{\sqrt{\frac{1}{T}\sum \Delta y_t^2}} \Rightarrow \sup_\lambda W(\lambda) - \inf_\lambda W(\lambda). $$

Test Comparison

The Sargan-Bhargava (1983) test has optimal asymptotic power against local alternatives, but it suffers from bad size distortions in small samples, especially if the errors are negatively autocorrelated. The augmented Dickey-Fuller test, with the BIC used to choose the lag length, has overall good size in finite samples.
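Of the tests above, the augmented Dickey-Fuller regression is straightforward to run by hand. Below is a minimal sketch (Python with numpy; the helper name `adf_regression`, the lag length $p = 2$, and the AR coefficient $0.5$ in $b(L) = 1 - 0.5L$ are our illustrative choices) estimating $y_t = \rho y_{t-1} + \beta_1 \Delta y_{t-1} + \epsilon_t$ on simulated unit-root data:

```python
import numpy as np

rng = np.random.default_rng(3)

def adf_regression(y, p):
    """OLS for y_t = rho*y_{t-1} + sum_{j=1}^{p-1} beta_j*dy_{t-j} + e_t."""
    d = np.diff(y)                                        # d[k] = y[k+1] - y[k]
    T = len(y)
    X = np.column_stack(
        [y[p - 1:T - 1]]                                  # y_{t-1}
        + [d[p - 1 - j:T - 1 - j] for j in range(1, p)]   # dy_{t-1},...,dy_{t-p+1}
    )
    return np.linalg.lstsq(X, y[p:], rcond=None)[0]       # [rho, beta_1, ...]

# Simulate a unit root with b(L) = 1 - 0.5L, i.e. dy_t = 0.5*dy_{t-1} + e_t
T = 2000
e = rng.standard_normal(T)
dy = np.empty(T)
dy[0] = e[0]
for t in range(1, T):
    dy[t] = 0.5 * dy[t - 1] + e[t]
y = np.cumsum(dy)

rho_hat, beta1_hat = adf_regression(y, p=2)
# Pivotal statistic from the notes: T*(rho_hat - 1)/(1 - sum of beta_hat)
stat = T * (rho_hat - 1.0) / (1.0 - beta1_hat)
print(rho_hat, beta1_hat, stat)
```

As the derivation predicts, $\hat\beta_1$ lands near its true value $0.5$ at the usual $\sqrt{T}$ rate, while $\hat\rho$ is superconsistent, sitting within $O(1/T)$ of $1$.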
MIT OpenCourseWare http://ocw.mit.edu 14.384 Time Series Analysis Fall 2013 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.