Time Series Analysis in Python with statsmodels Wes McKinney1 Josef Perktold2 Skipper Seabold3 1 Department of Statistical Science Duke University 2 Department of Economics University of North Carolina at Chapel Hill 3 Department of Economics American University 10th Python in Science Conference, 13 July 2011 McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 1 / 29 What is statsmodels? A library for statistical modeling, implementing standard statistical models in Python using NumPy and SciPy Includes: Linear (regression) models of many forms Descriptive statistics Statistical tests Time series analysis ...and much more McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 2 / 29 What is Time Series Analysis? Statistical modeling of time-ordered data observations Inferring structure, forecasting and simulation, and testing distributional assumptions about the data Modeling dynamic relationships among multiple time series Broad applications e.g. in economics, finance, neuroscience, signal processing... McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 3 / 29 Talk Overview Brief update on statsmodels development Aside: user interface and data structures Descriptive statistics and tests Auto-regressive moving average models (ARMA) Vector autoregression (VAR) models Filtering tools (Hodrick-Prescott and others) Near future: Bayesian dynamic linear models (DLMs), ARCH / GARCH volatility models and beyond McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 4 / 29 Statsmodels development update We’re now on GitHub! Join us: http://github.com/statsmodels/statsmodels Check out the slick Sphinx docs: http://statsmodels.sourceforge.net Development focus has been largely computational, i.e. writing correct, tested implementations of all the common classes of statistical models McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 5 / 29 Statsmodels development update Major work to be done on providing a nice integrated user interface We must work together to close the gap between R and Python! Some important areas: Formula framework, for specifying model design matrices Need integrated rich statistical data structures (pandas) Data visualization of results should always be a few keystrokes away Write a “Statsmodels for R users” guide McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 6 / 29 Aside: statistical data structures and user interface While I have a captive audience... Controversial fact: pandas is the only Python library currently providing data structures matching (and in many places exceeding) the richness of R’s data structures (for statistics) Let’s have a BoF session so I can justify this statement Feedback I hear is that end users find the fragmented, incohesive set of Python tools for data analysis and statistics to be confusing, frustrating, and certainly not compelling them to use Python... (Not to mention the packaging headaches) McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 7 / 29 Aside: statistical data structures and user interface We need to “commit” ASAP (not 12 months from now) to a high level data structure(s) as the “primary data structure(s) for statistical data analysis” and communicate that clearly to end users Or we might as well all start programming in R... McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 8 / 29 Example data: EEG trace data 300 200 100 0 100 200 300 400 500 600 0 500 0 100 McKinney, Perktold, Seabold (statsmodels) 0 150 0 200 250 Python Time Series Analysis 0 0 300 0 350 400 0 SciPy Conference 2011 9 / 29 Example data: Macroeconomic data 5.5 5.0 cpi 4.5 4.0 3.5 3.0 7.5 7.0 m1 6.5 6.0 5.5 5.0 4.5 9.5 realgdp 9.0 8.5 8.0 2 2 8 8 0 0 6 6 4 4 8 0 4 196 196 196 197 197 198 198 198 199 199 200 200 200 McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 10 / 29 Example data: Stock data 800 700 600 AAPL GOOG MSFT YHOO 500 400 300 200 100 0 1 200 2 200 McKinney, Perktold, Seabold (statsmodels) 3 200 200 4 200 5 6 200 Python Time Series Analysis 200 7 8 200 9 200 SciPy Conference 2011 11 / 29 Descriptive statistics Autocorrelation, partial autocorrelation plots Commonly used for identification in ARMA(p,q) and ARIMA(p,d,q) models acf = tsa . acf ( eeg , 50) pacf = tsa . pacf ( eeg , 50) Autocorrelation 1.0 0.5 0.5 0.0 0.0 0.5 0.5 1.00 10 20 30 McKinney, Perktold, Seabold (statsmodels) Partial Autocorrelation 1.0 40 50 1.00 Python Time Series Analysis 10 20 30 40 SciPy Conference 2011 50 12 / 29 Statistical tests Ljung-Box test for zero autocorrelation Unit root test for cointegration (Augmented Dickey-Fuller test) Granger-causality Whiteness (iid-ness) and normality See our conference paper (when the proceedings get published!) McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 13 / 29 Autoregressive moving average (ARMA) models One of most common univariate time series models: yt = µ + a1 yt−1 + ... + ak yt−p + t + b1 t−1 + ... + bq t−q where E (t , s ) = 0, for t 6= s and t ∼ N (0, σ 2 ) Exact log-likelihood can be evaluated via the Kalman filter, but the “conditional” likelihood is easier and commonly used statsmodels has tools for simulating ARMA processes with known coefficients ai , bi and also estimation given specified lag orders import scikits.statsmodels.tsa.arima_process as ap ar_coef = [1, .75, -.25]; ma_coef = [1, -.5] nobs = 100 y = ap.arma_generate_sample(ar_coef, ma_coef, nobs) y += 4 # add in constant McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 14 / 29 ARMA Estimation Several likelihood-based estimators implemented (see docs) model = tsa.ARMA(y) result = model.fit(order=(2, 1), trend=’c’, method=’css-mle’, disp=-1) result.params # array([ 3.97, -0.97, -0.05, -0.13]) Standard model diagnostics, standard errors, information criteria (AIC, BIC, ...), etc available in the returned ARMAResults object McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 15 / 29 Vector Autoregression (VAR) models Widely used model for modeling multiple (K -variate) time series, especially in macroeconomics: Yt = A1 Yt−1 + . . . + Ap Yt−p + t , t ∼ N (0, Σ) Matrices Ai are K × K . Yt must be a stationary process (sometimes achieved by differencing). Related class of models (VECM) for modeling nonstationary (including cointegrated) processes McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 16 / 29 Vector Autoregression (VAR) models >>> model = VAR(data); model.select_order(8) VAR Order Selection ===================================================== aic bic fpe hqic ----------------------------------------------------0 -27.83 -27.78 8.214e-13 -27.81 1 -28.77 -28.57 3.189e-13 -28.69 2 -29.00 -28.64* 2.556e-13 -28.85 3 -29.10 -28.60 2.304e-13 -28.90* 4 -29.09 -28.43 2.330e-13 -28.82 5 -29.13 -28.33 2.228e-13 -28.81 6 -29.14* -28.18 2.213e-13* -28.75 7 -29.07 -27.96 2.387e-13 -28.62 ===================================================== * Minimum McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 17 / 29 Vector Autoregression (VAR) models >>> result = model.fit(2) >>> result.summary() # print summary for each variable <snip> Results for equation m1 ==================================================== coefficient std. error t-stat prob ---------------------------------------------------const 0.004968 0.001850 2.685 0.008 L1.m1 0.363636 0.071307 5.100 0.000 L1.realgdp -0.077460 0.092975 -0.833 0.406 L1.cpi -0.052387 0.128161 -0.409 0.683 L2.m1 0.250589 0.072050 3.478 0.001 L2.realgdp -0.085874 0.092032 -0.933 0.352 L2.cpi 0.169803 0.128376 1.323 0.188 ==================================================== <snip> McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 18 / 29 Vector Autoregression (VAR) models >>> result = model.fit(2) >>> result.summary() # print summary for each variable <snip> Correlation matrix of residuals m1 realgdp cpi m1 1.000000 -0.055690 -0.297494 realgdp -0.055690 1.000000 0.115597 cpi -0.297494 0.115597 1.000000 McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 19 / 29 VAR: Impulse Response analysis Analyze systematic impact of unit “shock” to a single variable irf = result.irf(10) irf.plot() 1.0 0.8 0.6 0.4 0.2 0.0 0.20 0.20 0.15 0.10 0.05 0.00 0.05 0.10 0.150 0.20 0.15 0.10 0.05 0.00 0.05 0.100 Impulse responses m1 → m1 2 6 m14→ realgdp 8 2 4 → cpi6 m1 8 2 4 8 6 McKinney, Perktold, Seabold (statsmodels) 0.2 0.1 0.0 0.1 0.2 0.3 0.4 10 0 1.0 0.8 0.6 0.4 0.2 0.0 10 0.20 0.15 0.10 0.05 0.00 0.05 0.10 10 0.150 realgdp → m1 4 → realgdp 2 realgdp 6 8 2 2 4 →6cpi realgdp 4 6 8 8 0.4 0.3 0.2 0.1 0.0 0.1 0.2 0.3 10 0.40 0.2 0.1 0.0 0.1 0.2 0.3 0.4 10 0 1.0 0.8 0.6 0.4 0.2 10 0.00 Python Time Series Analysis cpi → m1 2 6 cpi4→ realgdp 8 10 2 4cpi → cpi6 8 10 2 4 8 10 6 SciPy Conference 2011 20 / 29 VAR: Forecast Error Variance Decomposition Analyze contribution of each variable to forecasting error fevd = result.fevd(20) fevd.plot() Forecast error variance decomposition (FEVD) 1.0 0.8 0.6 0.4 0.2 0.00 1.2 1.0 0.8 0.6 0.4 0.2 0.00 1.2 1.0 0.8 0.6 0.4 0.2 0.00 McKinney, Perktold, Seabold (statsmodels) m1 realgdp cpi m1 5 10 realgdp 15 20 5 10 cpi 15 20 5 10 15 20 Python Time Series Analysis SciPy Conference 2011 21 / 29 VAR: Statistical tests In [137]: result.test_causality(’m1’, [’cpi’, ’realgdp’]) Granger causality f-test ========================================================= Test statistic Critical Value p-value df --------------------------------------------------------1.248787 2.387325 0.289 (4, 579) ========================================================= H_0: [’cpi’, ’realgdp’] do not Granger-cause m1 Conclusion: fail to reject H_0 at 5.00% significance level McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 22 / 29 Filtering Hodrick-Prescott (HP) filter separates a time series yt into a trend τt and a cyclical component ζt , so that yt = τt + ζt . 14 Inflation Cyclical component Trend component 12 10 8 6 4 2 0 2 4 2 196 6 196 0 197 4 197 McKinney, Perktold, Seabold (statsmodels) 8 197 2 198 6 198 0 199 Python Time Series Analysis 4 199 8 199 2 200 6 200 SciPy Conference 2011 23 / 29 Filtering In addition to the HP filter, 2 other filters popular in finance and economics, Baxter-King and Christiano-Fitzgerald, are available We refer you to our paper and the documentation for details on these: Inflation and Unemployment: CF Filtered Inflation and Unemployment: BK Filtered Python Time Series Analysis 08 20 98 03 20 93 19 19 83 88 19 73 78 19 19 19 63 68 19 06 20 20 19 19 19 19 19 19 19 96 4 01 4 91 2 81 2 86 0 76 0 71 2 66 2 McKinney, Perktold, Seabold (statsmodels) INFL UNEMP 4 19 INFL UNEMP 4 SciPy Conference 2011 24 / 29 Preview: Bayesian dynamic linear models (DLM) A state space model by another name: yt = Ft0 θt + νt , θt = G θt−1 + ωt , νt ∼ N (0, Vt ) ωt ∼ N (0, Wt ) Estimation of basic model by Kalman filter recursions. Provides elegant way to do time-varying linear regressions for forecasting Extensions: multivariate DLMs, stochastic volatility (SV) models, MCMC-based posterior sampling, mixtures of DLMs McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 25 / 29 Preview: DLM Example (Constant+Trend model) model = Polynomial(2) dlm = DLM(close_px[’AAPL’], model.F, G=model.G, # model m0=m0, C0=C0, n0=n0, s0=s0, # priors state_discount=.95) # discount factor Constant + Trend DLM 200 150 100 50 8 200 Nov Jan McKinney, Perktold, Seabold (statsmodels) 9 200 Mar 9 200 2 May 009 009 Jul 2 Python Time Series Analysis Sep 200 9 Nov 200 9 SciPy Conference 2011 26 / 29 Preview: Stochastic volatility models JPY-USD Exchange Rate Volatility Process 1.6 1.4 1.2 1.0 0.8 0.6 0.4 0.20 200 McKinney, Perktold, Seabold (statsmodels) 400 600 Python Time Series Analysis 800 1000 SciPy Conference 2011 27 / 29 Future: sandbox and beyond ARCH / GARCH models for volatility Structural VAR and error correction models (ECM) for cointegrated processes Models with non-normally distributed errors Better data description, visualization, and interactive research tools More sophisticated Bayesian time series models McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 28 / 29 Conclusions We’ve implemented many foundational models for time series analysis, but the field is very broad User interface can and should be much improved Repo: http://github.com/statsmodels/statsmodels Docs: http://statsmodels.sourceforge.net Contact: pystatsmodels@googlegroups.com McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 29 / 29