Time Series Analysis in Python with statsmodels Wes McKinney Josef Perktold Skipper Seabold

advertisement
Time Series Analysis in Python with statsmodels
Wes McKinney1
Josef Perktold2
Skipper Seabold3
1 Department
of Statistical Science
Duke University
2 Department of Economics
University of North Carolina at Chapel Hill
3 Department
of Economics
American University
10th Python in Science Conference, 13 July 2011
McKinney, Perktold, Seabold (statsmodels)
Python Time Series Analysis
SciPy Conference 2011
1 / 29
What is statsmodels?
A library for statistical modeling, implementing standard statistical
models in Python using NumPy and SciPy
Includes:
Linear (regression) models of many forms
Descriptive statistics
Statistical tests
Time series analysis
...and much more
McKinney, Perktold, Seabold (statsmodels)
Python Time Series Analysis
SciPy Conference 2011
2 / 29
What is Time Series Analysis?
Statistical modeling of time-ordered data observations
Inferring structure, forecasting and simulation, and testing
distributional assumptions about the data
Modeling dynamic relationships among multiple time series
Broad applications e.g. in economics, finance, neuroscience, signal
processing...
McKinney, Perktold, Seabold (statsmodels)
Python Time Series Analysis
SciPy Conference 2011
3 / 29
Talk Overview
Brief update on statsmodels development
Aside: user interface and data structures
Descriptive statistics and tests
Auto-regressive moving average models (ARMA)
Vector autoregression (VAR) models
Filtering tools (Hodrick-Prescott and others)
Near future: Bayesian dynamic linear models (DLMs), ARCH /
GARCH volatility models and beyond
McKinney, Perktold, Seabold (statsmodels)
Python Time Series Analysis
SciPy Conference 2011
4 / 29
Statsmodels development update
We’re now on GitHub! Join us:
http://github.com/statsmodels/statsmodels
Check out the slick Sphinx docs:
http://statsmodels.sourceforge.net
Development focus has been largely computational, i.e. writing
correct, tested implementations of all the common classes of
statistical models
McKinney, Perktold, Seabold (statsmodels)
Python Time Series Analysis
SciPy Conference 2011
5 / 29
Statsmodels development update
Major work to be done on providing a nice integrated user interface
We must work together to close the gap between R and Python!
Some important areas:
Formula framework, for specifying model design matrices
Need integrated rich statistical data structures (pandas)
Data visualization of results should always be a few keystrokes away
Write a “Statsmodels for R users” guide
McKinney, Perktold, Seabold (statsmodels)
Python Time Series Analysis
SciPy Conference 2011
6 / 29
Aside: statistical data structures and user interface
While I have a captive audience...
Controversial fact: pandas is the only Python library currently
providing data structures matching (and in many places exceeding)
the richness of R’s data structures (for statistics)
Let’s have a BoF session so I can justify this statement
Feedback I hear is that end users find the fragmented, incohesive set
of Python tools for data analysis and statistics to be confusing,
frustrating, and certainly not compelling them to use Python...
(Not to mention the packaging headaches)
McKinney, Perktold, Seabold (statsmodels)
Python Time Series Analysis
SciPy Conference 2011
7 / 29
Aside: statistical data structures and user interface
We need to “commit” ASAP (not 12 months from now) to a high
level data structure(s) as the “primary data structure(s) for statistical
data analysis” and communicate that clearly to end users
Or we might as well all start programming in R...
McKinney, Perktold, Seabold (statsmodels)
Python Time Series Analysis
SciPy Conference 2011
8 / 29
Example data: EEG trace data
300
200
100
0
100
200
300
400
500
600
0
500
0
100
McKinney, Perktold, Seabold (statsmodels)
0
150
0
200
250
Python Time Series Analysis
0
0
300
0
350
400
0
SciPy Conference 2011
9 / 29
Example data: Macroeconomic data
5.5
5.0
cpi
4.5
4.0
3.5
3.0
7.5
7.0
m1
6.5
6.0
5.5
5.0
4.5
9.5
realgdp
9.0
8.5
8.0
2
2
8
8
0
0
6
6
4
4
8
0
4
196 196 196 197 197 198 198 198 199 199 200 200 200
McKinney, Perktold, Seabold (statsmodels)
Python Time Series Analysis
SciPy Conference 2011
10 / 29
Example data: Stock data
800
700
600
AAPL
GOOG
MSFT
YHOO
500
400
300
200
100
0
1
200
2
200
McKinney, Perktold, Seabold (statsmodels)
3
200
200
4
200
5
6
200
Python Time Series Analysis
200
7
8
200
9
200
SciPy Conference 2011
11 / 29
Descriptive statistics
Autocorrelation, partial autocorrelation plots
Commonly used for identification in ARMA(p,q) and ARIMA(p,d,q)
models
acf = tsa . acf ( eeg , 50)
pacf = tsa . pacf ( eeg , 50)
Autocorrelation
1.0
0.5
0.5
0.0
0.0
0.5
0.5
1.00
10
20
30
McKinney, Perktold, Seabold (statsmodels)
Partial Autocorrelation
1.0
40
50
1.00
Python Time Series Analysis
10
20
30
40
SciPy Conference 2011
50
12 / 29
Statistical tests
Ljung-Box test for zero autocorrelation
Unit root test for cointegration (Augmented Dickey-Fuller test)
Granger-causality
Whiteness (iid-ness) and normality
See our conference paper (when the proceedings get published!)
McKinney, Perktold, Seabold (statsmodels)
Python Time Series Analysis
SciPy Conference 2011
13 / 29
Autoregressive moving average (ARMA) models
One of most common univariate time series models:
yt = µ + a1 yt−1 + ... + ak yt−p + t + b1 t−1 + ... + bq t−q
where E (t , s ) = 0, for t 6= s and t ∼ N (0, σ 2 )
Exact log-likelihood can be evaluated via the Kalman filter, but the
“conditional” likelihood is easier and commonly used
statsmodels has tools for simulating ARMA processes with known
coefficients ai , bi and also estimation given specified lag orders
import scikits.statsmodels.tsa.arima_process as ap
ar_coef = [1, .75, -.25]; ma_coef = [1, -.5]
nobs = 100
y = ap.arma_generate_sample(ar_coef, ma_coef, nobs)
y += 4 # add in constant
McKinney, Perktold, Seabold (statsmodels)
Python Time Series Analysis
SciPy Conference 2011
14 / 29
ARMA Estimation
Several likelihood-based estimators implemented (see docs)
model = tsa.ARMA(y)
result = model.fit(order=(2, 1), trend=’c’,
method=’css-mle’, disp=-1)
result.params
# array([ 3.97, -0.97, -0.05, -0.13])
Standard model diagnostics, standard errors, information criteria
(AIC, BIC, ...), etc available in the returned ARMAResults object
McKinney, Perktold, Seabold (statsmodels)
Python Time Series Analysis
SciPy Conference 2011
15 / 29
Vector Autoregression (VAR) models
Widely used model for modeling multiple (K -variate) time series,
especially in macroeconomics:
Yt = A1 Yt−1 + . . . + Ap Yt−p + t , t ∼ N (0, Σ)
Matrices Ai are K × K .
Yt must be a stationary process (sometimes achieved by
differencing). Related class of models (VECM) for modeling
nonstationary (including cointegrated) processes
McKinney, Perktold, Seabold (statsmodels)
Python Time Series Analysis
SciPy Conference 2011
16 / 29
Vector Autoregression (VAR) models
>>> model = VAR(data); model.select_order(8)
VAR Order Selection
=====================================================
aic
bic
fpe
hqic
----------------------------------------------------0
-27.83
-27.78
8.214e-13
-27.81
1
-28.77
-28.57
3.189e-13
-28.69
2
-29.00
-28.64*
2.556e-13
-28.85
3
-29.10
-28.60
2.304e-13
-28.90*
4
-29.09
-28.43
2.330e-13
-28.82
5
-29.13
-28.33
2.228e-13
-28.81
6
-29.14*
-28.18
2.213e-13*
-28.75
7
-29.07
-27.96
2.387e-13
-28.62
=====================================================
* Minimum
McKinney, Perktold, Seabold (statsmodels)
Python Time Series Analysis
SciPy Conference 2011
17 / 29
Vector Autoregression (VAR) models
>>> result = model.fit(2)
>>> result.summary() # print summary for each variable
<snip>
Results for equation m1
====================================================
coefficient
std. error t-stat
prob
---------------------------------------------------const
0.004968
0.001850
2.685 0.008
L1.m1
0.363636
0.071307
5.100 0.000
L1.realgdp
-0.077460
0.092975 -0.833 0.406
L1.cpi
-0.052387
0.128161 -0.409 0.683
L2.m1
0.250589
0.072050
3.478 0.001
L2.realgdp
-0.085874
0.092032 -0.933 0.352
L2.cpi
0.169803
0.128376
1.323 0.188
====================================================
<snip>
McKinney, Perktold, Seabold (statsmodels)
Python Time Series Analysis
SciPy Conference 2011
18 / 29
Vector Autoregression (VAR) models
>>> result = model.fit(2)
>>> result.summary() # print summary for each variable
<snip>
Correlation matrix of residuals
m1
realgdp
cpi
m1
1.000000 -0.055690 -0.297494
realgdp
-0.055690 1.000000 0.115597
cpi
-0.297494 0.115597 1.000000
McKinney, Perktold, Seabold (statsmodels)
Python Time Series Analysis
SciPy Conference 2011
19 / 29
VAR: Impulse Response analysis
Analyze systematic impact of unit “shock” to a single variable
irf = result.irf(10)
irf.plot()
1.0
0.8
0.6
0.4
0.2
0.0
0.20
0.20
0.15
0.10
0.05
0.00
0.05
0.10
0.150
0.20
0.15
0.10
0.05
0.00
0.05
0.100
Impulse responses
m1 → m1
2
6
m14→ realgdp
8
2
4 → cpi6
m1
8
2
4
8
6
McKinney, Perktold, Seabold (statsmodels)
0.2
0.1
0.0
0.1
0.2
0.3
0.4
10
0
1.0
0.8
0.6
0.4
0.2
0.0
10 0.20
0.15
0.10
0.05
0.00
0.05
0.10
10 0.150
realgdp → m1
4 → realgdp
2 realgdp
6
8
2
2
4 →6cpi
realgdp
4
6
8
8
0.4
0.3
0.2
0.1
0.0
0.1
0.2
0.3
10 0.40
0.2
0.1
0.0
0.1
0.2
0.3
0.4
10
0
1.0
0.8
0.6
0.4
0.2
10 0.00
Python Time Series Analysis
cpi → m1
2
6
cpi4→ realgdp
8
10
2
4cpi → cpi6
8
10
2
4
8
10
6
SciPy Conference 2011
20 / 29
VAR: Forecast Error Variance Decomposition
Analyze contribution of each variable to forecasting error
fevd = result.fevd(20)
fevd.plot()
Forecast error variance decomposition (FEVD)
1.0
0.8
0.6
0.4
0.2
0.00
1.2
1.0
0.8
0.6
0.4
0.2
0.00
1.2
1.0
0.8
0.6
0.4
0.2
0.00
McKinney, Perktold, Seabold (statsmodels)
m1
realgdp
cpi
m1
5
10
realgdp
15
20
5
10
cpi
15
20
5
10
15
20
Python Time Series Analysis
SciPy Conference 2011
21 / 29
VAR: Statistical tests
In [137]: result.test_causality(’m1’, [’cpi’, ’realgdp’])
Granger causality f-test
=========================================================
Test statistic
Critical Value
p-value
df
--------------------------------------------------------1.248787
2.387325
0.289 (4, 579)
=========================================================
H_0: [’cpi’, ’realgdp’] do not Granger-cause m1
Conclusion: fail to reject H_0 at 5.00% significance level
McKinney, Perktold, Seabold (statsmodels)
Python Time Series Analysis
SciPy Conference 2011
22 / 29
Filtering
Hodrick-Prescott (HP) filter separates a time series yt into a trend τt
and a cyclical component ζt , so that yt = τt + ζt .
14
Inflation
Cyclical component
Trend component
12
10
8
6
4
2
0
2
4
2
196
6
196
0
197
4
197
McKinney, Perktold, Seabold (statsmodels)
8
197
2
198
6
198
0
199
Python Time Series Analysis
4
199
8
199
2
200
6
200
SciPy Conference 2011
23 / 29
Filtering
In addition to the HP filter, 2 other filters popular in finance and
economics, Baxter-King and Christiano-Fitzgerald, are available
We refer you to our paper and the documentation for details on these:
Inflation and Unemployment: CF Filtered
Inflation and Unemployment: BK Filtered
Python Time Series Analysis
08
20
98
03
20
93
19
19
83
88
19
73
78
19
19
19
63
68
19
06
20
20
19
19
19
19
19
19
19
96
4
01
4
91
2
81
2
86
0
76
0
71
2
66
2
McKinney, Perktold, Seabold (statsmodels)
INFL
UNEMP
4
19
INFL
UNEMP
4
SciPy Conference 2011
24 / 29
Preview: Bayesian dynamic linear models (DLM)
A state space model by another name:
yt = Ft0 θt + νt ,
θt = G θt−1 + ωt ,
νt ∼ N (0, Vt )
ωt ∼ N (0, Wt )
Estimation of basic model by Kalman filter recursions. Provides
elegant way to do time-varying linear regressions for forecasting
Extensions: multivariate DLMs, stochastic volatility (SV) models,
MCMC-based posterior sampling, mixtures of DLMs
McKinney, Perktold, Seabold (statsmodels)
Python Time Series Analysis
SciPy Conference 2011
25 / 29
Preview: DLM Example (Constant+Trend model)
model = Polynomial(2)
dlm = DLM(close_px[’AAPL’], model.F, G=model.G, # model
m0=m0, C0=C0, n0=n0, s0=s0, # priors
state_discount=.95) # discount factor
Constant + Trend DLM
200
150
100
50
8
200
Nov
Jan
McKinney, Perktold, Seabold (statsmodels)
9
200
Mar
9
200
2
May
009
009
Jul 2
Python Time Series Analysis
Sep
200
9
Nov
200
9
SciPy Conference 2011
26 / 29
Preview: Stochastic volatility models
JPY-USD Exchange Rate Volatility Process
1.6
1.4
1.2
1.0
0.8
0.6
0.4
0.20
200
McKinney, Perktold, Seabold (statsmodels)
400
600
Python Time Series Analysis
800
1000
SciPy Conference 2011
27 / 29
Future: sandbox and beyond
ARCH / GARCH models for volatility
Structural VAR and error correction models (ECM) for cointegrated
processes
Models with non-normally distributed errors
Better data description, visualization, and interactive research tools
More sophisticated Bayesian time series models
McKinney, Perktold, Seabold (statsmodels)
Python Time Series Analysis
SciPy Conference 2011
28 / 29
Conclusions
We’ve implemented many foundational models for time series
analysis, but the field is very broad
User interface can and should be much improved
Repo: http://github.com/statsmodels/statsmodels
Docs: http://statsmodels.sourceforge.net
Contact: pystatsmodels@googlegroups.com
McKinney, Perktold, Seabold (statsmodels)
Python Time Series Analysis
SciPy Conference 2011
29 / 29
Download