Spurious Common Factors
Bettina Becker a,*, Stephen G. Hall b
a School of Business and Economics, Loughborough University, Loughborough, LE11 3TU, United Kingdom. E-mail address: b.becker@lboro.ac.uk
b Department of Economics, University of Leicester, University Road, Leicester, LE1 7RH, United Kingdom. E-mail address: s.g.hall@le.ac.uk
Abstract
We conduct Monte Carlo simulations of principal components analyses of unrelated time
series in order to investigate whether the stationarity properties of the data matter, as they do
for least-squares regression analysis. We find that for stationary series the results are standard
and reflect the lack of a relationship. For non-stationary series, however, spurious common factors may persist in large samples.
Keywords:
Common factor analysis; Principal components; Spurious regression; Non-stationary data
JEL Codes:
C2; C5; C8
_________
* Corresponding author. Tel.: +44 (0)1509 222719, fax: +44 (0)1509 223910
1. Introduction
Common factor analysis with principal components (PCs) is frequently used for data
reduction of economic time series. This methodology has been applied to stationary as well
as non-stationary series and has typically appeared to be effective for the analysis of either.
Principal components analyses (PCAs) of non-stationary series have been carried out on the
assumption that, in contrast with least-squares regression analysis, the non-stationarity would
not matter. We conduct Monte Carlo simulations of PCAs of unrelated stationary versus non-stationary series in order to investigate whether this is in fact the case.
2. Theoretical background
2.1 The spurious regression problem of unrelated integrated series
Much conventional asymptotic theory for ordinary least-squares estimators relies on stationarity of the explanatory variables. Many economic series are non-stationary, however, and often stationarity is not even a reasonable approximation. In the terminology of Granger and Newbold (1974), applying regression methods developed for stationary data to non-stationary data gives rise to the spurious regression problem: Regression of two completely unrelated but integrated series on each other will tend to produce an apparently significant relationship.¹ Banerjee et al. (1993, ch. 3, Figs. 3.7, 3.8) consider Yule's (1926) early observations on the problem: When two variables are each I(0) and mean-zero identically and independently distributed, their correlation has a symmetric, nearly Gaussian, frequency distribution very narrowly centred on the zero mean, while bounded by ±1. When both variables are I(1), the distribution is similar to a semi-ellipse with excess frequency at both ends: Any correlation value between ±0.75 has a similar density, and correlations of ±0.9 still have a density of above 0.01. Hence standard interpretation of regression results for non-stationary data can be very misleading.

¹ Phillips (1986) provides a precise characterisation of some of the analytical regression results for integrated processes.
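As a minimal illustration of Yule's observation (an editorial sketch, not part of the original analysis; it assumes numpy is available, and the function name correlation_draws is not from the paper), the following Python snippet simulates the sample correlation between two unrelated series and contrasts the I(0) and I(1) cases:

import numpy as np

def correlation_draws(n_reps=10_000, T=100, integrated=False, seed=0):
    """Sample correlation between two unrelated series, drawn n_reps times."""
    rng = np.random.default_rng(seed)
    corrs = np.empty(n_reps)
    for r in range(n_reps):
        u = rng.standard_normal((T, 2))                 # two unrelated shock series
        x = np.cumsum(u, axis=0) if integrated else u   # I(1) random walks or I(0) noise
        corrs[r] = np.corrcoef(x[:, 0], x[:, 1])[0, 1]
    return corrs

# I(0) case: correlations cluster tightly around zero.
print(np.percentile(correlation_draws(integrated=False), [5, 50, 95]))
# I(1) case: correlations spread over much of the interval between -1 and 1.
print(np.percentile(correlation_draws(integrated=True), [5, 50, 95]))

The I(0) percentiles sit very close to zero, whereas the I(1) percentiles spread out in the way described above.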
2.2 Principal components analysis of unrelated series
In PCA, we can calculate a set of series as the PCs of N≥2 variables xit (i=1,...,N; t=1,...,T). First we construct the correlation matrix X'X of the standardised matrix X, which we then diagonalise as A'X'XA = Θ, where A is the matrix of orthogonal eigenvectors and Θ is the N×N diagonal matrix of eigenvalues. We can then define W=XA as the T×N matrix of PCs, whereby each column of W is a T×1 vector of observations for one PC. Each eigenvalue gives the proportion of the total variance of X explained by the corresponding PC, i.e. the R2 of this PC. Thus if the xit are unrelated, each PC will have equal explanatory power with an R2 of 1/N. If the xit are related to some degree, i.e. there is co-movement between them, the first PC has the highest R2, and the higher this is, the more closely the xit are related. Hence if PCA follows conventional asymptotic theory, then in Monte Carlo simulations of PCAs of unrelated series the R2 of the first PC should converge to the true value 1/N as T→∞.
If the xit are non-stationary, however, then X'X does not exist asymptotically, as it does not converge on any value. Hence the asymptotic analysis of PCA breaks down. This is exactly the same problem as for least-squares regression.
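To make the mapping from eigenvalues to the R2 of the first PC concrete, the following Python sketch (illustrative only, assuming numpy; the function name first_pc_r2 is not from the paper) standardises the data, forms the correlation matrix, diagonalises it, and reports the largest eigenvalue as a share of the total:

import numpy as np

def first_pc_r2(x):
    """R2 of the first principal component of a T x N data matrix x."""
    z = (x - x.mean(axis=0)) / x.std(axis=0)    # standardised matrix X
    corr = z.T @ z / z.shape[0]                 # correlation matrix of the standardised data
    eigenvalues = np.linalg.eigvalsh(corr)      # real eigenvalues, sorted in ascending order
    return eigenvalues[-1] / eigenvalues.sum()  # share of total variance explained by first PC

# For unrelated stationary series the result should be close to 1/N (here 1/2).
rng = np.random.default_rng(0)
print(first_pc_r2(rng.standard_normal((1000, 2))))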
3. Monte Carlo analysis
3.1 PCA simulations for stationary data
We generate the stationary {xit} using the following data generation process: xit = 0.5*xi,t-1 +
uit, where uit ~ IID N(0, 1) ∀ i, t; E(uit ujs) = 0 ∀ t, s, i≠j; E(uit ui,t-k) = 0 ∀ i, k ≠ 0; xi0 = 0 ∀ i.
That is, the {xit} are uncorrelated autoregressive processes with an autoregressive parameter
of 0.5. We then run Monte Carlo simulations of PCAs of the {xit} for various N and T, with
10,000 replications each. At each replication we record the R2 of the first PC. Table 1
summarises the results.
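A minimal Python sketch of such a simulation is given below (an editorial illustration, not the authors' code; it assumes numpy, the function names are hypothetical, it reuses the first_pc_r2 helper from the sketch in section 2.2, and the default number of replications is reduced relative to the 10,000 used in the paper for speed):

import numpy as np

def first_pc_r2(x):
    """R2 of the first PC, as in the sketch of section 2.2."""
    z = (x - x.mean(axis=0)) / x.std(axis=0)
    lam = np.linalg.eigvalsh(z.T @ z / z.shape[0])
    return lam[-1] / lam.sum()

def simulate_ar1_panel(N, T, rho, rng):
    """Generate N unrelated AR(1) series of length T, starting from x_{i0} = 0."""
    x = np.zeros((T + 1, N))
    u = rng.standard_normal((T + 1, N))   # i.i.d. N(0,1) shocks, uncorrelated across series
    for t in range(1, T + 1):
        x[t] = rho * x[t - 1] + u[t]
    return x[1:]                          # drop the initial zero observation

def monte_carlo_first_pc_r2(N, T, rho=0.5, n_reps=1_000, seed=0):
    """Mean, minimum and maximum of the first-PC R2 across replications."""
    rng = np.random.default_rng(seed)
    r2 = np.array([first_pc_r2(simulate_ar1_panel(N, T, rho, rng))
                   for _ in range(n_reps)])
    return r2.mean(), r2.min(), r2.max()

# For stationary series (rho = 0.5) the mean should approach the true value 1/N.
print(monte_carlo_first_pc_r2(N=2, T=1000, rho=0.5))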
Following section 2.2, for N=2 series we would expect the R2 to converge to 0.5, reflecting
the lack of a relation between the series. As Table 1 shows, this is clearly what we observe.
Moreover, the standard error falls greatly as T gets very large, and the range of the
distribution becomes very small. These results are confirmed as we increase N, and hold for
the standard time series case of N<T as well as for N≥T.
Fig. 1 illustrates that the R2 has a narrow bell-shaped distribution to the right of the true value of 0.5, which is the most likely outcome. Fig. 2 exemplifies that for N≥3 the approximately normal distribution is marginally skewed to the right of the true value, but this skewness disappears as T→∞.
Hence the simulation results are consistent with conventional asymptotic theory and confirm
that PCA of stationary series produces the standard expected outcomes.
3.2 PCA simulations for non-stationary data
The stationary case analysed in section 3.1 is based on an autoregressive parameter of 0.5.
Further results (not reported due to space constraints) show, as we would expect, that as this parameter increases towards 1, a larger T is required for any given N in order for the R2 of the first PC to converge to its true value, while all qualitative results continue to hold.
However, Monte Carlo simulations show that once the autoregressive parameter is equal to 1
and so the {xit} are uncorrelated random walks, PCA, similarly to least-squares regression,
becomes misleading. As reported in Table 2, the simulation mean of the R2 is no longer near
the true value for any given N. Furthermore the range of the distribution is wide in each case,
and it does not even include the true value as N gets large. These problems do not disappear
as T increases. Hence spurious common factors may persist in large samples: PCA of two
completely unrelated but integrated series will tend to suggest an apparent co-movement, i.e.
relationship, between them.
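In terms of the hypothetical sketch given in section 3.1, this case corresponds simply to setting the autoregressive parameter rho to one, for example:

# Unrelated random walks: rho = 1. Consistent with Table 2, the mean first-PC R2
# for N=2 stays well above the true value of 0.5 even for large T.
print(monte_carlo_first_pc_r2(N=2, T=1000, rho=1.0))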
Fig. 3 illustrates that the shape of the distribution to the right of the true value for two I(1) processes is strikingly similar to that of the correlation between two I(1) processes described in section 2.1: Almost any possible outcome is attained with similar frequency. This shape of the distribution is very similar across all T. Fig. 4 exemplifies that for N≥3 the approximately normal distribution is skewed so far to the right that it no longer includes the true value.
Hence if a test statistic based on the R2 of the first PC assumes the distribution to be that for the stationary case when in fact the correct distribution is that for the non-stationary case (e.g. Fig. 1 vs. Fig. 3, or Fig. 2 vs. Fig. 4), the rejection frequency will substantially exceed the nominal size of the test. There is thus a bias in favour of rejecting the Null hypothesis of no
co-movement even though it is true. This bias does not disappear with increasing T, nor does
the standard error fall, which is different from what would be the case under conventional
asymptotic theory. It is worth noting that the spurious common factor problem for integrated
processes is distinct from inferential problems that may appear for PCA of stationary
processes for small T, as in the stationary case the R2 does converge to its true value, while in
the non-stationary case it does not.
Hence the simulation results show that PCA of non-stationary series produces spurious
principal components that may persist in large samples.
4. Conclusions
We have conducted Monte Carlo simulations of PCAs of unrelated stationary versus non-stationary series. For stationary series PCA produces the standard outcomes: The R2 of the first PC converges to its true value of 1/N as T→∞, with a narrow distribution around this. For
non-stationary series these results break down and spurious common factors of unrelated
series may persist in large samples. Drawing inference from PCA of non-stationary series in
the standard fashion therefore may be misleading. We also provide the 95% confidence
intervals from which critical values for the statistically significant rejection of the Null
hypothesis may be calculated. Further resolution of the problem will in the first instance
require a similar analysis for processes of higher and of mixed orders of integration. We leave
this for future research.
References
Banerjee, A., Dolado, J., Galbraith, J.W., Hendry, D.F., 1993. Co-Integration, Error-Correction, and
the Econometric Analysis of Non-Stationary Data. Oxford University Press Inc., New York.
Granger, C.W.J., Newbold, P., 1974. Spurious regressions in econometrics. J. Econom. 2, 111-120.
Phillips, P.C.B., 1986. Understanding spurious regressions in econometrics. J. Econom. 33, 311-340.
Yule, G.U., 1926. Why do we sometimes get nonsense correlations between time series? A study in
sampling and the nature of time series. J. R. Stat. Soc. 89, 1-64.
Fig. 1 Frequency distribution of R2 of first PC of two unrelated I(0) processes, 10,000 time periods. [Histogram of density against R2, range approximately 0.50 to 0.54.]

Fig. 2 Frequency distribution of R2 of first PC of three unrelated I(0) processes, 10,000 time periods. [Histogram of density against R2, range approximately 0.335 to 0.355.]

Fig. 3 Frequency distribution of R2 of first PC of two unrelated I(1) processes, 10,000 time periods. [Histogram of density against R2, range approximately 0.5 to 1.]

Fig. 4 Frequency distribution of R2 of first PC of three unrelated I(1) processes, 10,000 time periods. [Histogram of density against R2, range approximately 0.2 to 1.]
Table 1  Monte Carlo results for R2 of first PC of unrelated I(0) processes a

N      T          R2         95% Confidence Interval     Std. Error (*100)   Min        Max
N=2    T=30       0.5910     0.5897     0.5923           0.0661              0.5000     0.8965
       T=60       0.5662     0.5653     0.5672           0.0488              0.5000     0.8313
       T=100      0.5510     0.5502     0.5517           0.0383              0.5000     0.7359
       T=1000     0.5164     0.5161     0.5166           0.0122              0.5000     0.5796
       T=10000    0.5073     0.5072     0.5074           0.0055              0.5000     0.5408
       T=5        0.7225     0.7199     0.7252           0.1343              0.5000     0.9994
N=3    T=30       0.4577     0.4567     0.4588           0.0543              0.3370     0.7077
       T=60       0.4210     0.4202     0.4218           0.0395              0.3350     0.6150
       T=100      0.4018     0.4012     0.4024           0.0304              0.3338     0.5348
       T=1000     0.3547     0.3546     0.3549           0.0094              0.3341     0.4053
       T=10000    0.3400     0.3400     0.3401           0.0030              0.3335     0.3541
       T=5        0.6324     0.6302     0.6345           0.1103              0.3484     0.9791
N=20   T=30       0.1790     0.1786     0.1793           0.0181              0.1315     0.3011
       T=60       0.1349     0.1347     0.1351           0.0110              0.1032     0.1880
       T=100      0.1126     0.1124     0.1127           0.0078              0.0904     0.1517
       T=1000     0.0671     0.0671     0.0671           0.0017              0.0620     0.0763
       T=10000    0.0551     0.0551     0.0551           0.0005              0.0536     0.0573
       T=5        0.4214     0.4204     0.4224           0.0498              0.2928     0.6230
N=50   T=30       0.1365     0.1362     0.1367           0.0117              0.1050     0.1936
       T=60       0.0924     0.0923     0.0925           0.0064              0.0738     0.1290
       T=100      0.0715     0.0714     0.0716           0.0042              0.0600     0.0909
       T=1000     0.0325     0.03253    0.03256          0.0007              0.0304     0.0362
       T=10000    0.0236     0.02358    0.02359          0.0002              0.0231     0.0244
       T=5        0.3860     0.3853     0.3867           0.0350              0.2748     0.5554
N=100  T=30       0.1190     0.1188     0.1191           0.0083              0.0932     0.1560
       T=60       0.0750     0.0749     0.0751           0.0044              0.0605     0.0999
       T=100      0.0549     0.0548     0.0549           0.0027              0.0470     0.0677
       T=1000     0.0198     0.0198     0.0198           0.0004              0.0185     0.0217
       T=10000    0.0127     0.01267    0.01268          0.0001              0.0124     0.0131
       T=5        0.3733     0.3728     0.3739           0.0269              0.2911     0.4958
N=500  T=30       0.1012     0.1012     0.1013           0.0041              0.0877     0.1178
       T=60       0.0570     0.0569     0.0570           0.0020              0.0498     0.0650
       T=100      0.0376     0.03755    0.03760          0.0012              0.0339     0.0428
       T=1000     0.0079     0.00791    0.00792          0.0001              0.0075     0.0084
       T=10000    0.0034     0.0033605  0.0033612        0.00002             0.0033     0.0035
       T=5        0.3630     0.3627     0.3633           0.0130              0.3105     0.4083

a Each row reports the results of a set of simulations with 10,000 replications; R2: mean across replications.
Table 2  Monte Carlo results for R2 of first PC of unrelated I(1) processes a

N      T          R2         95% Confidence Interval     Std. Error (*100)   Min        Max
N=2    T=30       0.7122     0.7098     0.7147           0.1256              0.5000     0.9825
       T=60       0.7106     0.7081     0.7130           0.1250              0.5000     0.9834
       T=100      0.7114     0.7089     0.7138           0.1253              0.5001     0.9845
       T=1000     0.7103     0.7078     0.7128           0.1260              0.5000     0.9867
       T=10000    0.7112     0.7088     0.7137           0.1255              0.5000     0.9803
       T=5        0.7554     0.7526     0.7581           0.1408              0.5000     0.9998
N=3    T=30       0.6225     0.6203     0.6248           0.1155              0.3565     0.9600
       T=60       0.6200     0.6178     0.6223           0.1152              0.3403     0.9678
       T=100      0.6195     0.6172     0.6217           0.1164              0.3492     0.9416
       T=1000     0.6199     0.6176     0.6222           0.1157              0.3507     0.9559
       T=10000    0.6211     0.6189     0.6234           0.1161              0.3461     0.9473
       T=5        0.6762     0.6738     0.6786           0.1225              0.3539     0.9941
N=20   T=30       0.4644     0.4631     0.4656           0.0633              0.2664     0.7173
       T=60       0.4633     0.4621     0.4646           0.0639              0.2708     0.6978
       T=100      0.4635     0.4623     0.4648           0.0635              0.2523     0.7223
       T=1000     0.4626     0.4614     0.4639           0.0639              0.2616     0.7050
       T=10000    0.4631     0.4618     0.4643           0.0639              0.2486     0.7160
       T=5        0.5251     0.5238     0.5264           0.0664              0.3274     0.8225
N=50   T=30       0.4495     0.4486     0.4503           0.0422              0.2625     0.6043
       T=60       0.4488     0.4480     0.4496           0.0416              0.3109     0.6100
       T=100      0.4486     0.4477     0.4494           0.0417              0.2809     0.5934
       T=1000     0.4483     0.4475     0.4491           0.0420              0.2738     0.6103
       T=10000    0.4480     0.4471     0.4488           0.0415              0.3027     0.6093
       T=5        0.5104     0.5095     0.5113           0.0444              0.3366     0.6816
N=100  T=30       0.4452     0.4445     0.4458           0.0303              0.3329     0.5549
       T=60       0.4446     0.4439     0.4451           0.0301              0.3249     0.5577
       T=100      0.4443     0.4437     0.4449           0.0299              0.3268     0.5488
       T=1000     0.4435     0.4429     0.4441           0.0294              0.3380     0.5524
       T=10000    0.4439     0.4434     0.4445           0.0298              0.3283     0.5551
       T=5        0.5056     0.5050     0.5062           0.0319              0.3919     0.6184
N=500  T=30       0.4415     0.4412     0.4418           0.0136              0.3895     0.4955
       T=60       0.44026    0.44000    0.4408           0.0135              0.3810     0.4841
       T=100      0.44026    0.44000    0.44053          0.0135              0.3810     0.4841
       T=1000     0.44021    0.43995    0.44048          0.0133              0.3899     0.5042
       T=10000    0.44032    0.44005    0.4406           0.0135              0.3757     0.4922
       T=5        0.5019     0.5016     0.5022           0.0145              0.4481     0.5544

a Notes: See Table 1.