Time Series Lecture
While I am talking, go to the department website and to my home page, find the time series class under teaching, and download the Stata file called ArnoldAgg.dta
The data box
[Figure: the three-dimensional data box, with one axis each for Cases, Variables, and Time]
Not a very beautiful box, I’m afraid. I used to have software that would let me draw this kind of thing in
minutes, but that was so long ago I have even forgotten what software it was. But I hope this at least gives
you the idea. The data matrix that we can analyse with existing software is generally a two-dimensional
slice through a three-dimensional data box. Either we have cases by variables for one point in time (for
example, a sample survey), or we have cases by time for a single variable, or we have variables by time for
a single case. Today I hope to get to the stage of talking about cross-sectional time-series data, which is a representation of the entire data box: time by cases by variables, but we will start with the simplest and most common case of variables over time for a single case.
Cross-sectional analysis (which is all most of you will have ever done, because that is what we teach in terms of introductory data analysis) is really a poor man's data analysis. When we first started doing quantitative analysis, cross-sectional analysis was the only type available. When we collected data we collected it at a particular time, and we got enough variation to be able to get a handle on cause and effect simply by collecting enough cases. But cross-sectional analysis involves a huge assumption. It assumes that differences between people are equivalent to changes over time. We get at causation by assuming that, if some difference between people is associated with some other difference, then if one of those differences were to occur as a result of change over time the other difference would also occur.
For example, we assume that, because more educated people participate more in politics, if we were to take some non-participating individual and give them an education this would make them more participatory. It is quite a brave assumption, but most of what we think we know is based on assumptions of that kind.
Time-series analysis gives one a better grip on causation because one can observe variables actually changing over time and see what else changes at the same time or soon afterwards.
You might ask why, if time-series analysis is so great, we did not always do it. The answer is that we
didn’t have the data. Today we do have the data because over the past thirty to forty years we have been
collecting data like crazy and now we have a lot of series that are 30-40 years long – long enough to see
actual causal processes at work. Today, and increasingly as time goes on, it is no longer acceptable to make
the brave assumption that differences between cases are equivalent to changes over time. We are expected
to look at actual changes over time and see what those changes are associated with.
So increasingly time-series analysis has become possible for us and increasingly we are thus expected to do
it.
When we do time-series analysis we walk in the footsteps of economists who walked this path forty years
before us. This is why most of the literature on time series analysis is econometric literature, though in
some ways we are actually moving ahead of economists, as I will be explaining.
Time-series analysis has a number of differences from cross-sectional analysis, and my job today is to
introduce you to the most important of these differences.
1) Fewer cases. A time series is quite long if it has more than 20 cases. I have never seen more than 50.
These would be tiny datasets to social scientists working in the tradition of sociology or political science
cross-sectional research, where we often have surveys with thousands of cases. Fewer cases mean that degrees of freedom are a problem, so one may be restricted to fewer variables than one would like. Fewer cases also lead us to look for ways to increase our case base by finding additional examples of cases at each time point, leading us to cross-sectional time-series.
2) Lack of independence between cases. If we are used to survey analysis we are used to datasets in which
cases are independent of each other by design. With time-series data this independence is almost never
found. Successive cases are related in all sorts of ways, threatening the assumptions underlying the
methods we use for data analysis.
3) Greater sensitivity to model specification. Partly because of the smaller number of cases and their lack
of independence, but also simply because one is looking at change rather than differences, time-series
analysis is really sensitive to model specification. Change one variable, or measure it just slightly
differently, and coefficients change enormously. If you were paying attention to the substance of what we
were doing last Thursday you may have already noticed that. A tiny change, such as removing the two cases for Northern Ireland, or interacting two variables whose effects are mutually dependent, can make a palpable difference to findings. This brings us to the fourth difference, which is
4) Greater emphasis on theory. If model specification is so important but you are limited in the number of
variables you can employ and the model is really sensitive to how they are measured, then you really have
to think about the theoretical basis of what you are doing. You cannot go on grand fishing expeditions
when you are doing time-series analysis as you can with cross-sectional survey data.
Arguably this is the most beneficial consequence of the shift to time-series data analysis. It makes us more
careful when thinking through and specifying our theoretical expectations. Barefoot empiricism just does
not work with time-series analysis.
OK. Let's get started by looking at a simple time-series dataset used for the conference paper that I
distributed ahead of time – the ArnoldAgg dataset. It has that name because it was collected by Christine
Arnold, and is an aggregate file with one case per year for the twelve countries that have been members of
the EU since at least 1985.
Double click on the file, and it will open in Stata. Click on the scroll icon, to the right and beneath the Data
menu to start logging this session. You will get a dialog box in which you can name the logfile. Call it
‘aggregate time-series’
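(If you prefer the command line to the dialog, the same thing can be done with the log command; a minimal sketch, where the file name and the .log extension are just my choice:)

* start a plain-text log instead of using the dialog
log using "aggregate time-series.log", replace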
Click on the data editor (the spreadsheet-like icon under the Window menu) and take a look at the data.
You will see that there is one case per year (though with a missing year in 1972). So this is a classic time series dataset. You have to tell Stata that it is a time-series dataset by defining a time variable.
Close the data editor by clicking the close box, and type ‘tsset year’
And Stata will tell you there is a gap. This is bad for a number of reasons I won't go into. If the gap is
not substantively important, as it is not here, you need to define another variable that goes up in sequence
without a gap. If you look again at the data editor you will see I have defined another variable next to year
which is called ‘counter’. It simply counts the number of each case. If you close the data editor again you
can type 'tsset counter' and Stata will accept it without complaint.
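(For reference, if the counter variable had not already been in the file, one way to build a gap-free sequence like it would be something along these lines, assuming the cases are already in year order:)

* construct a sequence variable with no gaps
sort year
gen counter = _n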
Now we are ready to do an analysis. The conference paper I distributed is about the responsiveness of EU
policymakers to public opinion. The idea is that when people want more unification policies, policymakers
respond (perhaps a year or two later). When they want less, policymakers also respond:
P = P’ - R*S
Where P is the amount of unification policies being produced, P’ is the preferred level of unification policy
on the part of the European public, R is the relative preference of the European public for more or less
policy, and S is the salience of those policies to public opinion. If you want to understand the ideas
underlying this equation you need to read the paper that I circulated.
In these data, R is represented by the variable ‘unification’ which is a measure of whether people think
there should be more or less than is currently in place. P’ is represented by the variable ‘membership’
which is a measure of whether they think membership in the EU is good for their country, yes or no. This
is the best proxy we could find for the level of unification that people want (some or none). P is measured
by the amount of unification policy being produced (number of directives and regulations promulgated
each year) and S is a measure of salience, for which the only proxy we could find is the average of the
answers to a question about the importance of various issues as subjects for EU legislation.
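Putting the variable names to the equation above, the model we will estimate below amounts, roughly, to

policy = b0 + b1*membership + b2*(salience * unification) + error

with the b2 term carrying the R*S part of the equation.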
There is a lot of missing data, especially on the salience question. If you look at the data editor you will see
that ‘meanimp’, the salience measure, is missing more often than not. This could be a problem. Let’s try to
get an idea of the general shape of its evolution over time.
Type ‘graph twoway line meanimp year’
The graph command is really complex and you will need to study the help files carefully if you want to
use it. There are all sorts of graphs, and twoway graphs are graphs that have vertical and horizontal axes.
The last variable in the list names the horizontal axis. Here there are just two variables in the list so the first
defines the up and down (Y-axis) scale. You can have many Y-variables if they all have the same scale.
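As an aside on that last point, membership and unification are (I am assuming here) both expressed on comparable percentage-style scales, so they could be drawn together in a single twoway graph:

* two Y-variables against year; the last variable named is the horizontal axis
graph twoway line membership unification year

But back to the graph of meanimp.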
You see that the extent of salience of European issues as measured by the average importance given to
them by respondents to the EB surveys goes up from about 20 in the early years to over 80 in the later
period. But after about 1993 the trend seems to flatten out, as though by that time (following ratification of
the Maastricht treaty with all the publicity this engendered) the EU had become fully salient. Based on this
we can create a proxy measure of salience with a value for every year that starts at 0 in 1970 and increases
in regular intervals to 1 in 1993, which value it retains thereafter.
Type ‘gen nt = (year-1970) / (1993-1970)’
And ‘replace nt=1 if nt>1’
You can graph the new variable over years by typing ‘graph twoway line nt year’
This piece of legerdemain gives us a measure of salience with the same basic shape as the real measure but
with cases for every year. We could also have used interpolation to fill in the gaps in the salience measure
which would have come to about the same thing.
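(If you wanted to try the interpolation route, Stata's ipolate command is the natural tool; a sketch, where the new variable name is just my choice:)

* fill the gaps in the salience measure by linear interpolation over years
ipolate meanimp year, gen(imeanimp)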
It is important to label any new variables you create, as shown on the screen.
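For instance (the wording of the label is of course up to you):

label variable nt "Salience proxy: 0 in 1970, rising to 1 by 1993"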
With this variable we can create our measure of salience * relative preferences.
Type 'gen ntuni = nt * unification' and label that new variable.
The basic intuition that policy will reflect public opinion is tested if you
type ‘reg policy membership ntuni’
As you see this works quite nicely, with over 60% of the variance explained. But what about heteroskedasticity and autoregression? If you don't know what those words mean, this is because you did not read the handout I sent around ahead of time. Heteroskedasticity arises when the dependent variable has very different amounts of variance at different points in the time series. Autoregression is when the dependent variable is not only dependent on independent variables but also on itself at previous time-points. We get a good idea about heteroskedasticity by looking at a plot of the dependent variable over time.
Type ‘graph twoway line policy year’
Is there heteroskedasticity there? There might be. But autoregression does not look likely since each time
point does not seem a very good predictor of the next. We can check on these things more accurately if we
go back to the regression analysis.
type ‘reg policy membership ntuni’
Type 'estat archlm, lags(1 2 3 4)'
The output gives you the significance level of heteroskedasticity for the lags you asked for. As you can see,
there is barely significant heteroskedasticity on the first lag, probably because of the big jump in policy
output after the single market act. A few more cases in the data would fix that.
This is what Stata calls a postestimation command. Unfortunately there are separate postestimation commands for time-series regression and ordinary regression. To find the time-series postestimators you
have to
Type ‘help regress postestimationts’ which is really a mouthful!
To test for serial autocorrelation, the classic test is the Durbin-Watson test.
Type ‘estat dwatson’
In the handout that I circulated ahead of time I told you roughly what the critical values of the Durbin-Watson test were. Anyone remember? 1.5 to 2.5 are probably ok. What do we have here? 1.2. Probably not
ok. WHAT CAN WE DO?
Three ways to deal with this:
1) introduce a lagged version of the dependent variable
2) correct the data to remove the dependency
3) try another test and hope it doesn't come out significant!
To create a lagged version of the dependent variable we do as follows…
Type ‘gen n1policy = policy[_n-1]’ then
Type ‘reg policy n1policy membership ntuni’
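As an aside, since the data have been tsset, the same lag can also be produced with Stata's time-series lag operator, which saves the gen step and should give identical results here because the cases are already in counter order:

* equivalent specification using the built-in lag operator
reg policy L.policy membership ntuni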
You see that lagged policy does indeed appear significant, and that its inclusion almost takes away the
significance of the membership variable. It also greatly reduces the effect of the unification variable. But if
you have a theoretical reason to include past policy this might be a satisfactory result. But there is an
alternative. You can use a procedure that ‘cleans up the errors’ by correcting the data to remove the serial
autocorrelation. To do this you use a type of regression called 'Prais-Winsten regression'. In Stata you just
Type ‘prais policy membership ntuni’
This piece of magic transforms the data so that the DW statistic becomes ok, while the strengths of
coefficients are not much changed from what we saw originally.
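(What prais assumes, roughly, is that the errors follow a first-order autoregressive process,

e(t) = rho * e(t-1) + u(t)

and it uses the estimated rho to transform the variables so that the transformed errors are approximately free of that dependence. The closely related Cochrane-Orcutt variant, obtained with the corc option of prais, does the same but simply drops the first observation.)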
So if you can get rid of serial autocorrelation with so small a transformation of the data, was the original
autocorrelation really as bad as we imagined on the basis of the DW test? Perhaps we should try a different
test. Repeat the original regression
Type 'reg policy membership ntuni'
and
Type 'estat bgodfrey'
Unlike the DW statistic, this test does not show significant autocorrelation at the first lag. Though the DW test is still customary, the Breusch-Godfrey test is much to be preferred. It also tests for autocorrelation at lags other than the first, using a lags() option like the one for archlm.
Type 'estat bgodfrey, lags(1 2 3 4 5)'
See? H0 is not rejected at any number of lags. Forget about the Durbin-Watson statistic, which is hard to evaluate and gives misleading results. It is easier to compute, but that does not matter to us since we have Stata to do the calculations.
Close your logfile
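(From the command line, this is simply 'log close'.)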
TAKE A BREAK !!
In this second part of the class we are going to look at a disaggregated version of the same data, as was
done in the second part of the paper that I handed out. These data are still aggregated from survey data, but
instead of being aggregated to the level of the year they are aggregated to the level of the country within
each year. We distinguish only those countries that have been EC/EU members since 1985, but that still
gives us almost twelve times as much data (not quite because the series is shorter for the two latecomers).
This is called a cross-section time-series dataset. In this case, since the same countries appear year after
year, it is also panel data, so that makes this ‘cross-section time-series panel data’.
Of course, it is not so clear whether all countries will show the same basic relationship as we see for
Europe as a whole, so the purpose of using panel data is not just to get more cases, but also to see if the
relationships found at the higher level of aggregation also hold at the country level.
Close Stata if you did not do so before the break.
Now double click on ‘ArnoldCntry.dta’
Open the data editor and look at the data. It starts with a date… Scroll down…
These data need to be declared as time series, so again we need to use tsset, but this time it is a cross-sectional time-series dataset, so we also need to name the country identifier, 'nat'.
Type ‘tsset nat counter’
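(In more recent versions of Stata you can declare the same structure with 'xtset nat counter'; the xt commands introduced below understand either declaration.)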
Let’s start by flat-footedly just replicating what we did with the other dataset.
Type ‘reg policy membership ntuni if nat<13’
You will see that the relationship looks very similar to what we saw at the aggregate level. It is not
changed very much if we do something we should have done at the aggregate level but did not, which is to
replace the independent variables by lagged versions of themselves.
Type ‘reg policy n2memb n2ntuni if nat<13’
This is true also at the higher level that we looked at before the break.
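The lagged predictors n2memb and n2ntuni are already in the file. If you ever needed to build something like them yourself, the time-series lag operators make it easy; a sketch, assuming (as the n2 prefix suggests, though the paper is the authority here) that they are two-year lags, and using new names so as not to overwrite the existing variables:

* hypothetical reconstruction of two-period lagged predictors
gen my_n2memb = L2.membership
gen my_n2ntuni = L2.ntuni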
With cross-section time-series data it is hard to check for heteroskedasticity. One has to actually save the
residuals from the analysis, and do an analysis of those residuals. This is beyond the remit of this class.
Joost is the resident expert, should you need to do this.
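(If you did want a quick look, one common approach is to save the residuals after the regression and plot them; a minimal sketch, with illustrative variable names:)

* save residuals from the last regression and eyeball them over time
predict myresid if e(sample), residuals
graph twoway scatter myresid counter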
However, there are analyses designed specifically for cross-section time series. In Stata all of these start with the letters xt.
Type 'xtreg policy n2memb n2ntuni if nat<13, fe'
The 'xt' on the front of 'reg' invokes a program that knows about cross-sectional data, and the 'fe' as an
option at the end of the command forces the program to look only for ‘fixed effects’. What this means is
that the program calculates the country mean for every variable and subtracts the country mean from the
data for each country before doing the analysis. So all that is left is variance over time. You see that the
coefficients are not so different when we enforce an over-time view of the data by optioning fixed effects.
As a handy by-product the xtreg program prints at the foot of its table the statistic rho, which has values similar to a correlation coefficient and tells you what fraction of the variance is attributable to the country-level effects. In this case you see the value of rho is very small. Also printed is an F-test of whether those country effects are jointly significant, which in this case is highly insignificant.
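To see concretely what the fixed-effects transformation is doing, you could demean a variable by hand; a sketch, with illustrative variable names:

* subtract each country's mean from policy, leaving only over-time variation
egen policy_cmean = mean(policy), by(nat)
gen policy_within = policy - policy_cmean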
Fixed effects can be obtained in another way as well. You can simply add a list of country dummies to the
variable list, and the effects of these dummies remove any variance attributable to individual countries.
Type 'reg policy n2memb n2ntuni bel-uk if nat<13'
One of the dummies was dropped. Why is this?
If you compare the coefficients on the substantive variables (membership and unification) you will see they
are EXACTLY the same as when you used xtreg, fe. The constant is a little different, but only a little, and we are not generally much interested in the constant.
This is the way of achieving fixed effects that is most common in the political science literature. Its main advantage is that if the country dummies ARE significant you get a higher R-squared. The main disadvantage is that you get none of the additional diagnostics (such as rho) that xtreg prints.
The dummy variables for the different countries all have coefficients, of course, but none of them are
significantly different from 0.
There is something strange about the coefficients for the country dummies, however. What is this?
Lots else we could talk about. Any questions?