Time Series Lecture

While I am talking, go to the department website and to my home page, find the time series class under teaching, and download the Stata file called ArnoldAgg.dta.

[Slide: the data box – cases × time × variables]

Not a very beautiful box, I'm afraid. I used to have software that would let me draw this kind of thing in minutes, but that was so long ago I have even forgotten what software it was. But I hope this at least gives you the idea. The data matrix that we can analyse with existing software is generally a two-dimensional slice through a three-dimensional data box. Either we have cases by variables for one point in time (for example, a sample survey), or we have cases by time for a single variable, or we have variables by time for a single case. Today I hope to get to the stage of talking about cross-sectional time-series data, which is a representation of the entire data box: time by cases by variables. But we will start with the simpler and most common case of variables over time for a single case.

Cross-sectional analysis (which is all most of you will ever have done, because that is what we teach in introductory data analysis) is really a poor man's data analysis. When we first started doing quantitative analysis, cross-sectional analysis was the only type available. When we collected data we collected it at a particular time, and we got enough variation to get a handle on cause and effect simply by collecting enough cases. But cross-sectional analysis involves a huge assumption. It assumes that differences between people are equivalent to changes over time. We get at causation by assuming that if some difference between people is associated with some other difference, then if one of those differences were to occur as a result of change over time, the other difference would also occur. For example, we assume that because more educated people participate more in politics, taking some non-participating individual and giving them an education would make them more participatory. It is quite a brave assumption, but most of what we think we know is based on assumptions of that kind.

Time-series analysis gives one a better grip on causation because one can observe variables actually changing over time and see what else changes at the same time or soon afterwards. You might ask: if time-series analysis is so great, why did we not always do it? The answer is that we did not have the data. Today we do have the data, because over the past thirty to forty years we have been collecting data like crazy, and now we have a lot of series that are 30-40 years long – long enough to see actual causal processes at work. Today, and increasingly as time goes on, it is no longer acceptable to make the brave assumption that differences between cases are equivalent to changes over time. We are expected to look at actual changes over time and see what those changes are associated with. So time-series analysis has become increasingly possible for us, and increasingly we are expected to do it. When we do time-series analysis we walk in the footsteps of economists, who walked this path forty years before us. This is why most of the literature on time-series analysis is econometric literature, though in some ways we are actually moving ahead of economists, as I will be explaining.

Time-series analysis has a number of differences from cross-sectional analysis, and my job today is to introduce you to the most important of these differences.
1) Fewer cases. A time series is quite long if it has more than 20 cases; I have never seen more than 50. These would be tiny datasets to social scientists working in the tradition of sociology or political science cross-sectional research, where we often have surveys with thousands of cases. Fewer cases means degrees of freedom are a problem, so one may be restricted to fewer variables than one would like. Fewer cases also leads us to look for ways to increase our case base by finding additional examples of cases at each time point, which leads us to cross-sectional time series.

2) Lack of independence between cases. If we are used to survey analysis, we are used to datasets in which cases are independent of each other by design. With time-series data this independence is almost never found. Successive cases are related in all sorts of ways, threatening the assumptions underlying the methods we use for data analysis.

3) Greater sensitivity to model specification. Partly because of the smaller number of cases and their lack of independence, but also simply because one is looking at change rather than differences, time-series analysis is really sensitive to model specification. Change one variable, or measure it just slightly differently, and coefficients change enormously. If you were paying attention to the substance of what we were doing last Thursday you may have already noticed that. A tiny change, such as removing the two cases for Northern Ireland, or interacting two variables whose effects are mutually dependent, can make a palpable difference to findings. This brings about the fourth difference, which is

4) Greater emphasis on theory. If model specification is so important, but you are limited in the number of variables you can employ and the model is really sensitive to how they are measured, then you really have to think about the theoretical basis of what you are doing. You cannot go on grand fishing expeditions when you are doing time-series analysis as you can with cross-sectional survey data. Arguably this is the most beneficial consequence of the shift to time-series data analysis: it makes us more careful when thinking through and specifying our theoretical expectations. Barefoot empiricism just does not work with time-series analysis.

OK. Let's get started by looking at a simple time-series dataset used for the conference paper that I distributed ahead of time – the ArnoldAgg dataset. It has that name because it was collected by Christine Arnold, and it is an aggregate file with one case per year for the twelve countries that have been members of the EU since at least 1985.

Double-click on the file, and it will open in Stata. Click on the scroll icon, to the right of and beneath the Data menu, to start logging this session. You will get a dialog box in which you can name the logfile. Call it 'aggregate time-series'. Click on the data editor (the spreadsheet-like icon under the Window menu) and take a look at the data. You will see that there is one case per year (though with a missing year in 1972). So this is a classic time-series dataset. You have to tell Stata that it is a time-series dataset by defining a time variable. Close the data editor by clicking the close box, and

Type 'tsset year'

and Stata will tell you there is a gap. This is bad for a number of reasons I won't go into. If the gap is not substantively important, as it is not here, you need to define another variable that goes up in sequence without a gap. The command equivalents of these point-and-click steps are sketched below.
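For reference, here are command-line equivalents of those point-and-click steps in one place – a minimal sketch, assuming ArnoldAgg.dta sits in the current working directory and using an example name for the logfile:

    * open the data, start a log, and declare the time variable
    use ArnoldAgg.dta, clear
    log using "aggregate time-series.log"
    tsset year          // Stata reports that the series contains a gap (1972 is missing)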
If you look again at the data editor you will see I have defined another variable next to year, called 'counter'. It simply counts the number of each case. If you close the data editor again you can

Type 'tsset counter'

and Stata will accept it without complaint. Now we are ready to do an analysis.

The conference paper I distributed is about the responsiveness of EU policymakers to public opinion. The idea is that when people want more unification policies, policymakers respond (perhaps a year or two later), and when they want less, policymakers also respond:

P = P' - R*S

where P is the amount of unification policies being produced, P' is the preferred level of unification policy on the part of the European public, R is the relative preference of the European public for more or less policy, and S is the salience of those policies to public opinion. If you want to understand the ideas underlying this equation you need to read the paper that I circulated.

In these data, R is represented by the variable 'unification', which is a measure of whether people think there should be more or less unification than is currently in place. P' is represented by the variable 'membership', which is a measure of whether they think membership in the EU is good for their country, yes or no. This is the best proxy we could find for the level of unification that people want (some or none). P is measured by the amount of unification policy being produced (the number of directives and regulations promulgated each year), and S is a measure of salience, for which the only proxy we could find is the average of the answers to a question about the importance of various issues as subjects for EU legislation.

There is a lot of missing data, especially on the salience question. If you look at the data editor you will see that 'meanimp', the salience measure, is missing more often than not. This could be a problem. Let's try to get an idea of the general shape of its evolution over time.

Type 'graph twoway line meanimp year'

The graph command is really complex and you will need to study the help files carefully if you want to use it. There are all sorts of graphs, and twoway graphs are graphs that have vertical and horizontal axes. The last variable in the list names the horizontal axis. Here there are just two variables in the list, so the first defines the up-and-down (Y-axis) scale. You can have many Y-variables if they all have the same scale.

You see that the extent of salience of European issues, as measured by the average importance given to them by respondents to the EB surveys, goes up from about 20 in the early years to over 80 in the later period. But after about 1993 the trend seems to flatten out, as though by that time (following ratification of the Maastricht Treaty, with all the publicity this engendered) the EU had become fully salient. Based on this we can create a proxy measure of salience with a value for every year that starts at 0 in 1970 and increases in regular intervals to 1 in 1993, which value it retains thereafter.

Type 'gen nt = (year-1970) / (1993-1970)'

and

'replace nt=1 if nt>1'

You can graph the new variable over years by typing 'graph twoway line nt year'. This piece of legerdemain gives us a measure of salience with the same basic shape as the real measure but with cases for every year. We could also have used interpolation to fill in the gaps in the salience measure, which would have come to about the same thing. It is important to label any new variables you create, as shown on the screen; the whole construction is sketched below.
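Putting the construction of the salience proxy in one place – a sketch assuming the variable names used above (the label text is only an example):

    tsset counter                               // gapless time index in place of year
    gen nt = (year - 1970) / (1993 - 1970)      // rises from 0 in 1970 to 1 in 1993
    replace nt = 1 if nt > 1                    // stays flat at 1 after 1993
    label variable nt "salience proxy (0 in 1970 rising to 1 by 1993)"
    graph twoway line nt year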
With this variable we can create our measure of salience times relative preferences.

Type 'gen ntuni = nt * unification'

and label that new variable. The basic intuition that policy will reflect public opinion is tested if you

Type 'reg policy membership ntuni'

As you see, this works quite nicely, with over 60% of the variance explained. But what about heteroskedasticity and autoregression? If you don't know what those words mean, this is because you did not read the handout I sent around ahead of time. Heteroskedasticity arises when the dependent variable has very different amounts of variance at different points in the time series. Autoregression is when the dependent variable is not only dependent on independent variables but also on itself at previous time points.

We get a good idea about heteroskedasticity by looking at a plot of the dependent variable over time.

Type 'graph twoway line policy year'

Is there heteroskedasticity there? There might be. But autoregression does not look likely, since each time point does not seem a very good predictor of the next. We can check on these things more accurately if we go back to the regression analysis.

Type 'reg policy membership ntuni'
Type 'estat archlm, lags(1 2 3 4)'

The output gives you the significance level of heteroskedasticity for the lags you asked for. As you can see, there is barely significant heteroskedasticity on the first lag, probably because of the big jump in policy output after the Single European Act. A few more cases in the data would fix that.

This is what Stata calls a postestimation command. Unfortunately there are separate postestimation commands for time-series regression and ordinary regression. To find the time-series postestimators you have to

Type 'help regress postestimation ts'

which is really a mouthful!

To test for serial autocorrelation, the classic test is the Durbin-Watson test.

Type 'estat dwatson'

In the handout that I circulated ahead of time I told you roughly what the critical values of the Durbin-Watson test were. Anyone remember? 1.5 to 2.5 are probably OK. What do we have here? 1.2. Probably not OK.

WHAT CAN WE DO? There are three ways to deal with this: 1) introduce a lagged version of the dependent variable; 2) correct the data to remove the dependency; 3) try another test and hope it doesn't come out significant!

To create a lagged version of the dependent variable we do as follows.

Type 'gen n1policy = policy[_n-1]'

then

Type 'reg policy n1policy membership ntuni'

You see that lagged policy does indeed appear significant, and that its inclusion almost takes away the significance of the membership variable. It also greatly reduces the effect of the unification variable. But if you have a theoretical reason to include past policy, this might be a satisfactory result.

But there is an alternative. You can use a procedure that 'cleans up the errors' by correcting the data to remove the serial autocorrelation. To do this you use a type of regression called Prais-Winsten regression. In Stata you just

Type 'prais policy membership ntuni'

This piece of magic transforms the data so that the DW statistic becomes OK, while the strengths of the coefficients are not much changed from what we saw originally. So if you can get rid of serial autocorrelation with so small a transformation to the data, was the original autocorrelation really as bad as we imagined on the basis of the DW test? Perhaps we should try a different test. The whole diagnostic sequence is sketched below.
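To recap, here is the diagnostic sequence in one place – a sketch using the commands above, with comments mapping the variables onto the paper's equation as described earlier:

    gen ntuni = nt * unification             // S x R from the paper's equation
    label variable ntuni "salience x relative preference"
    reg policy membership ntuni              // policy ~ P, membership ~ P'
    estat archlm, lags(1 2 3 4)              // heteroskedasticity (ARCH) test after the regression
    estat dwatson                            // Durbin-Watson test for first-order autocorrelation
    gen n1policy = policy[_n-1]              // lagged dependent variable (L.policy also works after tsset)
    reg policy n1policy membership ntuni     // option 1: include the lagged dependent variable
    prais policy membership ntuni            // option 2: Prais-Winsten correction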
Repeat the original regression.

Type 'reg policy membership ntuni'
Type 'estat bgodfrey'

This test does not show significant autocorrelation on the first lag, which the DW statistic did. Though the DW test is still customary, the Breusch-Godfrey test is much to be preferred. It can also test for autocorrelation at lags other than the first, using an option like the one for archlm.

Type 'estat bgodfrey, lags(1 2 3 4 5)'

See? H0 is not rejected at any number of lags. Forget about the Durbin-Watson statistic, which is hard to evaluate and gives misleading results. It is easier to compute, but that does not matter to us since we have Stata to do the calculations.

Close your logfile. TAKE A BREAK!!

In this second part of the class we are going to look at a disaggregated version of the same data, as was done in the second part of the paper that I handed out. These data are still aggregated from survey data, but instead of being aggregated to the level of the year they are aggregated to the level of the country within each year. We distinguish only those countries that have been EC/EU members since 1985, but that still gives us almost twelve times as much data (not quite, because the series is shorter for the two latecomers). This is called a cross-section time-series dataset. In this case, since the same countries appear year after year, it is also panel data, so that makes this 'cross-section time-series panel data'. Of course, it is not so clear whether all countries will show the same basic relationship as we see for Europe as a whole, so the purpose of using panel data is not just to get more cases, but also to see if the relationships found at the higher level of aggregation also hold at the country level.

Close Stata if you did not do so before the break. Now double-click on 'ArnoldCntry.dta'. Open the data editor and look at the data. It starts with a date,… Scroll down… These data need to be declared as time series, so again we need to use tsset, but this time it is a cross-sectional time-series dataset, so we also need to name the country identifier, 'nat'.

Type 'tsset nat counter'

Let's start by flat-footedly just replicating what we did with the other dataset.

Type 'reg policy membership ntuni if nat<13'

You will see that the relationship looks very similar to what we saw at the aggregate level. It is not changed very much if we do something we should have done at the aggregate level but did not, which is to replace the independent variables by lagged versions of themselves.

Type 'reg policy n2memb n2ntuni if nat<13'

This is true also at the higher level that we looked at before the break. With cross-section time-series data it is hard to check for heteroskedasticity. One has to actually save the residuals from the analysis and do an analysis of those residuals. This is beyond the remit of this class. Joost is the resident expert, should you need to do this.

However, there are analyses designed specifically for cross-section time series. In Stata all of these start with the letters xt.

Type 'xtreg policy n2memb n2ntuni if nat<13, fe'

The 'xt' on the front of 'reg' invokes a program that knows about cross-sectional data, and the 'fe' as an option at the end of the command forces the program to look only for 'fixed effects'. What this means is that the program calculates the country mean for every variable and subtracts the country mean from the data for each country before doing the analysis, so all that is left is variance over time. These panel-data steps are sketched below.
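Putting the panel-data steps in one place – a sketch assuming ArnoldCntry.dta and the variable names used above:

    use ArnoldCntry.dta, clear
    tsset nat counter                              // panel identifier first, then the time index
    reg policy membership ntuni if nat < 13        // flat-footed replication of the aggregate model
    reg policy n2memb n2ntuni if nat < 13          // with lagged independent variables
    xtreg policy n2memb n2ntuni if nat < 13, fe    // fixed-effects (within-country) estimator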
You see that the coefficients are not so different when we enforce an over-time view of the data by optioning fixed effects. As a handy by-product, the xtreg program prints at the foot of its table the rho statistic, which behaves like a correlation coefficient: it is the share of the error variance due to the country effects, and so indicates how strongly observations from the same country are correlated over time. In this case you see the value of rho is very small. Also printed is an F-test of whether those country effects are jointly significant, which in this case is highly insignificant.

Fixed effects can be obtained in another way as well. You can simply add a list of country dummies to the variable list, and the effects of these dummies remove any variance attributable to individual countries.

Type 'reg policy n2memb n2ntuni bel-uk if nat<13'

One of the dummies was dropped. Why is this? If you compare the coefficients on the substantive variables (membership and unification) you will see they are EXACTLY the same as when you used xtreg, fe. The constant is a little different, but very little different, and we are not generally much interested in the constant. This is the means of achieving fixed effects that is most common in the political science literature. Its main advantage is that if the country dummies ARE significant you get a higher R-squared. The main disadvantage is that you get none of the diagnostics (such as rho) that xtreg prints. The dummy variables for the different countries all have coefficients, of course, but none of them are significantly different from 0. There is something strange about the coefficients for the country dummies, however. What is this?

Lots else we could talk about. Any questions?
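A closing footnote, not part of the walkthrough: an equivalent way to include the country dummies without typing the bel-uk list by hand is Stata's factor-variable notation. A sketch, assuming nat is the numeric country identifier used above:

    reg policy n2memb n2ntuni i.nat if nat < 13
    * one country is automatically taken as the reference category, so its dummy
    * is omitted (that is why one dummy drops out above); the coefficients on
    * n2memb and n2ntuni match xtreg, fe exactly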