Econometrics Lecture Notes: Monte Carlo Simulation

In econometrics we frequently wish to explore the properties of an estimator. That is, we are not so much interested in economic theory, or in estimating a relationship between Y and a vector of explanatory variables X. Instead we want to know what happens to our estimator as the sample size increases, as the variance of the error term increases, or as the degree of serial correlation increases. In order to do this we make up the data set and make up an (economic?) relationship, which we then proceed to estimate. An example, within the context of GLS and serial correlation, will clarify.

Step 1: Specify the model:

Y_t = 101 + 2.3X_{1t} − 0.2X_{2t} + u_t   (1)

Step 2: Specify n, the sample size, to be 100.

Step 3: Generate data for X_{1t}. To do this we use a random number generator. That is, we program the computer (within, e.g., RATS) to generate 100 numbers, which in turn will be X_{1,1}, X_{1,2}, ..., X_{1,100}. We need to specify the mean (μ_1) and the variance (σ²_1) of these numbers: X_1 ~ N(μ_1, σ²_1). We now use computers to do this; early on it is possible a roulette wheel may have been used, hence the name. Now generate data for X_{2t}, t = 1, ..., 100, using the same approach. You should probably specify a different mean and variance. We could also make the variable trended by, e.g., generating X_{2t} as before and then adding a trend, replacing X_{2t} with X_{2t} + 0.2t, where t goes 1, 2, 3, ..., 100.

Step 4: Generate data for u_t. This takes several stages. We assume that the error term displays first order positive serial correlation:

u_t = 0.7u_{t-1} + ε_t   (2)

where ε_t is a pure white noise error term. We need to specify its variance (the mean will be zero) and again we use a random number generator, where ε_t ~ N(0, σ²_ε). Note N denotes the normal distribution, i.e. our error terms will be randomly drawn from a normal distribution with mean zero and variance σ²_ε. We tend to use the normal distribution, but you could use others. We now assume that the error term in period 1, u_1, was 0, and can then generate the error terms for periods 2 to 100 by inserting the lagged values for u_{t-1} and the values for ε_t into equation (2). Excel would do this quite easily.

Step 5: We now know everything on the right hand side of equation (1). We can use this to generate values for Y_t, for all t.

Step 6: Estimate equation (1) using OLS, that is regress Y_t on a constant, X_{1t} and X_{2t}. Call the resulting vector of estimates β_0^OLS, β_1^OLS, β_2^OLS for the three coefficients respectively. (Note you will also get estimates of the variance of the error term and of the first order serial correlation coefficient in (2) [in this case 0.7].) Now estimate equation (1) using GLS, where the coefficients will be termed β_0^GLS, β_1^GLS, β_2^GLS, respectively. We could compare these coefficients with the known true values to see which is the closest. But with just one comparison to make, luck will play a part. Hence:

Step 7: Go back to step 3 and repeat the whole process [some would argue that we should go back only to step 4, keep the X's, and change just the error term and hence Y]. Now do this a hundred times and calculate:

(1/100) Σ_{j=1}^{100} (β_0^{OLS,j} − 101)
(1/100) Σ_{j=1}^{100} (β_1^{OLS,j} − 2.3)
(1/100) Σ_{j=1}^{100} (β_2^{OLS,j} − (−0.2))

where the superscript j denotes the simulation, from 1 to 100. This gives you the average bias over the 100 simulations for each of the three coefficients. For an unbiased estimator it should be close to zero. You should do the same for the GLS estimator. What would you expect to find?
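The seven steps above map almost directly into code. The following is a minimal sketch in Python with numpy (the notes mention RATS and Excel, so the choice of language, and the particular means, variances and seed used here, are purely illustrative assumptions). For brevity the "GLS" step quasi-differences the data with the true ρ = 0.7 rather than estimating it, so it shows the mechanics of the experiment rather than a full feasible GLS procedure.

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_sims, rho = 100, 100, 0.7
beta_true = np.array([101.0, 2.3, -0.2])

ols_estimates, gls_estimates = [], []
for _ in range(n_sims):
    # Step 3: generate the explanatory variables (means/variances chosen arbitrarily)
    x1 = rng.normal(10.0, 2.0, n)
    x2 = rng.normal(5.0, 3.0, n) + 0.2 * np.arange(1, n + 1)   # trended regressor

    # Step 4: AR(1) errors u_t = 0.7 u_{t-1} + eps_t, with u_1 = 0
    eps = rng.normal(0.0, 1.0, n)
    u = np.zeros(n)
    for t in range(1, n):
        u[t] = rho * u[t - 1] + eps[t]

    # Step 5: generate the dependent variable from equation (1)
    y = beta_true[0] + beta_true[1] * x1 + beta_true[2] * x2 + u

    # Step 6a: OLS of y on a constant, x1 and x2
    X = np.column_stack([np.ones(n), x1, x2])
    ols_estimates.append(np.linalg.lstsq(X, y, rcond=None)[0])

    # Step 6b: "GLS" by quasi-differencing with the (here known) rho,
    # i.e. regress y_t - rho*y_{t-1} on the similarly transformed regressors
    y_star = y[1:] - rho * y[:-1]
    X_star = np.column_stack([np.full(n - 1, 1 - rho),
                              x1[1:] - rho * x1[:-1],
                              x2[1:] - rho * x2[:-1]])
    gls_estimates.append(np.linalg.lstsq(X_star, y_star, rcond=None)[0])

# Step 7: average bias and variance of each coefficient across the simulations
ols_estimates = np.array(ols_estimates)
gls_estimates = np.array(gls_estimates)
print("OLS average bias:", ols_estimates.mean(axis=0) - beta_true)
print("GLS average bias:", gls_estimates.mean(axis=0) - beta_true)
print("OLS variance:    ", ols_estimates.var(axis=0))
print("GLS variance:    ", gls_estimates.var(axis=0))
```

Increasing n_sims reduces the role that luck plays in any single comparison, which is exactly the point made at the end of step 6.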
In general, both are unbiased estimators (unless you add a lagged dependent variable to equation (1), in which case, together with serial correlation, you would get biased estimates when using OLS but not GLS). However, the GLS estimator should be more efficient: on average it should be closer to the true values. To test this, you would calculate the variance of the 100 OLS estimates of each of the three coefficients and do the same with the GLS estimates.

Step 8: This completes the Monte Carlo process, but you can continue. You could, for example, see what happens as you increase the sample size (at step 2), increase the variance of the error term (at step 4), or introduce heteroskedastic error terms into the model, whereby the variance of the error term is related to one or more of the X's.

2SLS

If we were doing a Monte Carlo simulation with a simultaneous equation system and comparing OLS with 2SLS, the methodology would be essentially the same. You would need to generate values for all the exogenous right hand side variables in all the equations – remember that you need to specify a simultaneous system. You also need to generate values for all the error terms in all the equations. However, because, e.g., Y_1 would depend upon Y_2, we could not calculate Y_1 in step 5 until we knew Y_2; but similarly we could not generate Y_2 until we knew Y_1. Hence we must derive the reduced form equations, expressing Y_1 and Y_2 solely in terms of exogenous variables and error terms, and calculate Y_1 and Y_2 that way.

The above has been set out within the context of time series analysis, because the basic example was one of serial correlation, but the techniques can be applied equally to cross section problems.

Econometrics Lecture Notes: Dummy Variables & Time Trends

Shift Dummy

In the regression

Y_t = β_0 + β_1X_{1t} + β_2X_{2t} + β_3D_{1t},  estimated 1955Q1 to 2003Q4,

D_{1t} takes a value of 1 in 1974Q1-Q4 and 1979Q1-Q4, and otherwise takes a value of zero. If β_3 is significant it implies that in those quarters Y was β_3 higher (lower if negative) than in the rest of the sample: something happened in those quarters to shift the relationship up or down. The quarters I have chosen coincide roughly with the two oil crises of the 1970s. Another example I have used in a paper was when a civil service strike significantly reduced the number of bankrupt firms, because the Inland Revenue and Customs and Excise are among the major petitioners to the courts for the bankruptcy of firms. The significance of β_3 is therefore a test of a structural change in the periods specified. If D_1 had instead taken the form D_{1t} = 0 for all periods until 1971Q1 and a value of one thereafter, and was significant, it would imply a permanent shift upwards/downwards of the relationship and be tantamount to testing for a structural break. In effect it changes the constant term. An alternative is called a:

Slope Dummy

In the regression

Y_t = β_0 + β_1X_{1t} + β_2X_{2t} + β_3X_{1t}D_{1t},  estimated 1955Q1 to 2003Q4,

if β_3 is significant it implies that the coefficient on X_1 was β_1 for the periods when D_1 was not operative and equal to β_1 + β_3 in the periods when D_1 was operative.

It is tremendously tempting at times, when you look at a plot of the residuals and see large positive (negative) outliers in, say, three successive periods, or see what looks like a structural break, to specify a dummy variable to pick this up. Some would regard this as a form of data mining; however, if you can find a plausible event to explain these shifts, it seems a case of the data informing the theory. But you must have a plausible explanation: simply including a dummy variable because it is significant is not acceptable.
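As a concrete illustration of how shift and slope dummies are constructed, here is a minimal sketch in Python with numpy and pandas. The sample period matches the notes, but the data, the "true" coefficient values and the seed are assumptions made up purely for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical quarterly sample, 1955Q1-2003Q4; data and coefficients are made up
dates = pd.period_range("1955Q1", "2003Q4", freq="Q")
n = len(dates)
rng = np.random.default_rng(0)
x1 = rng.normal(10.0, 2.0, n)
x2 = rng.normal(5.0, 3.0, n)

# Shift dummy: 1 in 1974Q1-Q4 and 1979Q1-Q4, 0 otherwise
d1 = np.isin(dates.year, [1974, 1979]).astype(float)

# Made-up relationship with an upward shift of 4 in the dummy quarters
y = 101 + 2.3 * x1 - 0.2 * x2 + 4.0 * d1 + rng.normal(0.0, 1.0, n)

# Shift-dummy regression: y on a constant, x1, x2 and d1
X_shift = np.column_stack([np.ones(n), x1, x2, d1])
print("shift-dummy estimates:", np.linalg.lstsq(X_shift, y, rcond=None)[0])

# Slope-dummy regression: the interaction x1*d1 allows the coefficient on x1
# to equal beta1 when d1 = 0 and beta1 + beta3 when d1 = 1
X_slope = np.column_stack([np.ones(n), x1, x2, x1 * d1])
print("slope-dummy estimates:", np.linalg.lstsq(X_slope, y, rcond=None)[0])
```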
Seasonal Dummies

When we use quarterly or monthly data we can expect there to be regular movements in the data, based on the different times of the year. (People consume more ice cream and lager in summer and possibly drink more coffee in winter.) There are several ways of dealing with this in econometrics. We can use seasonally adjusted data, i.e. take data to which someone (in general someone else) has applied some form of filter to remove the regular seasonal movements. The problem is that they may have filtered out more than you would wish. Data is a scarce commodity, and in general filtering removes part of the signal and reduces the econometrician's ability to discern the true DGP (data generating process). An alternative is to use seasonal dummies. Thus, in the following regression with quarterly data

Y_t = β_0 + β_1X_{1t} + β_2X_{2t} + β_3SD_{1t} + β_4SD_{2t} + β_5SD_{3t}

SD_{1t}, SD_{2t}, SD_{3t} are seasonal dummies, e.g.:

Quarter    SD_{1t}   SD_{2t}   SD_{3t}
1974Q1     1         0         0
1974Q2     0         1         0
1974Q3     0         0         1
1974Q4     0         0         0
1975Q1     1         0         0
1975Q2     0         1         0
1975Q3     0         0         1
1975Q4     0         0         0

This indicates, when estimated, that Y_t is β_3 higher (lower) in quarter 1 than in quarter 4, β_4 higher (lower) in quarter 2 than in quarter 4, and β_5 higher (lower) in quarter 3 than in quarter 4. Quarter 4 is the reference quarter; none of the dummy variables is operative in quarter 4. If we had included a fourth seasonal dummy variable then in every quarter one of the dummies would have been 1, that is ΣSD_{jt} = 1 for all t (where the summation is across j, from 1 to 4). This coincides with the constant term, the regression would be exactly collinear, and we would not be able to invert (X′X) to obtain (X′X)⁻¹X′Y. Sometimes computers do give results, but they are nonsense. This is known as 'the dummy variable trap'. There are other forms, e.g. if we divide people into rural, town and city dwellers and include dummies for all three, we run into the dummy variable trap.

There is a further method of dealing with seasonality. If we take the above equation and lag it four periods:

Y_{t-4} = β_0 + β_1X_{1,t-4} + β_2X_{2,t-4} + β_3SD_{1,t-4} + β_4SD_{2,t-4} + β_5SD_{3,t-4}

Subtracting this from the previous equation:

Y_t − Y_{t-4} = β_0 − β_0 + β_1[X_{1t} − X_{1,t-4}] + β_2[X_{2t} − X_{2,t-4}] + β_3[SD_{1t} − SD_{1,t-4}] + β_4[SD_{2t} − SD_{2,t-4}] + β_5[SD_{3t} − SD_{3,t-4}]
            = β_1[X_{1t} − X_{1,t-4}] + β_2[X_{2t} − X_{2,t-4}]

since each seasonal dummy takes the same value as it did four quarters earlier. That is, regressing the annual change in the left hand side variable on the annual changes in the right hand side variables [with no constant term] removes the problem of seasonality.

Time Trend

In the Cobb-Douglas production function:

Y_t = A K_t^α L_t^β e^{γt}

we can take [natural] logs to base e:

ln Y_t = ln A + α ln K_t + β ln L_t + γt

t is an example of a time trend; it increases by 1 every period: 1, 2, 3, ..., n. It reflects the impact on the left hand side variable of something which is changing in a steady manner over time and which we cannot otherwise model. In the above case it is the impact of productivity growth, and γ is an estimate of productivity growth.
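As a short illustration of the time trend, here is a sketch in Python with numpy that simulates a Cobb-Douglas data set and recovers γ as the coefficient on t in the log regression. All parameter values and the data are assumptions made up for the example.

```python
import numpy as np

# Hypothetical data for a Cobb-Douglas production function with productivity
# growth; the "true" parameter values below are made up for illustration.
rng = np.random.default_rng(1)
n = 100
t = np.arange(1, n + 1)                      # time trend 1, 2, ..., n
K = np.exp(rng.normal(3.0, 0.3, n))          # capital
L = np.exp(rng.normal(2.0, 0.2, n))          # labour
A, alpha, beta, gamma = 1.5, 0.3, 0.6, 0.02
Y = A * K**alpha * L**beta * np.exp(gamma * t + rng.normal(0.0, 0.05, n))

# Estimate ln Y_t = ln A + alpha ln K_t + beta ln L_t + gamma t by OLS;
# the coefficient on t is the estimate of productivity growth gamma
X = np.column_stack([np.ones(n), np.log(K), np.log(L), t])
coef = np.linalg.lstsq(X, np.log(Y), rcond=None)[0]
print("ln A, alpha, beta, gamma:", coef)
```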