Lectures in Modern Economic Time Series Analysis. 2 ed.
© Bo Sjö
Linköping, Sweden
email: bo.sjo@liu.se
October 30, 2011

CONTENTS

1 Introduction
  1.1 Outline of this Book/Text/Course/Workshop
  1.2 Why Econometrics?
  1.3 Junk Science and Junk Econometrics
2 Introduction to Econometric Time Series
  2.1 Programs
  2.2 Different types of time series
  2.3 Repetition - Your First Courses in Statistics and Econometrics

I Basic Statistics
3 Time Series Modeling - An Overview
  3.1 Statistical Models
  3.2 Random Variables
  3.3 Moments of random variables
  3.4 Popular Distributions in Econometrics
  3.5 Analysing the Distribution
  3.6 Multidimensional Random Variables
  3.7 Marginal and Conditional Densities
  3.8 The Linear Regression Model - A General Description
4 The Method of Maximum Likelihood
  4.1 MLE for a Univariate Process
  4.2 MLE for a Linear Combination of Variables
5 The Classical tests - Wald, LM and LR tests

II Time Series Modeling
6 Random Walks, White noise and All That
  6.1 Different types of processes
  6.2 White Noise
  6.3 The Log Normal Distribution
  6.4 The ARIMA Model
  6.5 The Random Walk Model
  6.6 Martingale Processes
  6.7 Markov Processes
  6.8 Brownian Motions
  6.9 Brownian motions and the sum of white noise
    6.9.1 The geometric Brownian motion
    6.9.2 A more formal definition
7 Introduction to Time Series Modeling
  7.1 Descriptive Tools for Time Series
    7.1.1 Weak and Strong Stationarity
    7.1.2 Weak Stationarity, Covariance Stationary and Ergodic Processes
    7.1.3 Strong Stationarity
    7.1.4 Finding the Optimal Lag Length and Information Criteria
    7.1.5 The Lag Operator
    7.1.6 Generating Functions
    7.1.7 The Difference Operator
    7.1.8 Filters
    7.1.9 Dynamics and Stability
    7.1.10 Fractional Integration
    7.1.11 Building an ARIMA Model. The Box-Jenkins Approach
    7.1.12 Is the ARMA model identified?
  7.2 Theoretical Properties of Time Series Models
    7.2.1 The Principle of Duality
    7.2.2 Wold's decomposition theorem
  7.3 Additional Topics
    7.3.1 Seasonality
    7.3.2 Non-stationarity
  7.4 Aggregation
  7.5 Overview of Single Equation Dynamic Models
8 Multipliers and Long-run Solutions of Dynamic Models
9 Vector Autoregressive Models
    9.0.1 How to estimate a VAR?
    9.0.2 Impulse responses in a VAR with non-stationary variables and cointegration
  9.1 BVAR, TVAR etc.

III Granger Non-causality Tests
10 Introduction to Exogeneity and Multicollinearity
  10.1 Exogeneity
    10.1.1 Weak Exogeneity
    10.1.2 Strong Exogeneity
    10.1.3 Super Exogeneity
  10.2 Multicollinearity and understanding of multiple regression
11 Univariate Tests of The Order of Integration
    11.0.1 The DF-test
    11.0.2 The ADF-test
    11.0.3 The Phillips-Perron test
    11.0.4 The LMSP-test
    11.0.5 The KPSS-test
    11.0.6 The G(p, q) test
  11.1 The Alternative Hypothesis in I(1) Tests
  11.2 Fractional Integration
12 Non-Stationarity and Co-integration
    12.0.1 The Spurious Regression Problem
    12.0.2 Integrated Variables and Co-integration
    12.0.3 Approaches to Testing for Co-integration
13 Integrated Variables and Common Trends
14 A Deeper Look at Johansen's Test
15 The Estimation of Dynamic Models
  15.1 Deterministic Explanatory Variables
  15.2 The Deterministic Trend Model
  15.3 Stochastic Explanatory Variables
  15.4 Lagged Dependent Variables
  15.5 Lagged Dependent Variables and Autocorrelation
  15.6 The Problems of Dependence and the Initial Observation
  15.7 Estimation with Integrated Variables
16 Encompassing
17 ARCH Models
    17.0.1 Practical Modelling Tips
  17.1 Some ARCH Theory
  17.2 Some Different Types of ARCH and GARCH Models
  17.3 The Estimation of ARCH models
18 Econometrics and Rational Expectations
    18.0.1 Rational vs. other Types of Expectations
    18.0.2 Typical Errors in the Modeling of Expectations
    18.0.3 Modeling Rational Expectations
    18.0.4 Testing Rational Expectations
19 A Research Strategy
20 References
  20.1 APPENDIX 1
  20.2 Appendix III Operators
    20.2.1 The Expectations Operator
    20.2.2 The Variance Operator
    20.2.3 The Covariance Operator
    20.2.4 The Sum Operator
    20.2.5 The Plim Operator
    20.2.6 The Lag and the Difference Operators

1. INTRODUCTION

"He who controls the past controls the future." George Orwell in "1984".

Please respect that this is work in progress. It has never been my intention to write a commercial book, or a perfect textbook in time series econometrics. It is simply a collection of lectures in a popular form that can serve as a complement to ordinary textbooks and articles used in education. The parts dealing with tests for unit roots (order of integration) and cointegration are not well developed. These topics have a memo of their own, "A Guide to testing for unit roots and cointegration".

When I started to put these lecture notes together some years ago I decided on the title "Lectures in Modern Time Series Econometrics" because I thought that the contents were a bit "modern" compared to the standard econometric textbook. During the fall of 2010, as I started to update the notes, I thought that it was time to remove the word "modern" from the title. A quick look in Damodar Gujarati's textbook "Basic Econometrics" from 2009 convinced me to keep the word "modern" in the title. Gujarati's text on time series hasn't changed since the 1970s, even though time series econometrics has changed completely since the 70s. Thus, under these circumstances I see no reason to change the title, at least not yet.

There are four ways in which one can do time series econometrics. The first is to use the approach of the 1970s: view your time series model just like any linear regression, and impose a number of ad hoc restrictions that will hide all the problems you find. This is not a good approach. It is only found in old textbooks and never in today's research. You might only see it used in low-quality scientific journals.

Second, you can use theory to derive a time series model, and interesting parameters, that you then estimate with appropriate estimators. Examples of this are to derive utility functions, to assume that agents have rational expectations, etc.
This is a proper research strategy. However, it typically takes good data, and you need to be original in your approach, but you can get published in good journals.

The third approach is simply to do a statistical description of the data series, in the form of a vector autoregressive system, or the reduced form of the vector error correction model. This system can be used for forecasting, for analysing relationships among data series, and for investigating the effects of unforeseen shocks such as drastic changes in energy prices, money supply etc.

The fourth way is to go beyond the vector autoregressive system and try to estimate structural parameters in the form of elasticities and policy intervention parameters.

If you forget about the first method, the choice depends on the problem at hand and how you choose to formulate it. This book aims at telling you how to use methods three and four. The basic thinking is that your data is the real world, while theories are abstractions that we use to understand the real world. In applied econometric time series you should always strive to build well-defined statistical models, that is, models that are consistent with the data chosen. There is a complex statistical theory behind all this, which I will try to popularize in this book. I do not see this book as a substitute for an ordinary textbook. It is simply a complement.

1.1 Outline of this Book/Text/Course/Workshop

This book is intended for people who have done a basic course in statistics and econometrics, either at the undergraduate or at the graduate level. If you did an undergraduate course I assume that you did it well. Econometrics is a type of course where every lecture, and every textbook chapter, leads to the next level. The best way to learn econometrics is to be active, read several books, and work on your own with econometric software. No teacher can teach you how to run a piece of software. That is something you have to learn on your own by practicing how to use the software.
There is some very good software out there, and some ...

The outline differs between the graduate and the Ph.D. level mainly in the theoretical parts. At the Ph.D. level, there is more stress on the theoretical background.

1) I will begin by talking about why econometrics is different from statistics, and why econometric time series is different from the econometrics you meet in many basic textbooks.
2) I will repeat very briefly basic statistics and linear regression, and stress what you should know in terms of testing and modeling dynamic models. For most students that will imply going back and doing some quick repetition.
3) Introduction to statistical theory, including maximum likelihood, random variables, density functions and stochastic processes.
4) Basic time series properties and processes.
5) Using and understanding ARFIMA and VAR modelling techniques.
6) Testing for non-stationarity in the form of stochastic trends, i.e. tests for unit roots.
7) The spurious regression problem.
8) Testing and understanding cointegration.
9) Testing for Granger non-causality.
10) The theory of reduction, exogeneity and building dynamic models and systems.
11) Modelling time varying variances, ARCH and GARCH models.
12) The implications and consequences of rational expectations for econometric modelling.
13) Non-linearities.
14) Additional topics.

For most of these topics I have developed more or less self-instructing exercises.

1.2 Why Econometrics?

Why is there a subject called econometrics? Why study econometrics, instead of statistics? Why not let the statisticians teach statistics, and in particular time series techniques? These are common questions, raised during seminars and in private, by students, statisticians and economists. The answer is that each scientific area tends to create its own special methodological problems, often heavily interrelated with theoretical issues.
These problems, and the ways of solving them, are important in a particular area of science but not necessarily in others. Economics is a typical example, where the formulation of the economic problem and of the statistical problem are deeply interrelated from the beginning.

In everyday life we are forced to make decisions based on limited information. Most of our decisions deal with an uncertain, stochastic future. We all base our decisions on some view of the economy where we assume that certain events are linked to each other in more or less complex ways. Economists call this a model of the economy. We can describe the economy and the behavior of the individuals in terms of multivariate stochastic processes. Decisions based on stochastic sequences play a central role in economics and in finance. Stochastic processes are the basis for our understanding of the behavior of economic agents and of how their behavior determines the future path of the economy.

Most econometric textbooks deal with stochastic time series as a special application of the linear regression technique. Though this approach is acceptable for an introductory course in econometrics, it is unsatisfactory for students with a deeper interest in economics and finance. To understand the empirical and theoretical work in these areas, it is necessary to understand some of the basic philosophy behind stochastic time series.

This work is a work in progress. It is based on my lectures on Modern Economic Time Series Analysis at the Department of Economics, first at the University of Gothenburg and later at the University of Skövde and Linköping University in Sweden. The material is not ready for widespread distribution. This work most likely contains lots of errors; some are known to the author, and some are not yet detected. The different sections do not necessarily follow in a logical order. Therefore, I invite anyone who has opinions about this work to share them with me.
The first part of this work provides a repetition of some basic statistical concepts, which are necessary for understanding modern economic time series analysis. The motive for repeating these concepts is that they play a larger role in econometrics than many contemporary textbooks in econometrics indicate. Econometrics did not change much from the first edition of Johnston in the 60s until the revised version of Kmenta in the mid 80s. However, the critique against the use of econometrics delivered by Sims, Lucas, Leamer, Hendry and others, in combination with new insights into the behavior of non-stationary time series and the rapid development of computer technology, has revolutionized econometric modeling and resulted in an explosion of knowledge. The requirements for writing a decent thesis, or a scientific paper, based on econometric methods have risen far beyond what one can learn in an introductory course in econometrics.

1.3 Junk Science and Junk Econometrics

In the media you often hear about this and that being proved by scientific research. In the late 1990s newspapers reported that someone had proved that genetically modified (GM) food could be dangerous. The news spread quickly, and according to the story the original article had been stopped from being published by scientists with suspicious motives. Various lobby groups immediately jumped up: GM food was dangerous, it should be banned, and more money should go into this line of research.

What had happened was the following. A researcher claimed to have shown that GM food was bad for health. He presented these results to a number of media people, who distributed them. (Remember the fuss about 'cold fusion'.) The results were presented in a paper sent to a scientific journal for publication. The journal, however, did not publish the article. It was dismissed because the results were not based on a sound scientific method. The researcher had fed rats with potatoes.
One group of rats got GM potatoes, the other group of rats got normal non-GM potatoes. The rats that got GM potatoes seemed to develop cancer more often than the control group. The statistical difference between the groups was not big, but sufficiently big for those wanting to confirm their a priori beliefs that GM food is bad. A somewhat embarrassing detail, never reported in the media, is that rats in general do not like potatoes. As a consequence, both groups of rats in this study were suffering from starvation, which severely affected the test. It was not possible to determine whether the difference between the two groups was caused by starvation or by GM food. Once the researcher conditioned on the effects of starvation, the difference became insignificant.

This is an example of "junk science": bad science getting a lot of media exposure because the results fit the interests of lobby groups and can be used to scare people. The lesson for econometricians is obvious: if you come up with "good" results you get rewarded, while "bad" results can quickly be forgotten. The GM food example is extreme compared with econometric work. Econometric research seldom gets such media coverage, though there are examples, such as the claim that Sweden's economic growth is lower than that of other similar countries, or the assumed dynamic effects of a reduction of marginal taxes. There are significant results that depend on one single outlier. Once the outlier is removed, the significance is gone, and the whole story behind this particular book is also gone.

In these lectures we will argue that the only way to avoid junk econometrics is careful and systematic construction and testing of models. Basically, this is the modern econometric time series approach. Why is this modern, and why stress the idea of testing?
The answers are simply that careers have been built on running junk econometric equations, and that most people are unfamiliar with scientific methods in general and with the consequences of living in a world surrounded by random variables in particular.

2. INTRODUCTION TO ECONOMETRIC TIME SERIES

"Time is a great teacher, but unfortunately it kills all its pupils" Louis Hector Berlioz

A time series is simply data ordered by time. For an econometrician, a time series is usually data that is also generated over time in such a way that time can be seen as a driving factor behind the data. Time series analysis is simply the set of approaches that look for regularities in these data ordered by time.

In comparison with other academic fields, the modeling of economic time series is characterized by the following problems, which partly motivate why econometrics is a subject of its own:

The empirical sample sizes in economics are generally small, especially compared with many applications in physics or biology. Typical sample sizes range between 25 and 100 observations. In many areas anything below 500 observations is considered a small sample.

Economic time series are dependent in the sense that they are correlated with other economic time series. In economic science, problems are almost never concerned with univariate series. Consumption, as an example, is a function of income, and at the same time, consumption also affects income directly and through various other variables.

Economic time series are often dependent over time. Many series display high autocorrelation, as well as cross autocorrelation with other variables over time.

Economic time series are generally non-stationary. Their means and variances change over time, implying that estimated parameters might follow unknown distributions instead of standard tabulated distributions like the normal distribution. Non-stationarity arises from productivity growth and price inflation.
Non-stationary economic series appear to be integrated, driven by stochastic trends, perhaps as a result of stochastic changes in total factor productivity. Integrated variables, and in particular the need to model them, are not that common outside economics. In some situations, therefore, inference in econometrics becomes quite complicated, and requires the development of new statistical techniques for handling stochastic trends. The concepts of cointegration and common trends, and the recently developed asymptotic theory for integrated variables, are examples of this.

Economic time series cannot be assumed to be drawn from samples in the way assumed in classical statistics. The classical approach is to start from a population from which a sample is drawn. Since the sampling process can be controlled, the variables which make up the sample can be seen as random variables. Hypotheses are then formulated and tested conditionally on the assumption that the random variables have a specific distribution. Economic time series are seldom random variables drawn from some underlying population in the classical statistical sense. Observations do not represent a random sample in the classical statistical sense, because the econometrician cannot control the sampling process of variables. Variables like GDP, money, prices and dividends are given from history. To get a different sample we would have to re-run history, which of course is impossible. The way statistical theory deals with this situation is to reverse the approach taken in classical statistical analysis, and build a model that describes the behavior of the observed data. A model which achieves this is called a well-defined statistical model. It can be understood as a parsimonious, time invariant model with white noise residuals, that makes sense from economic theory.
Finally, from the view of economics, the subject of statistics deals mainly with the estimation of, and inference on, covariances only. The econometrician, however, must also give estimated parameters an economic interpretation. This problem cannot always be solved ex post, after a model has been estimated. When it comes to time series, economic theory is an integrated part of the modeling process. Given a well-defined statistical model, estimated parameters should represent the behavior of economic agents. Many econometric studies fail because researchers assume that their estimates can be given an economic interpretation without considering the statistical properties of the model, or the simple fact that there is in general not a one-to-one correspondence between observed variables and the concepts defined in economic theory.[1]

2.1 Programs

Here is a list of statistical software that you should be familiar with; please google (those recommended for time series are marked with *):

- *RATS and CATS in RATS: Regression Analysis of Time Series and Cointegration Analysis of Time Series (www.estima.com)
- *PcGive: comes highly recommended. Included in the OxMetrics modules; see also Timberlake Consultants for more programs.
- *Gretl (free, GNU license, very good for students in econometrics)
- *JMulTi (free, for multivariate time series analysis; updated? The discussion forum is quite dead, www.jmulti.com)
- *EViews
- Gauss (good for simulation)
- STATA (used by the World Bank, good for microeconometrics and panel data, OK on time series)
- LIMDEP ('mostly free' with some editions of Greene's econometrics textbook? You need to pay for duration models?)
- SAS - Statistical Analysis System (good for big data sets, but not time series, mainly medicine, "the calculus program for decision makers")
- Shazam

And more; some are very special programs for this and that, but I don't find them worth mentioning in this context.
[1] For a recent discussion about the controversies in econometrics see The Economic Journal 1996.

There is a bunch of software that allows you to program your own models or use other people's modules:

- Matlab
- R (free, GNU license, connects with Gretl)
- Ox

You should also know about C, C++, and LaTeX to be a good econometrician. Please google. For Data Envelopment Analysis (DEA) I recommend Tom Coelli's DEAP 2.1 or Paul W. Wilson's FEAR.

2.2 Different types of time series

Given the general definition of time series above, there are many types of time series. The focus in econometrics, macroeconomics and finance is on stochastic time series, typically in the time domain, which are non-stationary in levels but become what is called covariance stationary after differencing.

In a broad perspective, time series analysis typically aims at making time series more understandable by decomposing them into different parts. The aim of this introduction is to give a general overview of the subject. A time series is any sequence ordered by time. The sequence can be either deterministic or stochastic. The primary interest in economics is in stochastic time series, where the sequence of observations is made up of the outcomes of random variables. A sequence of stochastic variables ordered by time is called a stochastic time series process.

The random variables that make up the process can either be discrete random variables, taking on a given set of integer numbers, or be continuous random variables, taking on any real number between $-\infty$ and $+\infty$. While discrete random variables are possible, they are not that common in economic time series research.

Another dimension in modeling time series is to consider processes in discrete time or in continuous time. The principal difference is that stochastic variables in continuous time can take different values at any time.
In a discrete time process, the variables are observed at fixed intervals of time ($t$), and they do not change between these observation points. Discrete time variables are not common in finance and economics. There are few, if any, variables that remain fixed between their points of observation. The distinction between continuous time and discrete time is not a matter of measurability alone. A common mistake is to be confused by the fact that economic variables are measured at discrete time intervals. The money stock is generally measured and recorded as an end-of-month value. This way of measuring the stock of money does not imply that it remains unchanged between the observation points; instead it changes whenever the money market is open. The same holds for variables like production and consumption. These activities take place 24 hours a day, during the whole year. They are measured as the flow of income and consumption over a period, typically a quarter, representing the integral sum of these activities.

Usually, a discrete time variable is written with a time subscript ($x_t$), while continuous time variables are written as $x(t)$. The continuous time approach has a number of benefits, but the cost and quality of the empirical results seldom motivate the continuous time approach. It is better to use discrete time approaches as an approximation to the underlying continuous time system. The cost of this simplification is small compared with the complexity of continuous time analysis. This should not be understood as a rejection of all continuous time approaches. Continuous time is good for analyzing a number of well-defined problems like aggregation over time and individuals. In the end it should lead to a better understanding of adjustment speeds, stability conditions and interactions among economic time series, see Sjöö (1990, 1995).[2]

In addition, stochastic time series can be analysed in the time domain or in the frequency domain.
In the time domain the data is analysed ordered in given time periods such as days, weeks, years etc. The frequency approach decomposes time series into frequencies by using trigonometric functions like sines and cosines. Spectral analysis is an example of analysis that uses the frequency domain to identify regularities such as seasonal factors, trends, and systematic lags in adjustment. The main advantage of analysing time series in the frequency domain is that it is relatively easy to handle continuous time processes and observations that are aggregations over time, such as consumption. However, in economics and finance we are typically faced with given observations at given frequencies, and we seek to study the behavior of agents operating in real time. Under these circumstances, the time domain is the most interesting road ahead because it has a direct intuitive appeal to both economists and policy makers.

Our interest is usually in analysing discrete time stochastic processes in the time domain. A time series process is generally indicated with braces, like $\{y_t\}$. In some situations it is necessary to be more precise about the length of the process. Writing $\{y_t\}_{1}^{\infty}$ indicates that the process starts at period one and continues infinitely. The process consists of random variables because we can view each element in $\{y_t\}$ as a random variable. Let the process go from the integer values 1 up to $T$. If necessary, to be exact, the first variable in the process can be written as $y_{t_1}$, the second variable as $y_{t_2}$, etc., up until $y_{t_T}$. The distribution function of the process can then be written as $F(y_{t_1}, y_{t_2}, \ldots, y_{t_T})$.

[2] We can also mention the different types of series that are used: stocks, flows and price variables. Stocks are variables that can be observed at a point in time, like the money stock or inventories. Flows are variables that can only be observed over some period, like consumption or GDP. In this context price variables include prices, interest rates and similar variables which can be observed in a market at a given point in time. Combining these variables into multivariate processes, and constructing econometric models from observed variables in discrete time, produces further problems, and in general they are quite difficult to solve without using continuous time methods. Usually, careful discrete time models will reduce the problems to a large extent.

In some situations it is necessary to start from the very beginning. A time series is data ordered by time. A stochastic time series is a set of random variables ordered by time. Let $\tilde{Y}_{it}$ represent the stochastic variable $\tilde{Y}_i$ given at time $t$.
Observations on this random variable are often indicated as $y_{it}$. In general terms a stochastic time series is a series of random variables ordered by time. A series starting at time $t = 1$ and ending at time $t = T$, consisting of $T$ different random variables, is written as $\{\tilde{Y}_{1,1}, \tilde{Y}_{2,2}, \ldots, \tilde{Y}_{T,T}\}$. Of course, assuming that the series is built up from individual random variables, each with its own independent probability distribution, is a complex thought. But nothing in our definition of a stochastic time series rules out that the data are made up of completely different random variables. Sometimes, to understand and find solutions to practical problems, it will be necessary to go all the way back to the most basic assumptions.

Suppose we are given a time series consisting of yearly observations of interest rates, $\{6.6, 7.5, 5.9, 5.4, 5.5, 4.5, 4.3, 4.8\}$. The first question to ask is whether this is a stochastic series in the sense that these numbers were generated by one stochastic process, or perhaps by several different stochastic processes. Further questions are whether the process or processes are best represented as continuous or discrete, and whether the observations are independent or dependent. Quite often we will assume that the series are generated by the same identical stochastic process in discrete time. Based on these assumptions, the modelling process tries to find systematic historical patterns and cross-correlations with other variables in the data.

All time series methods aim at decomposing the series into separate parts in some way. The standard approach in time series analysis is to decompose the series as

$y_t = T_{t,d} + S_{t,d} + C_{t,d} + I_t$,

where $T_{t,d}$ and $S_{t,d}$ represent (deterministic) trend and seasonal components, $C_{t,d}$ is a deterministic cyclical component and $I_t$ is a process representing irregular factors (footnote 3). For time series econometrics this definition is limited, since the econometrician is highly interested in the irregular component.
As an alternative, let $\{y_t\}$ be a stochastic time series process, which is composed as

$y_t$ = systematic components + unsystematic components $= T_d + T_s + S_d + S_s + \{y_t\} + e_t$, (2.1)

where the systematic components include deterministic trends $T_d$, stochastic trends $T_s$, deterministic seasonals $S_d$, stochastic seasonals $S_s$, a stationary process (or the short-run dynamics) $y_t$, and finally a white noise innovation term $e_t$. The modeling problem can be described as the problem of identifying the systematic components such that the residual becomes a white noise process. For all series, remember that any inference is potentially wrong if not all components have been modeled correctly. This is so regardless of whether we model a simple univariate series with time series techniques, a reduced system, or a structural model. Inference is only valid for a correctly specified model.

3 For simplicity we assume a linear process. An alternative is to assume that the components are multiplicative, $x_t = T_{t,d} \, S_{t,d} \, C_{t,d} \, I_t$.

2.3 Repetition - Your First Courses in Statistics and Econometrics

1. To be completed...

In your first course in statistics you learned how to use descriptive statistics: the mean and the variance. Next you learned to calculate the mean and the variance from a sample that represents the whole underlying population. For the mean and the variance to work as a description of the underlying population, it is necessary to construct the sample in such a way that the difference between the sample mean and the true population mean is non-systematic, meaning that the difference between the sample mean and the population mean is unpredictable. This means that your estimated sample mean is a random variable with known characteristics. The most important thing is to construct a sampling mechanism so that the mean calculated from the sample has the characteristics you want to have.
That is, the estimated mean should be unbiased, efficient and consistent. You learned about random variables, probabilities, distribution functions and frequency distributions.

Your first course in econometrics

"A theory should be as simple as possible, but not simpler" Albert Einstein

To be completed...

Random variables, OLS, minimize the sum of squares, assumptions 1 - 5(6), understanding, multiple regression, multicollinearity, properties of the OLS estimator
Matrix algebra
Tests and 'solutions' for heteroscedasticity (cross-section), and autocorrelation (time series).
If you took a good course you should have learned the three golden rules: test test test, and learned about the properties of the OLS estimator.
Generalized least squares (GLS)
System estimation: demand and supply models.
Further extensions: panel data, Tobit, Heckit, discrete choice, probit/logit, duration
Time series: distributed lag models, partial adjustment models, error correction models, lag structure, stationarity vs. non-stationarity, co-integration

What you need to know ...
What you probably do not know but should know.

OLS

Ordinary least squares is a common estimation method.
Suppose there are two series $\{y_t, x_t\}$, related as

$y_t = \alpha + \beta x_t + \varepsilon_t$.

Minimize the sum of squares over the sample $t = 1, 2, \ldots, T$,

$S = \sum_{t=1}^{T} \varepsilon_t^2 = \sum_{t=1}^{T} (y_t - \alpha - \beta x_t)^2$.

Take the derivative of $S$ with respect to $\alpha$ and $\beta$, set the expressions to zero, and solve for $\alpha$ and $\beta$:

$\partial S / \partial \alpha = -2 \sum_{t=1}^{T} (y_t - \alpha - \beta x_t) = 0$,
$\partial S / \partial \beta = -2 \sum_{t=1}^{T} x_t (y_t - \alpha - \beta x_t) = 0$,

which gives $\hat{\beta} = \sum_t (x_t - \bar{x})(y_t - \bar{y}) / \sum_t (x_t - \bar{x})^2$ and $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$.

The total sum of squares splits into an explained and a residual part:

$TSS = ESS + RSS$
$1 = ESS/TSS + RSS/TSS$
$R^2 = 1 - RSS/TSS = ESS/TSS$

Basic assumptions
1) $E(\varepsilon_t) = 0$ for all $t$
2) $E(\varepsilon_t^2) = \sigma^2$ for all $t$
3) $E(\varepsilon_t \varepsilon_{t-k}) = 0$ for all $k \neq 0$
4) $E(X_t \varepsilon_t) = 0$
5) $E(X'X) \neq 0$
6) $\varepsilon_t \sim NID(0, \sigma^2)$

Discuss these properties

Properties
Gauss-Markov, BLUE

Deviations
Misspecification: add extra variable, forget relevant variable
Multicollinearity
Error in variables problem
Homoscedasticity
Heteroscedasticity
Autocorrelation

Part I
Basic Statistics

3. TIME SERIES MODELING - AN OVERVIEW

Economists are generally interested in a small part of what is normally included in the subject "Time Series Analysis". Various techniques such as filtering, smoothing and interpolation, developed for deterministic time series, are of relatively minor interest for economists. Time series econometrics is more focused on the stochastic part of time series. The following is a brief overview of time series modeling from an econometric perspective. It is not a textbook in mathematical statistics, nor is the ambition to be extremely rigorous in the presentation of statistical concepts. The aim is rather to be a guide for the not yet so informed economist who wants to know more about the statistical concepts behind time series econometrics.

When approaching time series econometrics, the statistical vocabulary quickly increases and can become overwhelming. These first two chapters seek to make it possible for people without deeper knowledge of mathematical statistics to read and follow the econometric and financial time series literature.
A time series is simply a set of observations ordered by time. Time series techniques seek to decompose this ordered series into different components, which in turn can be used to generate forecasts and to learn about the dynamics of the series and how it relates to other series. There are a number of dimensions and decisions to keep track of when approaching this subject. First, the series, or the process, can be univariate or multivariate, depending on the problem at hand. Second, the series can be stochastic or purely deterministic. In the former case a stochastic random process is generating the observations. Third, given that the series is stochastic, with perhaps deterministic components, it can be modeled in the time domain or in the frequency domain. Modeling in the frequency domain implies describing the series in terms of cosine functions of different wavelengths. This is a useful approach for solving some problems, but not a general approach for economic time series modeling. Fourth, the data generating process and the statistical model can be constructed in continuous or discrete time. Continuous time econometrics is good for some problems but not all. In general it leads to more complex models. A discrete time approach builds on the assumption that the observed data are unchanged between the points of observation. This is a convenient approximation that makes modeling easier, but it comes at a cost in the form of aggregation biases. However, in the general case, this is a low cost compared with the costs of general misspecification. A special chapter deals with the discussion of discrete versus continuous time modeling.

The typical economic time series is a discrete stochastic process modeled in the time domain. Time series can be modelled by smoothing and filter techniques. For economists these techniques are generally uninteresting, though we will briefly come back to the concept of filters.
The simplest way to model an economic time series is to use autoregressive techniques, or ARIMA techniques in the general case. Most economic time series, however, are better modeled as part of a multivariate stochastic process. Economic theory suggests systems of economic variables, leading to single equation transfer functions and systems of equations in a VAR model. These techniques are descriptive; they do not identify structural or "deep" parameters like elasticities, marginal propensities to consume etc. To estimate more specific economic models, we turn to techniques such as VECM, SVAR, and structural VECM.

What is outlined above is quite different from the typical basic econometric textbook approach, which starts with OLS and ends in practice with GLS as the solution to all problems. Here we will develop methods which first describe the statistical properties of the (joint) series at hand, and then allow the researcher to answer economic questions in such a way that the conclusions are statistically and economically valid. To get there we have to start with some basic statistics.

3.1 Statistical Models

A general definition of statistical time series analysis is that it finds a mathematical model that links observed variables with the stochastic mechanism that generated the data. This sounds abstract, but the purpose of this abstraction is to understand the analytical tools of time series statistics. The practical problem is the following: we have some stochastic observations over time. We know that these observations have been generated by a process, but we do not know what this process looks like. Statistical time series analysis is about developing the tools needed to mimic the unknown data generating process (DGP).

We can formulate some general features of the model. First, it should be "a well-defined statistical model" in the sense that the assumptions behind the model should be valid for the data chosen.
Later we will define more exactly what this implies for an econometric model. For the time being, we can say that the single most important criterion for a model is that the residuals should be a white noise process. Second, the parameters of the model should be stable over time. Third, the model should be simple, or parsimonious, meaning that its functional form should be simple. Fourth, the model should be parameterized in such a way that it is possible to give the parameters a clear interpretation and identify them with events in the real world. Finally, the model should be able to explain other rival models describing the dependent variable(s).

The way to build a "well-defined statistical model" is to investigate the underlying assumptions of the model in a systematic way. It can easily be shown that t-values, $R^2$, and Durbin-Watson values are not sufficient for determining the fit of a model. In later chapters we will introduce a systematic test procedure.

The final aim of econometric modelling is to learn about economic behavior. To some extent this always implies using some a priori knowledge in the form of theoretical relationships. Economists, in general, have extremely strong a priori beliefs about the size and sign of certain parameters. This way of thinking has led to much confusion, because a priori beliefs can be driven too far. Econometrics is basically about measuring correlations. It is a common misunderstanding among non-econometricians that correlations can be too high or too low, or be deemed right or wrong. Measured correlations are the outcome of the data used, and nothing else. Anyone who thinks of an estimated correlation as "wrong" must also explain what went wrong in the estimation process, which requires knowledge of econometrics and the real world.

3.2 Random Variables

The basic reason for dealing with stochastic models rather than deterministic models is that we are faced with random variables.
A popular definition of random variables goes like this: a random variable is a variable that can take on more than one value (footnote 1). For every possible value that a random variable can take on there is a number between zero and one that describes the probability that the random variable will take on this value. In the following a random variable is indicated with a tilde, as in $\tilde{X}$.

In statistical terms, a random variable is associated with the outcome of a statistical experiment. All possible outcomes of such an experiment can be called the sample space. If $S$ is a sample space with a probability measure and if $\tilde{X}$ is a real valued function defined over $S$, then $\tilde{X}$ is called a random variable.

There are two types of random variables: discrete random variables, which only take on a specific number of real values, and (absolutely) continuous random variables, which can take on any value between $-\infty$ and $+\infty$. It is also possible to examine discontinuous random variables, but we will limit ourselves to the first two types.

If the discrete random variable $\tilde{X}$ can take $k$ values $(x_1, \ldots, x_k)$, the probability of observing a value $x_j$ can be stated as

$P(x_j) = p_j$. (3.1)

Since probabilities of discrete random variables are additive, the probability of observing one of the $k$ possible outcomes is equal to 1.0, or, using the notation just introduced,

$P(x_1, x_2, \ldots, \text{or } x_k) = p_1 + p_2 + \cdots + p_k = 1$. (3.2)

A discrete random variable is described by its probability function, $F(x_i)$, which specifies the probability with which $\tilde{X}$ takes on a certain value. (The term cumulative distribution is used synonymously with probability function.)

In time series econometrics we are in most applications dealing with continuous random variables. Unlike discrete variables, it is not possible to associate a specific observation with a certain probability, since these variables can take on an infinite range of numbers. The probability that a continuous random variable will take on a certain value is always zero.
Because it is continuous we cannot distinguish between 1.01 and 1.0101 etc. This does not mean that the variables do not take on specific values. The outcome of the experiment, or the observation, is of course always a given number.

Thus, for a continuous random variable, statements about the probability of an observation must be made in terms of the probability that the random variable $\tilde{X}$ is less than or equal to some specific value. We express this with the distribution function $F(x)$ of the random variable $\tilde{X}$ as follows,

$F(x) = P(\tilde{X} \le x)$ for $-\infty < x < \infty$, (3.3)

which states the probability of $\tilde{X}$ taking a value less than or equal to $x$.

The continuous analogue of the probability function is called the density function $f(x)$, which we get by differentiating the distribution function with respect to the observations ($x$),

$\dfrac{dF(x)}{dx} = f(x)$. (3.4)

1 Random variables (RV:s) are also called stochastic variables, chance variables, or variates.

The fundamental theorem of integral calculus gives us the following expression for the probability that $\tilde{X}$ takes on a value less than or equal to $x$,

$F(x) = \int_{-\infty}^{x} f(u)\,du$. (3.5)

It follows that for any two constants $a$ and $b$, with $a < b$, the probability that $\tilde{X}$ takes on a value in the interval from $a$ to $b$ is given by

$F(b) - F(a) = \int_{-\infty}^{b} f(u)\,du - \int_{-\infty}^{a} f(u)\,du$ (3.6)
$\phantom{F(b) - F(a)} = \int_{a}^{b} f(u)\,du$. (3.7)

The term density function is used in a way that is analogous to density in physics. Think of a rod of variable density, measured by the function $f(x)$. To obtain the weight of some given length of this rod, we would have to integrate its density function over that particular part in which we are interested.

Random variables are described by their density function and/or by their moments: the mean, the variance etc. Given the density function, the moments can be determined exactly. In statistical work, we must first estimate the moments; from the moments we can learn about the density function.
For instance, we can test whether the assumption of an underlying normal density function is consistent with the observed data.

A random variable can be predicted; in other words, it is possible to form an expectation of its outcome based on its density function. Appendix III deals with the expectations operator and other operators related to random variables.

3.3 Moments of random variables

Random variables are characterized by their probability density functions (pdf:s) or their moments. In the previous section we introduced pdf:s. Moments refer to measurements such as the mean, the variance, skewness, etc. If we know the exact density function of a random variable then we also know the moments. In applied work, we will typically first calculate the moments from a sample, and from the moments figure out the density function of the variables. The term moment originates from physics and the moment of a pendulum. For our purposes it can be thought of as a general term which includes the definition of concepts like the mean and the variance, without referring to any specific distribution.

Starting with the first moment, the mathematical expectation of a discrete random variable is given by

$E(\tilde{X}) = \sum_{x} x f(x)$, (3.8)

where $E$ is the expectation operator and $f(x)$ is the value of its probability function at $x$. Thus, $E(\tilde{X})$ represents the mean of the discrete random variable $\tilde{X}$, or, in other words, the first moment of the random variable. For a continuous random variable $\tilde{X}$, the mathematical expectation is

$E(\tilde{X}) = \int_{-\infty}^{\infty} x f(x)\,dx$, (3.9)

where $f(x)$ is the value of its probability density at $x$. The first moment can also be referred to as the location of the random variable. Location is a more generic concept than the first moment or the mean.

The term moments is used in situations where we are interested in the expected value of a function of a random variable, rather than the expectation of the specific variable itself.
Say that we are interested in $\tilde{Y}$, whose values are related to $\tilde{X}$ by the equation $y = g(x)$. The expectation of $\tilde{Y}$ is equal to the expectation of $g(x)$, since $E(\tilde{Y}) = E[g(x)]$. In the continuous case this leads to

$E(\tilde{Y}) = E[g(\tilde{X})] = \int_{-\infty}^{\infty} g(x) f(x)\,dx$. (3.10)

Like density, the term moment, or moment about the origin, has its explanation in physics. (In physics the length of a lever arm is measured as the distance from the origin. Or, if we refer to the example with the rod above, the first moment around the mean would correspond to the horizontal center of gravity of the rod.) Reasoning from intuition, the mean can be seen as the midpoint of the limits of the density. The midpoint can be scaled in such a way that it becomes the origin of the x-axis.

The term "moments of a random variable" is a more general way of talking about the mean and variance of a variable. Setting $g(x)$ equal to $x^r$, we get the r:th moment around the origin,

$\mu'_r = E(\tilde{X}^r) = \sum_{x} x^r f(x)$, (3.11)

when $\tilde{X}$ is a discrete variable. In the continuous case we get

$\mu'_r = E(\tilde{X}^r) = \int_{-\infty}^{\infty} x^r f(x)\,dx$. (3.12)

The first moment is nothing else than the mean, or the expected value of $\tilde{X}$. The second moment, once taken about the mean, is the variance. Higher moments give additional information about the distribution and density functions of random variables.

Now, defining $g(\tilde{X}) = (\tilde{X} - \mu')^r$, where $\mu' = \mu'_1$ is the mean, we get what is called the r:th moment about the mean of the distribution of the random variable $\tilde{X}$. For $r = 0, 1, 2, 3, \ldots$ we get for a discrete variable,

$\mu_r = E[(\tilde{X} - \mu')^r] = \sum_{x} (x - \mu')^r f(x)$, (3.13)

and when $\tilde{X}$ is continuous,

$\mu_r = E[(\tilde{X} - \mu')^r] = \int_{-\infty}^{\infty} (x - \mu')^r f(x)\,dx$. (3.14)

The second moment about the mean, also called the second central moment, is nothing else than the variance of $g(x) = x$,

$var(\tilde{X}) = \int_{-\infty}^{\infty} [x - E(\tilde{X})]^2 f(x)\,dx$ (3.15)
$\phantom{var(\tilde{X})} = \int_{-\infty}^{\infty} x^2 f(x)\,dx - [E(\tilde{X})]^2$ (3.16)
$\phantom{var(\tilde{X})} = E(\tilde{X}^2) - [E(\tilde{X})]^2$, (3.17)

where $f(x)$ is the value of the probability density function of the random variable $\tilde{X}$ at $x$. A more generic expression for the variance is dispersion. We can say that the second moment, or the variance, is a measure of dispersion, in the same way as the mean is a measure of location.

The third moment, $r = 3$, measures asymmetry around the mean, referred to as skewness. The normal distribution is symmetric around the mean. The likelihood of observing a value above or below the mean is the same for a normal distribution. For a right skewed distribution, the likelihood of observing a value higher than the mean is higher than that of observing a lower value. For a left skewed distribution, the likelihood of observing a value below the mean is higher than that of observing a value above the mean.

The fourth moment, referred to as kurtosis, measures the thickness of the tails of the distribution. A distribution with thicker tails than the normal is characterized by a higher likelihood of extreme events compared with the normal distribution. Higher moments give further information about the skewness, tails and the peak of the distribution. The fifth, the seventh moments etc. give more information about the skewness. Even moments, above four, give further information about the thickness of the tails and the peak.

3.4 Popular Distributions in Econometrics

In time series econometrics, and financial economics, there is a small set of distributions that one has to know.
The following is a list of common distributions:

Distribution                  Notation
Normal distribution           $N(\mu, \sigma^2)$
Log normal distribution       $LogN(\mu, \sigma^2)$
Student t distribution        $St(\nu, \mu, \sigma^2)$
Cauchy distribution           $Ca(\mu, \sigma^2)$
Gamma distribution            $Ga(\nu, \mu, \sigma^2)$
Chi-square distribution       $\chi^2(\nu)$
F distribution                $F(d_1, d_2)$
Poisson distribution          $Pois(\lambda)$
Uniform distribution          $U(|a, b|)$

The pdf of a normal distribution is written as

$f(x) = \dfrac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\dfrac{(x - \mu)^2}{2\sigma^2} \right)$.

The normal distribution is characterized by the following: the distribution is symmetric around its mean, and it is fully described by two moments, the mean and the variance, $N(\mu, \sigma^2)$. The normal distribution can be standardised to have a mean of zero and a variance of unity (by computing $(x - E(x))/\sigma$) and is then called a standardised normal distribution, $N(0, 1)$. In addition, it follows that the first four moments, the mean, the variance, the skewness and the kurtosis, are $E(\tilde{X}) = \mu$, $Var(\tilde{X}) = \sigma^2$, $Sk(\tilde{X}) = 0$, and $Ku(\tilde{X}) = 3$.

There are random variables that are not normal by themselves but become normal if they are logged. The typical examples are stock prices and various macroeconomic variables. Let $S_t$ be a stock price. The dollar return over a given interval, $R_t = S_t - S_{t-1}$, is not likely to be normally distributed, due to the simple fact that the stock price is rising over time, partly because investors demand a return on their investment but mostly due to inflation. However, if you take the log of the stock price and calculate the (approximate) per cent return, $r_t = \ln S_t - \ln S_{t-1}$, this variable is much more likely to have a normal distribution (or a distribution that can be approximated with a normal distribution). Thus, since you have taken logs of variables in your econometric models, you have already worked with log normal variables. Knowledge about log normal distributions is necessary if you want to model, or better understand, the movements of actual stock prices and dollar returns.
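The difference between the dollar return $R_t = S_t - S_{t-1}$ and the log return $r_t = \ln S_t - \ln S_{t-1}$ can be illustrated with a few lines of Python. This is only a sketch: the price series is made up for the illustration, and the variable names are assumptions, not notation from these notes.

```python
import math

# Hypothetical closing prices S_t, invented for illustration only.
prices = [100.0, 102.0, 101.0, 105.0, 107.1]

# Pair each price with its predecessor: (S_{t-1}, S_t).
pairs = list(zip(prices[:-1], prices[1:]))

# Dollar return R_t = S_t - S_{t-1}: its scale grows with the price level.
dollar_returns = [s1 - s0 for s0, s1 in pairs]

# Simple per cent return (S_t - S_{t-1}) / S_{t-1}.
simple_returns = [(s1 - s0) / s0 for s0, s1 in pairs]

# Log return r_t = ln S_t - ln S_{t-1}, approximately the per cent return.
log_returns = [math.log(s1) - math.log(s0) for s0, s1 in pairs]

for R, s, r in zip(dollar_returns, simple_returns, log_returns):
    print(f"dollar={R:7.3f}  simple={s:8.5f}  log={r:8.5f}")

# Log returns add up over time to the log of the total gross return.
print(sum(log_returns), math.log(prices[-1] / prices[0]))
```

For small price changes the simple and log returns are close, which is why the log difference is described as an approximate per cent return; unlike dollar returns, log returns also aggregate over time by simple addition.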
The Student t distribution is similar to the normal distribution: it is symmetric around the mean and it has a variance, but it has thicker tails than the normal distribution. The Student t distribution is described by $(\nu, \mu, \sigma^2)$, where $\mu$ refers to the mean and $\sigma^2$ refers to the variance. The parameter $\nu$ is called the degrees of freedom of the Student t distribution and determines the thickness of the tails. A random variable that follows a Student t distribution will converge to a normal random variable as the number of observations goes to infinity.

The Cauchy distribution is related to the normal distribution and the Student t distribution. Like the normal it is symmetric and described by two parameters, but it has fatter tails and is therefore better suited for modelling random variables which take on relatively more extreme values than the normal. The drawback for empirical work is that its moments are not defined, meaning that it is difficult to use empirical moments to test for a Cauchy distribution against, say, the normal or the Student t distribution.

The gamma and the chi-square distributions are related to variances of normal random variables. If we have a set of normal random variables $\{\tilde{Y}_1, \tilde{Y}_2, \ldots, \tilde{Y}_\nu\}$ and form a new variable as $\tilde{X} = \tilde{Y}_1^2 + \tilde{Y}_2^2 + \cdots + \tilde{Y}_\nu^2$, then this new variable will have a gamma distribution, $\tilde{X} \sim Ga(\nu, \mu, \sigma^2)$. A special case of the gamma distribution is when we have $\mu = 0$ and $\sigma^2 = 1$; the distribution is then called a chi-square distribution, $\chi^2(\nu)$, with $\nu$ degrees of freedom. Thus, take the square of an estimated regression parameter and divide it by its variance and you get a chi-square distributed test for significance of the estimated parameter, $(\hat{\beta}/\hat{\sigma}_{\hat{\beta}})^2 \sim \chi^2(1)$.

The F distribution comes about when you compare the ratio (or log difference) of two squared normal random variables.

The Poisson distribution is used to model jumps in the data, usually in combination with a geometric Brownian motion (jump diffusion models).
The typical example is stock prices that might move up or down drastically. The parameter $\lambda$ measures the probability of a jump in the data.

3.5 Analysing the Distribution

In practical work we need to know the empirical distribution of the variables we are working with in order to make any inference. All empirical distributions can be analysed with the help of their first four moments. Through the first four moments we get information first about the mean and the variance, and second about the skewness and kurtosis. The latter moments are often critical when we decide if a certain empirical distribution should be seen as normal, or at least approximately normal.

It is, of course, extremely convenient to work with the assumption of a normal distribution, since a normal distribution is described by its first two moments only. In finance, the expected return is given by the mean, and the risk of the asset is given by its variance. An approximation to the holding period return of an asset is the log difference of its price. In the case of a normal distribution, there is no need to consider higher moments. Furthermore, linear combinations of normal variates result in new normally distributed variables. In econometric work, when building regression equations, the residual process is assumed to be a normally and independently distributed white noise process, in order to allow for inference and testing.

It is by calculating the sample moments that we learn about the distribution of the series at hand. The most typical problem in empirical work is to investigate how well the distribution of a variable can be approximated with a normal distribution. If the normal distribution is rejected for the residuals in a regression, the typical conclusion is that there is something important missing in the regression equation. The missing part is either an important explanatory variable, or the direct cause of an outlier.
To investigate the empirical distribution we need to calculate the sample moments of the variable. The sample mean of $\{x_t\} = \{x_1, x_2, \ldots, x_T\}$ can be estimated as $\hat{\mu}_x = \bar{x} = (1/T) \sum_{t=1}^{T} x_t$. Higher moments can be estimated with the formula $m_r = (1/T) \sum_{t=1}^{T} (x_t - \bar{x})^r$ (footnote 2).

If a series is normally distributed, $\tilde{X}_t \sim N(\mu_x, \sigma_x^2)$, subtracting the mean and dividing by the standard error leads to a standardised normal variable, distributed as $\tilde{X} \sim N(0, 1)$. For a standardised normal variable the third and fourth moments equal 0 and 3, respectively.

The standardised third moment is known as skewness, given as $\sqrt{b_1} = m_3 / m_2^{3/2}$, so that $b_1 = m_3^2 / m_2^3$. A skewness with a negative value indicates a left skewed distribution, compared with the normal. If the series is the return on an asset, it means that 'bad' or negative surprises dominate over 'good' positive surprises. A positive value of skewness implies a right skewed distribution. In terms of asset returns, 'good' or positive surprises are more likely than 'bad' negative surprises.

The fourth moment, kurtosis, is calculated as $b_2 = m_4 / m_2^2$. A value above 3 implies that the distribution generates more extreme values than the normal distribution. The distribution has fatter tails than the normal. Referring to asset returns, approximating the distribution with the normal would underestimate the risk associated with the asset.

An asymptotic test, with a null of a normal distribution, is given by (footnote 3)

$JB = T \left[ \dfrac{m_3^2 / m_2^3}{6} + \dfrac{\left( (m_4 / m_2^2) - 3 \right)^2}{24} \right] + T \left[ \dfrac{3 m_1^2}{2 m_2} + \dfrac{m_1 m_3}{m_2^2} \right] \sim \chi^2(2)$.

This test is known as the Jarque-Bera (JB) test and is the most common test for normality in regression analysis. The null hypothesis is that the series is normally distributed. Let $\mu_1$, $\mu_2$, $\mu_3$ and $\mu_4$ represent the mean, the variance, the skewness and the kurtosis. The null of a normal distribution is rejected if the test statistic is significant. The fact that the test is only valid asymptotically means that we do not know the reason for a rejection in a finite sample.
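The sample moments and the first bracket of the JB statistic can be computed in a few lines. The sketch below assumes a mean-adjusted series, so the second bracket of the statistic is left out, and uses two small made-up samples; the function names are illustrative assumptions. The p-value uses the fact that the survival function of a $\chi^2(2)$ variable is $\exp(-x/2)$.

```python
import math

def central_moment(x, r):
    """Sample moment m_r = (1/T) * sum((x_t - xbar)**r)."""
    T = len(x)
    xbar = sum(x) / T
    return sum((v - xbar) ** r for v in x) / T

def jarque_bera(x):
    """Return (skewness, kurtosis, JB, p-value) for a mean-adjusted series."""
    T = len(x)
    m2 = central_moment(x, 2)
    m3 = central_moment(x, 3)
    m4 = central_moment(x, 4)
    skew = m3 / m2 ** 1.5          # sqrt(b1) = m3 / m2^(3/2), keeps the sign
    kurt = m4 / m2 ** 2            # b2 = m4 / m2^2
    jb = T * (skew ** 2 / 6.0 + (kurt - 3.0) ** 2 / 24.0)
    p_value = math.exp(-jb / 2.0)  # survival function of chi-square(2)
    return skew, kurt, jb, p_value

# Hypothetical data: a symmetric sample, and the same sample plus one outlier.
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
with_outlier = symmetric + [10.0]

skew_s, kurt_s, jb_s, p_s = jarque_bera(symmetric)
skew_o, kurt_o, jb_o, p_o = jarque_bera(with_outlier)

print("symmetric:   ", skew_s, kurt_s, jb_s, p_s)
print("with outlier:", skew_o, kurt_o, jb_o, p_o)
```

With only five or six observations the asymptotic distribution is of course a poor guide; the example is meant to show the mechanics, and how a single extreme value pushes both skewness and the JB statistic upwards.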
In a finite sample, rejection of normality is often caused by outliers. If we think the most extreme value(s) in the sample are non-typical outliers, 'removing' them from the calculation of the sample moments usually results in a non-significant JB test. Removing outliers is ad hoc. It could be that these outliers are typical values of the true underlying distribution.

2 For these moments to be meaningful, the series must be stationary. Also, we would like $\{x_t\}$ to be an independent process. Finally, notice that the estimators of the higher moments suggested here are not necessarily efficient estimators.

3 This test statistic is for a variable with a non-zero mean. If the variable is adjusted for its mean (say an estimated residual), the second term should be removed from the expression.

3.6 Multidimensional Random Variables

We will now generalize the work of the previous sections by considering a vector of $n$ random variables,

$\tilde{X} = (\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_n)$, (3.18)

whose elements are continuous random variables with density functions $f(x_1), \ldots, f(x_n)$, and distribution functions $F(x_1), \ldots, F(x_n)$. The joint distribution will look like

$F(x_1, x_2, \ldots, x_n) = \int_{-\infty}^{x_n} \cdots \int_{-\infty}^{x_1} f(x_1, x_2, \ldots, x_n)\,dx_1 \cdots dx_n$, (3.19)

where $f(x_1, x_2, \ldots, x_n)$ is the joint density function.
If these random variables are independent, it will be possible to write their joint density as the product of their univariate densities,

$f(x_1, x_2, \ldots, x_n) = f(x_1) f(x_2) \cdots f(x_n)$. (3.20)

We can define the r:th product moment as

$E(\tilde{X}_1^{r_1} \tilde{X}_2^{r_2} \cdots \tilde{X}_n^{r_n})$ (3.21)
$= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} x_1^{r_1} x_2^{r_2} \cdots x_n^{r_n} f(x_1, x_2, \ldots, x_n)\,dx_1 dx_2 \cdots dx_n$, (3.22)

which, if the variables are independent, factorizes into the product

$E(\tilde{X}_1^{r_1}) E(\tilde{X}_2^{r_2}) \cdots E(\tilde{X}_n^{r_n})$. (3.23)

It follows from this result that the variance of a sum of independent random variables is merely the sum of the individual variances,

$var(\tilde{X}_1 + \tilde{X}_2 + \cdots + \tilde{X}_n) = var(\tilde{X}_1) + var(\tilde{X}_2) + \cdots + var(\tilde{X}_n)$. (3.24)

We can extend the discussion of covariance to linear combinations of random variables, say

$a'\tilde{X} = a_1 \tilde{X}_1 + a_2 \tilde{X}_2 + \cdots + a_p \tilde{X}_p$, (3.25)

which leads to

$cov(a'\tilde{X}) = \sum_{i=1}^{p} \sum_{j=1}^{p} a_i a_j \sigma_{ij}$. (3.26)

These results hold for matrices as well. If we have $\tilde{Y} = A\tilde{X}$, $\tilde{Z} = B\tilde{X}$, and the covariance matrix of $\tilde{X}$ is $\Sigma$, we also have that

$cov(\tilde{Y}, \tilde{Y}) = A \Sigma A'$, (3.27)
$cov(\tilde{Z}, \tilde{Z}) = B \Sigma B'$, (3.28)

and

$cov(\tilde{Y}, \tilde{Z}) = A \Sigma B'$. (3.29)

3.7 Marginal and Conditional Densities

Given a joint density function of $n$ random variables, the joint probability of a subset of them is called the joint marginal density. We can also talk about joint marginal distribution functions. If we set $n = 3$ we get the joint density function $f(x_1, x_2, x_3)$. Given the marginal distribution $g(x_2, x_3)$, the conditional probability density function of the random variable $\tilde{X}_1$, given that the random variables $\tilde{X}_2$ and $\tilde{X}_3$ take on the values $x_2$ and $x_3$, is defined as

$\varphi(x_1 \mid x_2, x_3) = \dfrac{f(x_1, x_2, x_3)}{g(x_2, x_3)}$, (3.30)

or

$f(x_1, x_2, x_3) = \varphi(x_1 \mid x_2, x_3)\, g(x_2, x_3)$. (3.31)

Of course we can define a conditional density for various combinations of $\tilde{X}_1$, $\tilde{X}_2$ and $\tilde{X}_3$, like $p(x_1, x_3 \mid x_2)$ or $g(x_3 \mid x_1, x_2)$.
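The definition of a conditional density as the ratio of a joint to a marginal density can be checked numerically in the simpler two-variable case. The sketch below assumes a bivariate normal distribution with made-up parameter values, and compares the ratio $f(x_1, x_2)/g(x_2)$ with the well-known closed form for the conditional normal density; all names and numbers are assumptions chosen for the illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Univariate normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def bivariate_normal_pdf(x1, x2, mu1, mu2, s1, s2, rho):
    """Joint density f(x1, x2) of a bivariate normal with correlation rho."""
    z1 = (x1 - mu1) / s1
    z2 = (x2 - mu2) / s2
    q = (z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2) / (1.0 - rho * rho)
    norm = 2.0 * math.pi * s1 * s2 * math.sqrt(1.0 - rho * rho)
    return math.exp(-0.5 * q) / norm

# Hypothetical parameters and evaluation point.
mu1, mu2, s1, s2, rho = 0.0, 1.0, 2.0, 1.5, 0.6
x1, x2 = 0.3, -0.5

# Conditional density as joint / marginal, the two-variable analogue of (3.30).
joint = bivariate_normal_pdf(x1, x2, mu1, mu2, s1, s2, rho)
marginal_x2 = normal_pdf(x2, mu2, s2)
conditional_ratio = joint / marginal_x2

# Closed form: X1 | X2 = x2 is normal with shifted mean and reduced variance.
cond_mean = mu1 + rho * s1 / s2 * (x2 - mu2)
cond_sd = s1 * math.sqrt(1.0 - rho * rho)
conditional_closed = normal_pdf(x1, cond_mean, cond_sd)

print(conditional_ratio, conditional_closed)
```

The two numbers agree up to rounding error. Note that the conditional mean is linear in $x_2$, which is the link between conditional densities and the linear regression model.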
Instead of three different variables we can also talk about the density function for one random variable, say $\tilde{Y}_t$, for which we have a sample of $T$ observations. If all observations are independent we get

$$f(y_1, y_2, \dots, y_T) = f(y_1) f(y_2) \cdots f(y_T). \tag{3.32}$$

As before we can also look at conditional densities, like

$$f(y_t \mid y_1, y_2, \dots, y_{t-1}), \tag{3.33}$$

which in this case would mean that $y_t$, the observation at time $t$, depends on all earlier observations on $\tilde{Y}_t$.

We seldom deal with independent variables when modeling economic time series. For example, a simple first order autoregressive model like $y_t = \phi y_{t-1} + \epsilon_t$ implies dependence between the observations. The same holds for all time series models. Despite this shortcoming, density functions with independent random variables are still good tools for describing time series modelling, because the results based on independent variables carry over to dependent variables in almost every case.

3.8 The Linear Regression Model - A General Description

In this section we look at the linear regression model starting from two random variables $\tilde{Y}$ and $\tilde{X}$. Two regressions can be formulated,

$$y = \alpha + \beta x + \epsilon, \tag{3.34}$$

and

$$x = \gamma + \delta y + \nu. \tag{3.35}$$

Whether one chooses to condition $y$ on $x$, or $x$ on $y$, depends on the parameter of interest. In the following it is shown how these regression expressions are constructed from the correlation between $x$ and $y$ and their first moments, by making use of the (bivariate) joint density function of $x$ and $y$. (One can view this section as an exercise in using density functions.)

Without explicitly stating what the density function looks like, we will assume that we know the joint density function for the two random variables $\tilde{Y}$ and $\tilde{X}$, and want to estimate a set of parameters, $\alpha$ and $\beta$.
Hence we have the joint density

$$D(y, x; \psi), \tag{3.36}$$

where $\psi$ is a vector of parameters which describes the relation between $\tilde{Y}$ and $\tilde{X}$. To get the linear regression model above we have to condition on the outcome of $\tilde{X}$, which factorizes the joint density into a conditional and a marginal density,

$$D(y, x; \psi) = D(y \mid x; \theta)\, g(x), \tag{3.37}$$

where $\theta$ represents the vector of parameters of interest, $\theta = [\alpha, \beta]$, and $g(x)$ is the marginal density of $\tilde{X}$. This operation requires that the parameters of interest can be written as a function of the parameters in the joint distribution function, $\theta = f(\psi)$.

The expected mean of $\tilde{Y}$ for given $\tilde{X}$ is

$$E(\tilde{Y} \mid x; \theta) = \int y\, D(y \mid x; \theta)\, dy = \alpha + \beta x, \tag{3.38}$$

or, if we choose to condition on $\tilde{Y}$ instead,

$$E(\tilde{X} \mid y; \theta) = \int x\, D(x \mid y; \theta)\, dx = \gamma + \delta y.$$

The parameters in (3.38) can be estimated by using means, variances and covariances of the variables or, in other terms, by using some of the lower moments of the joint distribution of $\tilde{X}$ and $\tilde{Y}$. Hence, the first step is to rewrite (3.38) in such a way that we can write $\alpha$ and $\beta$ in terms of the means of $\tilde{X}$ and $\tilde{Y}$.

Looking at the LHS of (3.38), it can be seen that multiplying the conditional density by the marginal density for $\tilde{X}$, $g(x)$, leads to the joint density. Given the joint density we can choose to integrate out either $x$ or $y$. In this case we choose to integrate over $x$. After multiplication we have

$$\int y\, D(y \mid x; \theta)\, dy\, g(x) = \alpha\, g(x) + \beta x\, g(x). \tag{3.39}$$

Integrating over $x$ leads to, on the LHS,

$$\int\!\!\int y\, D(y \mid x; \theta)\, g(x)\, dy\, dx = \int\!\!\int y\, D(y, x; \theta)\, dy\, dx = \int y\, D(y; \theta)\, dy = E(y; \theta) = \mu_y. \tag{3.40}$$

Performing the same operations on the RHS leads to

$$\alpha \int g(x)\, dx + \beta \int x\, g(x)\, dx = \alpha + \beta E(\tilde{X}) = \alpha + \beta \mu_x. \tag{3.41}$$

If we put the two sides together we get

$$\mu_y = \alpha + \beta E(\tilde{X}) = \alpha + \beta \mu_x. \tag{3.42}$$

We now have one equation to solve for the two unknowns. Since we have used up the means, let us turn to the variances by multiplying both sides of (3.38) by $x$ and performing the same operations again.
Multiplication by $x$ and $g(x)$ leads to

$$\int x y\, D(y \mid x; \theta)\, dy\, g(x) = \alpha x\, g(x) + \beta x^2 g(x). \tag{3.43}$$

Integrating over $x$,

$$\int\!\!\int x y\, D(y \mid x; \theta)\, g(x)\, dy\, dx = \alpha \int x\, g(x)\, dx + \beta \int x^2 g(x)\, dx. \tag{3.44}$$

The LHS leads to

$$\int\!\!\int x y\, D(y, x; \theta)\, dy\, dx = E(\tilde{X}\tilde{Y}), \tag{3.45}$$

and the RHS to

$$\alpha \int x\, g(x)\, dx + \beta \int x^2 g(x)\, dx = \alpha E(\tilde{X}) + \beta E(\tilde{X}^2). \tag{3.46}$$

Hence our second equation is

$$E(\tilde{X}\tilde{Y}) = \alpha E(\tilde{X}) + \beta E(\tilde{X}^2). \tag{3.47}$$

Remembering the rules for the expectations operator, $E(\tilde{X}\tilde{Y}) = \mu_x \mu_y + \sigma_{xy}$ and $E(\tilde{X}^2) = \mu_x^2 + \sigma_x^2$, makes it possible to solve for $\alpha$ and $\beta$ in terms of means and variances. From the first equation we get for $\alpha$,

$$\alpha = \mu_y - \beta \mu_x. \tag{3.48}$$

If we substitute this into (3.47), we get

$$E(\tilde{X}\tilde{Y}) = (\mu_y - \beta \mu_x)\mu_x + \beta(\mu_x^2 + \sigma_x^2),$$
$$\mu_x \mu_y + \sigma_{xy} = \mu_x \mu_y - \beta \mu_x^2 + \beta \mu_x^2 + \beta \sigma_x^2,$$
$$\sigma_{xy} = \beta \sigma_x^2, \tag{3.49}$$

which gives

$$\beta = \frac{\sigma_{xy}}{\sigma_x^2}. \tag{3.50}$$

Using these expressions in the linear regression line leads to

$$E(\tilde{Y} \mid x; \theta) = \mu_y + \frac{\sigma_{xy}}{\sigma_x^2}(x - \mu_x) = \alpha + \beta x, \tag{3.51}$$

or, if we choose to condition on $\tilde{Y}$ instead,

$$E(\tilde{X} \mid y; \theta) = \mu_x + \frac{\sigma_{yx}}{\sigma_y^2}(y - \mu_y) = \gamma + \delta y. \tag{3.52}$$

We can now relate the correlation coefficient to the slope parameter in the linear regression. The correlation coefficient between $\tilde{X}$ and $\tilde{Y}$ is defined as

$$\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \quad \text{or} \quad \rho\, \sigma_x \sigma_y = \sigma_{xy}. \tag{3.53}$$

If we put this into the equations above we get

$$E(\tilde{Y} \mid x; \theta) = \mu_y + \rho \frac{\sigma_y}{\sigma_x}(x - \mu_x), \tag{3.54}$$

$$E(\tilde{X} \mid y; \theta) = \mu_x + \rho \frac{\sigma_x}{\sigma_y}(y - \mu_y). \tag{3.55}$$

So, if the two variables are independent their covariance is zero, and the correlation is also zero. In that case the conditional mean of each variable does not depend on the value, the mean or the variance of the other variable. The final message is that a non-zero correlation between two normal random variables results in a linear relationship between them. With a multivariate model, with more than two random variables, things are more complex.

4. THE METHOD OF MAXIMUM LIKELIHOOD

There are two fundamental approaches to estimation in econometrics, the method of moments and the maximum likelihood method. The difference is that the moments estimator deals with estimation without a priori choosing a specific density function. The maximum likelihood estimator (MLE), on the other hand, requires that a specific density function is chosen from the beginning. Asymptotically there is no difference between the two approaches. The MLE is more general, and is the basis for all the various tests applied in practical modeling. In this section we will focus on MLE exclusively because of its central role.

The principles of MLE were developed early, but for a long time it was considered mainly a theoretical device, with limited practical use. The progress in computer capacity has changed this. Many presentations of the MLE are too complex for students below the advanced graduate level. The aim of this chapter is to change this.

The principle of ML is not different from OLS. The way to learn MLE is to start with the simplest case, the estimation of the mean and the variance of a single normal random variable. In the next step, it is easy to show how the parameters of a simple linear regression model can be found, and tested, using the techniques of MLE. In the third step, we can analyse how the parameters of any density function can be estimated. Finally, it is often interesting to study the bivariate joint normal density function. This last exercise is good for understanding when certain variables can be treated as exogenous. The general idea is that after viewing how a single random variable can be replaced by a function of random variables, it becomes obvious how a multivariate non-linear system of variables can be estimated.

Let us start with a single stochastic time series. The first moment, or the sample mean, of the random process $\tilde{X}_t$ with the observations $(x_1, x_2, \dots, x_T)$ is found as $\bar{x} = \sum_{t=1}^{T} x_t / T$.
By using this technique we simply calculate a number that we can use to describe one characteristic of the process $\tilde{X}_t$. In the same way we can calculate the second moment around the mean, etc. In the long run, and for a stationary variable, we can use the central limit theorem (CLT) to argue that the sample mean of $(x_1, x_2, \dots, x_T)$ is approximately normally distributed, which allows us to test for significance etc.

4.1 MLE for a Univariate Process

The MLE approach starts from a random variable $\tilde{X}_t$ and a sample of $T$ independent observations $(x_1, x_2, \dots, x_T)$. The joint density function is

$$f(x_1, x_2, \dots, x_T; \theta) = f(x; \theta) = \prod_{t=1}^{T} f(x_t; \theta).$$

To describe this process there are $k$ parameters,

$$\theta = (\theta_1, \theta_2, \dots, \theta_k), \tag{4.1}$$

so we write the density function as

$$f(x; \theta), \tag{4.2}$$

where $x; \theta$ indicates that it is the shape of the density, described by the parameters $\theta$, which gives us the sample. If the density function describes a normal distribution, $\theta$ would consist of two parameters, the mean and the variance.

Now, suppose that we know the functional form of the density function. If we also have a sample of observations on $\tilde{X}_t$, we can ask which estimates of $\theta$ would be the most likely, given the functional form of the density and given the observations. Viewing the density in this way amounts to asking which values of $\theta$ maximize the value of the density function at the observed sample. Formulating the estimation problem in this way leads to a restatement of the density function in terms of a likelihood function,

$$L(\theta; x), \tag{4.3}$$

where the parameters are seen as a function of the sample. It is often convenient to work with the log of the likelihood instead, leading to the log likelihood

$$\log L(\theta; x) = l(\theta; x). \tag{4.4}$$

What is left is to find the maximum of this function with respect to the parameters in $\theta$.
The maximum, if it exists, is found by solving the system of $k$ simultaneous equations,

$$\frac{\partial l(\theta; x)}{\partial \theta_i} = 0, \quad i = 1, \dots, k, \tag{4.5}$$

for $\theta$, which gives the maximum likelihood estimates $\hat{\theta}$, provided that $D^2 l(\theta; x)$ is a negative definite matrix. In matrix form this expression is also known as the score vector, or the efficient score for $\theta$, which can be written as

$$\frac{\partial l(\theta; x)}{\partial \theta} = S(\theta), \tag{4.6}$$

such that the efficient score is zero at the maximum. The matrix of the expected second order derivatives, with the sign reversed, is known as the information matrix,

$$-E\left[\frac{\partial^2 l(\theta; x)}{\partial \theta\, \partial \theta'}\right] = I(\theta). \tag{4.7}$$

The information matrix plays an important role in demonstrating that ML estimators asymptotically attain the Cramer-Rao lower bound, and in the derivation of the so-called classical test statistics associated with the ML estimator. It can be shown, under quite general conditions, that the variances of the estimated parameters from above ($\hat{\theta}$) are given by the inverse of the information matrix,

$$\mathrm{var}(\hat{\theta}) = [I(\theta)]^{-1}. \tag{4.8}$$

So far we have not assigned any specific distribution to the density function. Let us assume a sample of $T$ independent normal random variables $\{\tilde{X}_t\}$. The normal distribution is particularly easy to work with since it only requires two parameters to describe it.
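We want to estimate the first two moments, the mean $\mu$ and the variance $\sigma^2$, thus $\theta = (\mu, \sigma^2)$. The likelihood is

$$L(\theta; x) = (2\pi\sigma^2)^{-T/2} \exp\left[-\frac{1}{2\sigma^2} \sum_{t=1}^{T} (x_t - \mu)^2\right]. \tag{4.9}$$

Taking logs of this expression yields

$$l(\theta; x) = -(T/2)\log 2\pi - (T/2)\log \sigma^2 - \frac{1}{2\sigma^2} \sum_{t=1}^{T} (x_t - \mu)^2. \tag{4.10}$$

The partial derivatives with respect to $\mu$ and $\sigma^2$ are

$$\frac{\partial l}{\partial \mu} = \frac{1}{\sigma^2} \sum_{t=1}^{T} (x_t - \mu), \tag{4.11}$$

and

$$\frac{\partial l}{\partial \sigma^2} = -\frac{T}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{t=1}^{T} (x_t - \mu)^2. \tag{4.12}$$

If these equations are set to zero, the result is

$$\sum_{t=1}^{T} x_t - T\mu = 0, \tag{4.13}$$

$$\sum_{t=1}^{T} (x_t - \mu)^2 - T\sigma^2 = 0. \tag{4.14}$$

If this system is solved for $\mu$ and $\sigma^2$ we get the estimates of the mean and the variance as (see footnote 1)

$$\hat{\mu}_x = \frac{1}{T} \sum_{t=1}^{T} x_t, \tag{4.15}$$

$$\hat{\sigma}_x^2 = \frac{1}{T} \sum_{t=1}^{T} (x_t - \hat{\mu}_x)^2 = \frac{1}{T} \sum_{t=1}^{T} x_t^2 - \left[\frac{1}{T} \sum_{t=1}^{T} x_t\right]^2. \tag{4.16}$$

Do these estimates of $\mu$ and $\sigma^2$ really represent the maximum of the likelihood function? To answer that question we have to look at the sign of the Hessian of the log likelihood function, the second order conditions, evaluated at the estimated values of the parameters in $\theta$,

$$D^2 l(\theta; x) = \begin{bmatrix} \dfrac{\partial^2 l}{\partial \mu^2} & \dfrac{\partial^2 l}{\partial \mu\, \partial \sigma^2} \\[2mm] \dfrac{\partial^2 l}{\partial \sigma^2\, \partial \mu} & \dfrac{\partial^2 l}{\partial (\sigma^2)^2} \end{bmatrix} = \begin{bmatrix} -\dfrac{T}{\sigma^2} & -\dfrac{1}{\sigma^4} \sum (x_t - \mu) \\[2mm] -\dfrac{1}{\sigma^4} \sum (x_t - \mu) & \dfrac{T}{2\sigma^4} - \dfrac{1}{\sigma^6} \sum (x_t - \mu)^2 \end{bmatrix}. \tag{4.17}$$

If we substitute the solutions for the estimates of $\mu$ and $\sigma^2$, we get

$$E[D^2 l(\theta; x)] = -\begin{bmatrix} \dfrac{T}{\hat{\sigma}_x^2} & 0 \\[2mm] 0 & \dfrac{T}{2\hat{\sigma}_x^4} \end{bmatrix} = -I(\hat{\theta}). \tag{4.18}$$

Since the variance $\hat{\sigma}_x^2$ is always positive, we have a negative definite matrix, and a maximum value for the function at $\hat{\mu}_x$ and $\hat{\sigma}_x^2$.

It remains to investigate whether the estimates are unbiased. To do so, replace the observations in the solutions for $\hat{\mu}_x$ and $\hat{\sigma}_x^2$ by the random variable $\tilde{X}$ and take expectations.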
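The closed-form solutions (4.15) and (4.16) and the first order conditions (4.11) and (4.12) can be checked on a small sample. The following is a minimal sketch, not the author's code: the eight observations in `x` are made up, and the function names are hypothetical.

```python
import math

def normal_mle(x):
    """ML estimates of the mean and variance, equations (4.15) and (4.16)."""
    T = len(x)
    mu_hat = sum(x) / T
    sigma2_hat = sum((xt - mu_hat) ** 2 for xt in x) / T   # divides by T, not T - 1
    return mu_hat, sigma2_hat

def log_likelihood(mu, sigma2, x):
    """Normal log likelihood, equation (4.10)."""
    T = len(x)
    ss = sum((xt - mu) ** 2 for xt in x)
    return (-(T / 2) * math.log(2 * math.pi)
            - (T / 2) * math.log(sigma2)
            - ss / (2 * sigma2))

def score(mu, sigma2, x):
    """First derivatives of the log likelihood, equations (4.11) and (4.12)."""
    T = len(x)
    d_mu = sum(xt - mu for xt in x) / sigma2
    d_sigma2 = -T / (2 * sigma2) + sum((xt - mu) ** 2 for xt in x) / (2 * sigma2 ** 2)
    return d_mu, d_sigma2

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up sample
mu_hat, sigma2_hat = normal_mle(x)              # mean 5.0, variance 4.0
print(mu_hat, sigma2_hat, score(mu_hat, sigma2_hat, x))
```

At the estimates both elements of the score are zero, and moving either parameter away from its estimate lowers the log likelihood, which is what the negative definite Hessian in (4.18) implies.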
The expected value of the mean is

$$E(\hat{\mu}_x) = \frac{1}{T} \sum_{t=1}^{T} E(\tilde{X}_t) = \frac{1}{T} \sum_{t=1}^{T} \mu = \mu, \tag{4.19}$$

which proves that $\hat{\mu}_x$ is an unbiased estimator of the mean. The calculations for the variance are a bit more complex, but the idea is the same. The expected variance is

$$E[\hat{\sigma}_x^2] = \frac{1}{T} E\left[\sum_{t=1}^{T} \tilde{X}_t^2 - \frac{1}{T}\left(\sum_{t=1}^{T} \tilde{X}_t\right)^2\right]$$
$$= \frac{1}{T}\left[T E(\tilde{X}_t^2) - \frac{1}{T} E\left(\sum_{t=1}^{T} \sum_{s=1}^{T} \tilde{X}_t \tilde{X}_s\right)\right]$$
$$= \frac{1}{T}\left[T E(\tilde{X}_t^2) - E(\tilde{X}_t^2) - \frac{1}{T} E\left(\sum_{t \neq s} \tilde{X}_t \tilde{X}_s\right)\right]$$
$$= \frac{1}{T}\left[(T-1) E(\tilde{X}_t^2) - \frac{1}{T}\, T(T-1) [E(\tilde{X}_t)]^2\right] = \frac{T-1}{T}\, \sigma^2. \tag{4.20}$$

Thus, $\hat{\sigma}^2$ is not an unbiased estimate of $\sigma^2$. The estimate is scaled by the factor $(T-1)/T$, so the bias, $-\sigma^2/T$, goes to zero as $T \to \infty$. This is a typical result from MLE: the mean is correct but the variance is biased. To get an unbiased estimate we need to "correct" the estimate in the following manner,

$$s^2 = \frac{T}{T-1}\, \hat{\sigma}^2 = \frac{1}{T-1}\left[\sum_{t=1}^{T} \tilde{X}_t^2 - \frac{1}{T}\left(\sum_{t=1}^{T} \tilde{X}_t\right)^2\right]. \tag{4.21}$$

The correction involves multiplying the estimated variance by $T/(T-1)$.

Footnote 1: The solution is given by
$$\frac{1}{T} \sum_{t=1}^{T}\left[x_t - \frac{1}{T} \sum_{t=1}^{T} x_t\right]^2 = \frac{1}{T} \sum_{t=1}^{T} x_t^2 - \frac{2}{T^2}\left(\sum_{t=1}^{T} x_t\right)^2 + \frac{1}{T^2}\left(\sum_{t=1}^{T} x_t\right)^2 = \frac{1}{T} \sum_{t=1}^{T} x_t^2 - \frac{1}{T^2}\left[\sum_{t=1}^{T} x_t\right]^2.$$

4.2 MLE for a Linear Combination of Variables

We have derived the maximum likelihood estimates for a single independent normal variable. How does this relate to a linear regression model? Earlier, when we discussed the moments of a variable, we showed how it was possible, as a general principle, to substitute a random variable with a function of the variable. The same reasoning applies here. Say that $\tilde{X}$ is a function of two other random variables $\tilde{Y}$ and $\tilde{Z}$. Assume the linear model

$$y_t = \beta z_t + x_t, \tag{4.22}$$

where $\tilde{Y}$ is a random variable with observations $\{y_t\}$, and $z_t$ is, for the time being, assumed to be a deterministic variable. (This is not a necessary assumption.)
Instead of using the symbol $x$ for observations on the random variable $\tilde{X}$, let us set $x_t = \epsilon_t$ where $\epsilon_t \sim NID(0, \sigma^2)$. Thus, we have formulated a linear regression model with a white noise residual. This linear equation can be rewritten as

$$\epsilon_t = y_t - \beta z_t, \tag{4.23}$$

where the RHS is the function to be substituted for the single normal variable $x_t$ used in the MLE example above. The algebra gets a bit more complicated but the principal steps are the same (see footnote 2). The unknown parameters in this case are $\beta$ and $\sigma^2$.

Footnote 2: As a consequence of the more complex algebra, the computer algorithms for estimating the parameters will also get more complex. For the ordinary econometrician there are a lot of software packages that cover most of the cases.

The log likelihood function will now look like

$$l(\beta, \sigma^2; y, z) = -(T/2)\log 2\pi - (T/2)\log \sigma^2 - \frac{1}{2\sigma^2} \sum_{t=1}^{T} (y_t - \beta z_t)^2. \tag{4.24}$$

The sum in the last term of this expression can be identified as the sum of squares function, $S(\beta)$. In matrix form we have

$$S(\beta) = \sum_{t=1}^{T} (y_t - \beta z_t)^2 = (Y - Z\beta)'(Y - Z\beta) \tag{4.25}$$

and

$$l(\beta, \sigma^2; y, z) = -(T/2)\log 2\pi - (T/2)\log \sigma^2 - \frac{1}{2\sigma^2}(Y - Z\beta)'(Y - Z\beta). \tag{4.26}$$

Differentiation of $S(\beta)$ with respect to $\beta$ yields

$$\frac{\partial S}{\partial \beta} = -2Z'(Y - Z\beta), \tag{4.27}$$

which, if set to zero, solves to

$$\hat{\beta} = (Z'Z)^{-1}(Z'Y). \tag{4.28}$$

Notice that the ML estimator of the linear regression model is identical to the OLS estimator. The variance estimate is

$$\hat{\sigma}^2 = \hat{\epsilon}'\hat{\epsilon}/T, \tag{4.29}$$

which, in contrast to the OLS estimate, is biased.

To obtain these estimates we did not have to make any direct assumptions about the distribution of $y_t$ or $z_t$. The necessary and sufficient condition is that $y_t$ conditional on $z_t$ is normal, which means that $y_t - \beta z_t = \epsilon_t$ should follow a normal distribution. This is the reason why MLE is feasible even though $y_t$ might be a dependent AR(p) process. In the AR(p) process the residual term is an independent normal random variable.
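The equivalence between the ML and OLS estimators of $\beta$ in (4.28), and the difference between the two variance estimates, can be illustrated for the scalar model (4.22). This is a minimal sketch under stated assumptions: the data in `z` and `y` are made up, there is a single regressor and no intercept (as in the text), and the function names are hypothetical.

```python
def ml_regression(y, z):
    """ML estimates for y_t = beta * z_t + eps_t with one regressor.

    Returns (beta_hat, sigma2_ml, sigma2_ols), where sigma2_ml divides the
    residual sum of squares by T as in (4.29) and sigma2_ols divides by T - 1
    (one estimated slope parameter).
    """
    T = len(y)
    beta_hat = sum(zt * yt for zt, yt in zip(z, y)) / sum(zt * zt for zt in z)  # (4.28)
    resid = [yt - beta_hat * zt for yt, zt in zip(y, z)]
    rss = sum(e * e for e in resid)          # S(beta) evaluated at beta_hat, (4.25)
    return beta_hat, rss / T, rss / (T - 1)

def sum_of_squares(beta, y, z):
    """The sum of squares function S(beta) in (4.25)."""
    return sum((yt - beta * zt) ** 2 for yt, zt in zip(y, z))

z = [1.0, 2.0, 3.0, 4.0]      # made-up regressor
y = [2.1, 3.9, 6.2, 7.8]      # made-up dependent variable

beta_hat, sigma2_ml, sigma2_ols = ml_regression(y, z)
print(beta_hat, sigma2_ml, sigma2_ols)
```

The slope is the same whichever label is attached to it. Only the divisor of the residual sum of squares differs, so the ML variance estimate is always the smaller of the two.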
The MLE is obtained by substituting the independently distributed normal variable with the deviation of $y_t$ from its conditional mean.

The above results can be extended to a vector of normal random variables. In this case we have a multivariate normal distribution, where the density is

$$D(X) = D(X_1, X_2, \dots, X_n). \tag{4.30}$$

The random variables $\tilde{X}$ will have a mean vector $\mu$ and a covariance matrix $\Sigma$. The density function for the multivariate normal is

$$D(X) = \left[(2\pi)^{n/2} |\Sigma|^{1/2}\right]^{-1} \exp\left[-(1/2)(X - \mu)'\Sigma^{-1}(X - \mu)\right], \tag{4.31}$$

which can be expressed in the compact form $X_t \sim N(\mu, \Sigma)$.

With multivariate densities it is possible to handle systems of equations with stochastic variables, the typical case in econometrics. The bivariate normal is an often used device to derive models including two variables. Set $\tilde{X} = (\tilde{X}_1, \tilde{X}_2)$, and

$$\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{bmatrix} \quad \text{with} \quad |\Sigma| = \sigma_1^2 \sigma_2^2 (1 - \rho^2), \tag{4.32}$$

where $\rho$ is the correlation coefficient. As can be seen, $|\Sigma| > 0$ unless $\rho^2 = 1$. If $\sigma_{12} = \sigma_{21} = 0$, the two processes are independent and can be estimated individually without losing any important information. In principle, if $\sigma_{12} = \sigma_{21} \neq 0$, the two equations are dependent, and it will be necessary to estimate a complete system of equations to get correct estimates, which are unbiased and efficient.

A disadvantage with MLE is that the variance estimate is biased. This, however, is only a small sample effect. It can be shown that as $T$ goes to infinity the bias disappears. The MLE is therefore asymptotically unbiased, and it can also be shown to be asymptotically efficient. Furthermore, it can be shown that MLE behaves nicely asymptotically even if we drop the assumption of normally independently distributed residuals. The estimates will tend towards those given by NID errors. This situation is referred to as quasi maximum likelihood.

The advantages are easy to see. MLE offers a general approach to the estimation of econometric models. These models can be quite complex: non-linearity, moving average residuals and so on can be handled by MLE.
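For the bivariate case, the density (4.31) and the determinant in (4.32) can be written out by hand. The sketch below is an illustration only: the function names are hypothetical, the $2 \times 2$ inverse is expanded explicitly rather than computed with a matrix library, and the parameter values in the example calls are made up.

```python
import math

def bivariate_normal_density(x, mu, s1, s2, rho):
    """Density (4.31) for n = 2, with standard deviations s1, s2 and correlation rho."""
    det = s1 ** 2 * s2 ** 2 * (1.0 - rho ** 2)        # |Sigma| from (4.32)
    d1, d2 = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)' Sigma^{-1} (x - mu), using the explicit 2x2 inverse;
    # the off-diagonal element of Sigma is sigma_12 = rho * s1 * s2.
    q = (d1 ** 2 * s2 ** 2 - 2.0 * rho * s1 * s2 * d1 * d2 + d2 ** 2 * s1 ** 2) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

def univariate_normal_density(x, mu, s):
    """Ordinary normal density, used to check the independent case."""
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

# With rho = 0 the joint density is the product of the two marginal densities.
joint = bivariate_normal_density((1.0, 2.0), (0.0, 0.0), 1.0, 1.0, 0.0)
product = univariate_normal_density(1.0, 0.0, 1.0) * univariate_normal_density(2.0, 0.0, 1.0)
print(joint, product)
```

When $\rho$ approaches $\pm 1$ the determinant goes to zero and the density is no longer defined, which is the degenerate case excluded in (4.32).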
Consequently there exists a large literature on MLE. In principle this literature is not difficult. The main problem for our understanding of the use of MLE in different situations lies in our understanding of matrix algebra.

5. THE CLASSICAL TESTS - WALD, LM AND LR TESTS

(To be completed: add a figure of a normally distributed variable with the value of the likelihood function ($L$) on the vertical axis and the parameter value on the horizontal axis, with $\hat{\theta}$ indicating the maximum value of $L$.)

There are three approaches to testing a statistical model. The first is to start with an unrestricted model and impose restrictions on the estimated model. The second approach is to impose the restrictions prior to estimation, and estimate a restricted model. The test is then performed by asking if the restriction should be lifted. The third approach is to test for significant differences between an estimated restricted model and an estimated unrestricted model. The last approach involves estimating two models, rather than one. The three approaches to testing are named

Wald tests (W): estimate an unrestricted model.
Lagrange Multiplier tests (LM): estimate a restricted model.
Likelihood Ratio tests (LR): estimate both the unrestricted and the restricted models.

A test is labeled Wald, Lagrange Multiplier or Likelihood Ratio depending on how it is constructed. A typical Wald test is the "t-test" for significance. A Lagrange multiplier test is the LM test of autocorrelation. Finally, the F-test for testing the significance of one or several parameters in a group represents a typical Likelihood Ratio test.

Imagine a figure of a normal density function, with the shape of a normal random variable centered around its (true) mean. On the vertical axis put the value of the likelihood function. The maximum is given by the peak of the distribution. Let the horizontal axis represent the estimated mean. The true mean is indicated by the peak of the normal distribution.
The LR test is based on a comparison of likelihood values. If a restriction, which is imposed on the unrestricted model, is valid, the value of the likelihood should not be reduced significantly. This test is based on two estimations, one unrestricted giving the value of the likelihood $\hat{L}_U$ and one restricted leading to $\hat{L}_R$. From these two values the likelihood ratio is defined as

$$\lambda = \frac{\hat{L}_R}{\hat{L}_U}. \tag{5.1}$$

This leads to the test statistic $-2\ln\lambda$, which has a $\chi^2(R)$ distribution, where $R$ is the number of restrictions.

The Wald test compares (squared) estimated parameters with their variances. In a linear regression, if the residual is $NID(0, \sigma^2)$, then $\hat{\beta} \sim N(\beta, \mathrm{var}(\hat{\beta}))$, so $(\hat{\beta} - \beta) \sim N(0, \mathrm{var}(\hat{\beta}))$, and a standard t-test will tell if $\beta$ is significant or not. More generally, if we have a vector of $J$ normally distributed random variables $\hat{X} \sim N_J(\mu, \Sigma)$, then we have

$$(x - \mu)'\Sigma^{-1}(x - \mu) \sim \chi^2(J). \tag{5.2}$$

The LM test starts from a restricted model and tests if the restrictions are valid. Here restrictions should be understood as a general concept. A model is restricted if it assumes homoscedasticity, no autocorrelation, etc. The test is formulated as

$$LM = \left[\frac{\partial \ln L(\hat{\theta}_R)}{\partial \hat{\theta}_R}\right]' \left[I(\hat{\theta}_R)\right]^{-1} \left[\frac{\partial \ln L(\hat{\theta}_R)}{\partial \hat{\theta}_R}\right]. \tag{5.3}$$

The formula looks complex but is in many cases extremely easy to apply. Consider the LM test for $p$:th order autocorrelation in the residuals $\hat{\epsilon}_t$,

$$\hat{\epsilon}_t = \gamma_0 + \gamma_1 \hat{\epsilon}_{t-1} + \gamma_2 \hat{\epsilon}_{t-2} + \dots + \gamma_p \hat{\epsilon}_{t-p} + \nu_t. \tag{5.4}$$

The LM test for testing if the parameters $\gamma_1$ to $\gamma_p$ are zero amounts to estimating the equation with OLS and calculating the test statistic $TR^2$, distributed as $\chi^2(p)$ under the null of no autocorrelation. Similar tests can be formulated for testing various forms of heteroscedasticity.

Tests can often be formulated in such a way that they follow either a $\chi^2$ or an $F$-distribution. In less than large samples the $F$-distribution is the better one to use.
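The $TR^2$ form of the LM test can be sketched for the first order case, $p = 1$ in (5.4). With a constant and one lag, the $R^2$ of the auxiliary regression equals the squared sample correlation between $\hat{\epsilon}_t$ and $\hat{\epsilon}_{t-1}$, so no matrix algebra is needed. The code is an illustration under stated assumptions: the function names are hypothetical, the residual series is simulated with a made-up AR(1) coefficient of 0.8, and the auxiliary regression contains only the terms shown in (5.4).

```python
import random

def lm_autocorr_stat(resid):
    """T * R^2 from regressing e_t on a constant and e_{t-1} (p = 1 in (5.4)).

    T is the number of observations available after losing one to the lag.
    """
    y = resid[1:]                 # e_t
    x = resid[:-1]                # e_{t-1}
    n = len(y)
    my, mx = sum(y) / n, sum(x) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r2 = sxy * sxy / (sxx * syy)  # R^2 of a simple regression with intercept
    return n * r2

def simulate_ar1(T, phi, seed):
    """Simulate e_t = phi * e_{t-1} + v_t with standard normal v_t (illustration only)."""
    rng = random.Random(seed)
    e = [0.0]
    for _ in range(T - 1):
        e.append(phi * e[-1] + rng.gauss(0.0, 1.0))
    return e

CHI2_1_CRIT_5PCT = 3.841                     # 5% critical value of chi-square(1)
stat_ar = lm_autocorr_stat(simulate_ar1(500, 0.8, seed=0))
print(stat_ar, stat_ar > CHI2_1_CRIT_5PCT)   # strongly autocorrelated residuals reject the null
```

For residuals without autocorrelation the statistic should exceed the critical value in only about five per cent of samples. For higher orders $p$ the same idea applies, but the auxiliary regression then needs a multiple regression routine.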
The general rule for choosing among tests based on the $F$ or the $\chi^2$ distribution is to use the $F$ distribution, since it has the better small sample properties.

If the information matrix is known (meaning that it is not necessary to estimate it), all three tests would lead to the same test statistic, regardless of the chosen distribution, $\chi^2$ or $F$. If all three approaches lead to the same test statistic, we would have $R_W = R_{LR} = R_{LM}$. However, when the information matrix is estimated we get the following relation between the tests, $R_W \geq R_{LR} \geq R_{LM}$.

Remember (1) that when dealing with limited samples the three tests might lead to different conclusions, and (2) that if the null is rejected the alternative can never be accepted. As a matter of principle, statistical tests only reject the null hypothesis. Rejection of the null does not lead to accepting the alternative hypothesis; it leads only to the formulation of a new null. As an example, in a test where the null hypothesis is homoscedasticity, the alternative is not necessarily heteroscedasticity. Tests are generally derived on the assumption that "everything else" is OK in the model. Thus, in this example, rejection of homoscedasticity could be caused by autocorrelation, non-normality, etc. The econometrician has to search for all possible alternatives.

Part II
Time Series Modeling

6. RANDOM WALKS, WHITE NOISE AND ALL THAT

6.1 Different types of processes

This section looks at different types of stochastic time series processes that are important in economics and finance. A time series is a series where the data is ordered by time. A random variable ($\tilde{X}$) is a variable which can take on more than one value, and for each value it can take there is a number between zero and one that describes the probability of observing that value. We distinguish between discrete and continuous random variables. Discrete random variables can only take on a finite, or at most countable, number of outcomes.
A continuous random variable can take any value between $-\infty$ and $+\infty$.

The mathematical model of the probabilities associated with a random variable is given by the distribution function $F(x)$, $F(x) = P(\tilde{X} \leq x)$. If we have a continuous random variable, we can define the probability density function of the random variable as $f(x) = dF(x)/dx$. Random variables are characterized by their probability functions and their moments. First, second, third and fourth moments all describe the characteristics of a random variable. By estimating these we describe a random variable. All moments have direct implications for risk-and-return decisions. Mean = return, variance = risk, while skewness and kurtosis imply deviations from normality and might affect behavior. (To be completed.)

A stochastic time series process is then made up of a random variable that over time can take on more than one value. We denote a stochastic process as $\{\tilde{X}_t\}_0^T$, indicating that it starts at time zero and continues to time $T$. To define a stochastic time series process we start with the random variable $\tilde{X}_{t+i}$, which, seen from time $t$, can take on different values at the future periods $i = 1, 2, 3, \dots, n$, where $n$ might go to infinity. Often we will talk about the conditional expectation of $\tilde{X}_{t+i}$: we want to estimate the most likely future value, given the information we have today.

A stochastic time series process can be discrete or continuous. A discrete series only changes value at discrete time periods, while a continuous process changes, or can potentially change, value continuously and not only at discrete time intervals.

The conditional expectation is written as $E(\tilde{X}_{t+1} \mid I_t)$ or $E_t(\tilde{X}_{t+1})$. To formalize the use of conditional expectations, assume a probability space $(\Omega, \mathcal{F}, P)$, where $\Omega$ is the total sample space (or possible states of the world), $\mathcal{F}$ denotes the tribe of subsets of $\Omega$ that are outcomes (observations), and $P$ is a probability measure associated with the outcomes.
A very practical question in modeling is whether there exists a simple mathematical form for associating outcomes with probabilities.

Usually we will refer to the tribe of subsets $\mathcal{F}$ as the information set $I_t$. We will assume that memory is not lost by the decision makers, so the information set is increasing over time, $I_{t_0} \subseteq I_{t_1} \subseteq \dots \subseteq I_{t_k} \subseteq I_{t_{k+1}} \subseteq \dots$ In a discrete time setting we refer to these increasing sets as an increasing sequence of sigma-fields. In a continuous time setting, where new information arrives continuously rather than at discrete time intervals, the increasing information set is referred to as a filtration, or an increasing family of sigma-algebras. A very unofficial standard is to use $I_t$ in discrete time settings and $\mathcal{F}_t$ in continuous time settings. We can also say that the set $\{\mathcal{F}_t : t \geq 0\}$ is a filtration, representing an increasing family of sub-sigma-algebras on $\mathcal{F}$. Over time, outcomes of $\tilde{X}_t$, $(x_1, x_2, \dots, x_t)$, will be added to the increasing family of information sets. We refer to the observed process, $(x_1, x_2, \dots, x_t)$, as adapted to the filtration $\mathcal{F}_t$. We can also say that if $(x_1, x_2, \dots, x_t)$ is an adapted process, then for the sequence $\{x_t\}$, $\tilde{X}_t$ is a random variable with respect to $(\Omega, \mathcal{F})$, and for each $t$ the value of $\tilde{X}_t$ is known as $x_t$.

6.2 White Noise

A random variable is a white noise process if its expected mean is equal to zero,

$$E[\epsilon_t] = \mu = 0, \tag{6.1}$$

its variance exists and is constant, $\sigma^2$, and there is no memory in the process so the autocorrelation function is zero,

$$E[\epsilon_t \epsilon_t] = \sigma^2, \tag{6.2}$$

$$E[\epsilon_t \epsilon_s] = 0 \quad \text{for } t \neq s. \tag{6.3}$$

In addition, the white noise process is supposed to follow a normal and independent distribution, $\epsilon_t \sim NID(0, \sigma^2)$. A standardized white noise has the distribution $NID(0, 1)$. Dividing $\epsilon_t$ by $\sqrt{\sigma^2} = \sigma$ gives $(\epsilon_t/\sigma) \sim NID(0, 1)$.

The independent normal distribution has some important characteristics.
First, if we add normal random variables together, the sum will have a mean equal to the sum of the means of all the variables. Thus, adding $T$ standardized white noise variables together as $z_T = \sum_{t=1}^{T} (\epsilon_t/\sigma)$ forms a new variable with mean $E(z_T) = E(\epsilon_1/\sigma) + E(\epsilon_2/\sigma) + \dots + E(\epsilon_T/\sigma) = (1/\sigma)[E(\epsilon_1) + E(\epsilon_2) + \dots + E(\epsilon_T)] = 0$. Since each variable is independent, we have the variance as $\sigma_z^2 = \sigma_{z,1}^2 + \sigma_{z,2}^2 + \dots + \sigma_{z,T}^2 = 1 + 1 + \dots + 1 = T$. The random variable is distributed as $z_T \sim N(0, T)$, with a standard deviation of $\sqrt{T}$. As the forecast horizon increases, a 95% forecast confidence interval therefore widens as $\pm 1.96\sqrt{T}$.

In the same way, we can define the distribution, mean and variance over sub-periods of time. Suppose $\epsilon_t \sim N(0, 1)$ is defined for the period of one year. The variable will then be distributed over six months as $N(0, 1/2)$, with a standard deviation of $1/\sqrt{2}$, and over three months as $N(0, 1/4)$, with a standard deviation of $1/\sqrt{4}$. In general, if the year is divided into $\tau$ equal sub-periods, the distribution over one sub-period becomes $N(0, 1/\tau)$ with standard deviation $1/\sqrt{\tau}$. This property, which follows from the assumption of independent distributions, is closely tied to the Markov property. Given that each increment of $x$ is generated from an independent normal distribution $N(\mu, \sigma^2)$, the change in $x$ from time $0$ to time $0 + T$ is distributed as $N(\mu T, \sigma^2 T)$.

To sum up, it follows from the definition that a white noise process is not linearly predictable from its own past. The expected mean of a white noise, conditional on its history, is zero,

$$E[\epsilon_t \mid \epsilon_{t-1}, \epsilon_{t-2}, \dots, \epsilon_1] = E[\epsilon_t] = 0. \tag{6.4}$$

This is a relatively weak condition. A white noise process might be predicted by other variables, and by its own past using non-linear functions.

A process is called an innovation if it is unpredictable given some information set $I_t$. A process $y_t$ is an innovation process w.r.t.
an information set if

$$E[y_t \mid I_t] = 0, \tag{6.5}$$

where the information set $I_t$ includes not only the history of $\epsilon_t$, but also all other information which might be of importance for explaining this process. Stating that a series is a white noise innovation process with respect to some information set $I_t$ is a stronger requirement than saying that it is a white noise process. It is also a stronger statement than saying that $\epsilon_t$ is a martingale difference process, because we add the assumption of a normal distribution. The martingale and the martingale difference processes are defined in terms of their first moments only. Creating a residual process that is a white noise innovation term is a basic requirement in the modelling process.

6.3 The Log Normal Distribution

The normal distribution is central in econometric modeling. However, financial prices display two characteristics which make them unfit for a stochastic process based on the assumption of normal distributions. Stock prices cannot be negative, due to limited liability, and they tend to grow over time due to the time value of money. Thus, the distribution of stock prices is typically non-negative and skewed. The normal distribution, on the other hand, is symmetric and stretches from $-\infty$ to $+\infty$. A better alternative for modelling stock prices, and many other asset prices, is to assume a log normal distribution, which, compared to the normal, is only defined over $[0, \infty)$ and is right skewed, reflecting the fact that stock prices have a tendency to move up rather than down. Furthermore, the log normal distribution has the property that the log of a log normal random variable has a normal distribution. Thus, taking the log of log normal random stock prices transforms their distribution to a normal distribution.

Let $\tilde{S}_t$ be a random log normal stock price, with mean $\mu$ and variance $\sigma^2$. The log of $s_t$ is then distributed as

$$\ln s_t \sim N\left[\mu - \frac{\sigma^2}{2},\ \sigma^2\right].$$

Given that $\tilde{S}_t$ has a log normal distribution, it follows that the distance between the logs of $\tilde{S}_t$ and $\tilde{S}_{t+n}$ is distributed as

$$\ln \tilde{S}_{t+n} - \ln \tilde{S}_t \sim N\left[\left(\mu - \frac{\sigma^2}{2}\right) n,\ \sigma^2 n\right].$$

6.4 The ARIMA Model

The non-parametric white noise can be used to define (or generate) autoregressive models (AR) and moving average models (MA). The AR(p) model is

$$y_t = \mu + a_1 y_{t-1} + \dots + a_p y_{t-p} + \epsilon_t, \tag{6.6}$$

where $\mu$ is a constant (for a stationary process, $E(y_t) = \mu / (1 - a_1 - \dots - a_p)$) and $\epsilon_t \sim NID(0, \sigma^2)$. Or, using the lag operator $L^i x_t = x_{t-i}$,

$$A(L) y_t = \mu + \epsilon_t, \tag{6.7}$$

where $A(L) = (1 - a_1 L - a_2 L^2 - \dots - a_p L^p)$. The eigenvalues associated with this polynomial determine the time path of $y_t$. The moving average model of order $q$ is

$$y_t = \mu + \epsilon_t + b_1 \epsilon_{t-1} + \dots + b_q \epsilon_{t-q}, \tag{6.8}$$

or, using the lag operator,

$$y_t = \mu + B(L)\epsilon_t. \tag{6.9}$$

6.5 The Random Walk Model

A special case of the AR(1) model is the random walk model,

$$x_t = x_{t-1} + \epsilon_t \quad \text{where } \epsilon_t \sim NID(0, \sigma^2), \tag{6.10}$$

where $x_{t-1}$ is the lagged value of $x_t$, with an implicit parameter of unity, and $\epsilon_t$ is a white noise process. It follows that, given the past of the series, the best prediction we can use is the present value of the series, and that the first difference is nothing else than white noise, $x_t - x_{t-1} = \Delta x_t = \epsilon_t$. The important point is that the increments of the series are unpredictable from the series' own past.

A random walk is non-stationary. By definition, it is integrated of order one, I(1). Taking the first difference of a random walk series produces a stationary I(0) (white noise) series. A random walk has the property that today's value is the prediction of the variable's future values,

$$E(x_{t+1} \mid x_t, x_{t-1}, x_{t-2}, \dots, x_{t-n}) = E(x_{t+1} \mid x_t) = x_t, \tag{6.11}$$

where $n$ might be equal to infinity. This definition does not rule out the case that there are other variables that are correlated with $x_t$ and thereby also predict $x_{t+1}$. We can also say that a random walk has an infinitely long memory.
The mean is zero, and the variance and autocovariance are equal to $var(\tilde{X}_t) = \sigma^2 t$ and $Cov(\tilde{X}_t, \tilde{X}_{t-n}) = (t - n)\sigma^2$. With $x_0 = 0$,

$$E(x_t) = E\left[\sum_{i=1}^{t} \varepsilon_i\right] = 0, \tag{6.12}$$

$$var(x_t) = E(x_t^2) = E\left[\left(\sum_{i=1}^{t} \varepsilon_i\right)^2\right] = \sum_{i=1}^{t}\sum_{j=1}^{t} E[\varepsilon_i \varepsilon_j] = \sigma^2 t. \tag{6.13}$$

The first autocovariance is $(t-1)\sigma^2$,

$$cov(x_t, x_{t-1}) = E(x_t x_{t-1}) = E\left[\left(\sum_{i=1}^{t}\varepsilon_i\right)\left(\sum_{j=1}^{t-1}\varepsilon_j\right)\right] = \sum_{i=1}^{t}\sum_{j=1}^{t-1} E[\varepsilon_i\varepsilon_j] = \sigma^2 (t-1). \tag{6.14}$$

The autocovariances for higher lag orders follow from this example. As can be seen, these are non-stationary moments, since both depend on time (t). It follows that the autocorrelation function looks like

$$\rho_n = [(t - n)/t]^{1/2}. \tag{6.15}$$

We can see that, given a sufficiently large number of observations, there is an infinite memory: for any fixed lag n, the theoretical autocorrelation approaches 1.0 as t grows.

If the series is $(x_0, x_1, \dots, x_n)$, we can substitute repeatedly backwards,

$$x_t = x_0 + \sum_{i=1}^{t} \varepsilon_i. \tag{6.16}$$

Thus, a random walk is a sum of white noise errors from the beginning of the series ($x_0$). Hence, the value of today depends on shocks built up from the beginning of the series. All shocks in the past are still affecting the series today. Furthermore, all shocks are equally important. The process formed by $\sum_{i=1}^{t} \varepsilon_i$ is called a stochastic trend. In contrast to a deterministic trend, the stochastic trend changes its slope in a random way period by period. Ex post, a stochastic trend might look like a deterministic trend. Thus, it is not really possible to determine by inspection whether a variable is driven by a stochastic or a deterministic trend, or a combination of both.

If we add a constant term to the model we get a random walk with a drift,

$$x_t = \mu + x_{t-1} + \varepsilon_t, \tag{6.17}$$

where the constant represents the drift term. In this process $x_t$ is driven by both a deterministic and a stochastic trend. If we perform the same backward substitution as above, we get

$$x_t = \mu t + \sum_{i=1}^{t} \varepsilon_i + x_0, \tag{6.18}$$

where $t = 1, 2, \dots, n$.
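The backward substitution and the time-dependent moments can be checked numerically. The following is a rough sketch under stated assumptions: the helper names `drift_walk` and `closed_form`, the drift of 0.2, the sample length T = 50 and the number of replications are all illustrative choices, not values from the text. It compares the recursion with the summed form on one path, and then estimates the variance and first autocovariance of a driftless walk by Monte Carlo.

```python
import random

def drift_walk(shocks, mu, x0):
    """Recursion x_t = mu + x_{t-1} + eps_t."""
    path, prev = [], x0
    for eps in shocks:
        prev = mu + prev + eps
        path.append(prev)
    return path

def closed_form(shocks, mu, x0):
    """Backward-substituted form x_t = mu*t + sum_{i<=t} eps_i + x0."""
    out, running = [], 0.0
    for t, eps in enumerate(shocks, start=1):
        running += eps
        out.append(mu * t + running + x0)
    return out

rng = random.Random(42)
T, reps, sigma = 50, 4000, 1.0

# Identity check on a single path: recursion versus summed form.
shocks = [rng.gauss(0.0, sigma) for _ in range(T)]
gap = max(abs(a - b) for a, b in zip(drift_walk(shocks, 0.2, 1.0),
                                     closed_form(shocks, 0.2, 1.0)))

# Monte Carlo moments of a driftless walk started at x0 = 0.
# The mean is zero, so second moments about zero estimate var and cov.
end_vals, prev_vals = [], []
for _ in range(reps):
    path = drift_walk([rng.gauss(0.0, sigma) for _ in range(T)], 0.0, 0.0)
    end_vals.append(path[-1])    # x_T
    prev_vals.append(path[-2])   # x_{T-1}
var_T = sum(v * v for v in end_vals) / reps
cov_T = sum(a * b for a, b in zip(end_vals, prev_vals)) / reps
print(gap, var_T, cov_T)
```

With $\sigma = 1$ the theoretical values are $T$ for the variance and $T - 1$ for the first autocovariance, so the two estimates should land near 50 and 49, up to simulation noise.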
Thus, a constant term in a random walk model implies that the variable follows a linear deterministic trend ($\mu t$) and a stochastic trend in the long run. In the long run the deterministic trend will dominate the stochastic trend and determine the path of $x_t$. Taking first differences leads to

$$\Delta x_t = \mu + \varepsilon_t, \tag{6.19}$$

where the constant measures the average change in $x_t$ (the average growth rate if $x_t$ is measured in logs), since $E(\Delta x_t) = \mu$. The expected value of a driftless random walk, for any future date and given today's information, is always today's value, $E(x_{t+n}) = x_t$. For a random walk with a drift the expected value is $E(x_{t+n}) = \mu n + x_t$.

At first glance the random walk model might seem extreme. Is it possible to motivate that a series has an infinite memory, so that shocks remain in the series forever? The answer is yes. The most common example is that of innovations leading to economic growth, which then spills over into other economic variables. Innovations leading to economic growth do not occur at fixed intervals, nor is every single invention equally important. Over time, innovations will occur at random intervals and some inventions will be more important than others. The outcome is that productivity and economic growth are driven by a stochastic trend, just as described by a random walk.

In empirical work it is common to find variables that behave like random walks. Given forward looking behavior of economic agents, it is often possible to construct economic models where transformed variables will behave like random walks. In a forward looking world agents will use all relevant information when they determine today's prices. One important characteristic follows from this, namely that today's price is the best prediction of future prices. However, the relationship between today's price and the predicted future price is more complex. We return to this issue below, when we talk about martingales.
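The forecast rule for a random walk with drift stated earlier in this section, $E(x_{t+n}) = \mu n + x_t$, can be illustrated with a small simulation. This is only a sketch: the function names `forecast` and `simulate_ahead` and all numerical values (starting value 10, drift 0.5, horizon 8) are hypothetical choices made for the example.

```python
import random

def forecast(x_t, mu, n):
    """Expected value n periods ahead for a random walk with drift mu."""
    return x_t + mu * n

def simulate_ahead(x_t, mu, sigma, n, rng):
    """One simulated value of x_{t+n}, starting from the known x_t."""
    x = x_t
    for _ in range(n):
        x = mu + x + rng.gauss(0.0, sigma)
    return x

rng = random.Random(5)
x_t, mu, sigma, n, reps = 10.0, 0.5, 1.0, 8, 5000

# Average many simulated futures and compare with the forecast rule.
draws = [simulate_ahead(x_t, mu, sigma, n, rng) for _ in range(reps)]
mc_mean = sum(draws) / reps
print(forecast(x_t, mu, n), mc_mean)
```

Setting `mu = 0.0` gives the driftless case, where the simulated average should stay close to today's value whatever the horizon.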
A note on the estimation and testing of random walks

A random walk process is also a series integrated of order one. It is also called a unit root process, and it contains a stochastic trend. Furthermore, a random walk process can also be embedded in another process, say an ARIMA(p, d, q) process. The difficulty is that inference on random walk variables (and integrated variables) is problematic, because the estimated parameter on the lagged term will not follow a standard normal distribution. Hence, the ordinary t, chi-square and F distributions are not suitable for inference. Parameter estimates will generally be asymptotically unbiased, but their standard errors and variances do not follow standard distributions. For instance, a common t-test cannot be used to test if a = 1 in the regression

$$x_t = a x_{t-1} + \varepsilon_t. \tag{6.20}$$

If $x_t$ follows a random walk, the distribution of $[\hat{a}/\text{st.dev}(\hat{a})]$ will be skewed to the left, and thus depart from the Student's t-table. Just as in any autoregressive model, the estimate $\hat{a}$ will be biased downward. The term $(\hat{a} - a)$, however, becomes asymptotically a ratio between two random variables, which will lead to a second order bias in the estimation of the variance as $T \to \infty$. In this case, with a unit root process, the ratio is between random variables which in turn are functions of Wiener processes. In this situation one common approach is to use the so-called Dickey-Fuller test in combination with simulated distributions.

Testing for a unit root (a = 1) is one aspect of testing if a variable is a random walk. Another aspect, if it is not possible to reject a unit root, is to test if the residual is $NID(0, \sigma^2)$. Campbell, Lo and MacKinlay (1997, Ch 1) show how you can test for the absence of autocorrelation when dealing with the null hypothesis of a random walk.
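The left skew of the test statistic under a unit root can be seen in a small Monte Carlo experiment. The sketch below is illustrative and rests on several assumptions: it uses the regression without a constant from (6.20), a sample of T = 100, 2000 replications, and a hypothetical helper `df_regression`. It is not a replacement for tabulated Dickey-Fuller critical values.

```python
import math
import random

def df_regression(x):
    """OLS of x_t on x_{t-1} without a constant.

    Returns (a_hat, t-statistic for the hypothesis a = 1).
    """
    y, lag = x[1:], x[:-1]
    sxx = sum(v * v for v in lag)
    a_hat = sum(u * v for u, v in zip(y, lag)) / sxx
    resid = [u - a_hat * v for u, v in zip(y, lag)]
    s2 = sum(e * e for e in resid) / (len(y) - 1)   # one estimated parameter
    se = math.sqrt(s2 / sxx)
    return a_hat, (a_hat - 1.0) / se

rng = random.Random(123)
T, reps = 100, 2000
a_hats, t_stats = [], []
for _ in range(reps):
    # Generate a driftless random walk starting at zero.
    x, level = [0.0], 0.0
    for _ in range(T):
        level += rng.gauss(0.0, 1.0)
        x.append(level)
    a_hat, t = df_regression(x)
    a_hats.append(a_hat)
    t_stats.append(t)

mean_a = sum(a_hats) / reps
share_negative = sum(t < 0 for t in t_stats) / reps
# Under standard normal theory about 5 percent of t-values fall below -1.645.
share_below_normal_cv = sum(t < -1.645 for t in t_stats) / reps
print(mean_a, share_negative, share_below_normal_cv)
```

If the statistic were symmetric around zero, about half of the simulated t-values would be negative. A clearly larger share, together with an average estimate below one, is the skewness and downward bias described above.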
Unfortunately, it is quite common in the literature to conclude that a series is a random walk (meaning not rejecting the null of a unit root) only on the basis of unit root testing, forgetting about the properties of the residual term, which under a random walk is simply the first difference. When testing for a random walk in limited samples it is extremely difficult to distinguish between a random walk and a stationary AR(1) model with a parameter of, say, 0.99.

A problem with random walks, as well as all variables which include stochastic trends, is that it is in general not possible to use standard distributions for inference. Parameter estimates will generally be asymptotically unbiased, but their standard deviations and variances do not follow standard distributions.

6.6 Martingale Processes

A random variable is said to be a martingale if the present observation is the best prediction of all future values. Let $\{\tilde{X}_t\}_{t=1}^{\infty}$ be a process of the random variable $\tilde{X}_t$. We say that the variable is a martingale with respect to the information set $I_t$ (footnote 1), if the expected value of $\tilde{X}_{t+s}$ is equal to the present value of $\tilde{X}_t$,

$$E\left[\tilde{X}_{t+s} \mid I_t\right] = x_t \quad \text{for } s > 0. \tag{6.21}$$

Footnote 1: Alternatively, it is possible to define the information set at time t-1, and write the definition as $E\left[\tilde{X}_t \mid I_{t-1}\right] = \tilde{X}_{t-1}$.

Given the information set, all information relevant for predicting $\tilde{X}_{t+s}$ is contained in today's value of $\tilde{X}_t$. Thus, the best prediction of $\tilde{X}_{t+1}$ is $x_t$, and the value of today is the best prediction for all periods in the future. The information set might include the history of $\tilde{X}_t$ as well as all other information that might be of relevance for predicting $\tilde{X}_{t+s}$. The definition of a martingale is always relative, since we have the freedom of defining different information sets. If $\tilde{X}_t$ is a martingale with respect to the information set $I_t'$, it might not be a martingale with respect to another information set $I_t''$ unless the two sets are identical.
We can now continue and define the martingale difference process in terms of the expected difference between $\tilde{X}_{t+s}$ and $\tilde{X}_t$,

$$E[(\tilde{X}_{t+s} - \tilde{X}_t) \mid I_t] = E(\tilde{X}_{t+s} \mid I_t) - x_t = 0. \tag{6.22}$$

If a process is a martingale difference process, changes in the process are unpredictable from the information set.

The sub-martingale and the super-martingale are two versions of martingale processes. A sub-martingale is defined as

$$E\left[\tilde{X}_{t+s} \mid I_t\right] \geq x_t,$$

which says that, on average, the expected value is growing over time. A super-martingale is defined as

$$E\left[\tilde{X}_{t+s} \mid I_t\right] \leq x_t,$$

which says that the expected value of $\tilde{X}_{t+s}$ is bounded by $\tilde{X}_t$ and is, on average, declining over time.

Martingales are well known in the financial literature. If the agents on a financial market use all relevant information to predict the yields of financial assets, the prices of these assets will, under certain special conditions, behave like martingales. The random walk hypothesis of asset prices does not come from finance theory. It is based on empirical observations, and is mainly a hypothesis about the empirical behavior of asset prices which lacks a theoretical foundation. A random walk process is a martingale, but also includes statements about distributions. If we compare with the random walk we have the model $x_t = x_{t-1} + \varepsilon_t$, where $\varepsilon_t$ is a normally distributed white noise process. The latter is a stronger condition than assuming a martingale process. A random walk with a drift, $x_t = \mu + x_{t-1} + \varepsilon_t$ with $\mu > 0$, is a sub-martingale, since the deterministic trend will increase the expectation over time, $E(\tilde{X}_{t+1} \mid I_t) = \mu + x_t$.

Let us now turn to finance theory. Theory says that the price of an asset ($P_{t+1}$) at time t+1 is given by the price at t plus a return determined by a risk-adjusted discount factor r. If we assume, for simplicity, that the discount factor is a constant, we get $P_{t+1} = (1 + r)P_t$. Asset prices are therefore not driftless random walks, or martingales.
The process described by theory is $\ln P_{t+1} = \ln(1 + r) + \ln P_t + \varepsilon_{t+1}$, which is a sub-martingale given, in this case, a constant discount factor. If we would like to say that asset prices are martingales we must either transform the price process according to $[P_{t+1}/(1 + r)]$, or we must include the risk-adjusted discount factor in the information set (footnote 2). Thus, the expected value of an asset price is, by definition, $E(P_{t+1}) = E[(1 + r)P_t]$. If the discount factor (and risk) is a constant (g) we get, with prices measured in logs, $E(P_{t+1}) = g + P_t$, which is a random walk with drift. If the risk premium is a time-varying stochastic variable ($\tilde{G}_t$), we have $E(P_{t+1}) = g_t + P_t$, which takes us even further away from the random walk.

Footnote 2: It is obvious that we can transform a variable into a martingale by subtracting elements from the process by conditioning or direct calculation. In fact most variables can be transformed into a martingale in this way. An alternative way of transforming a variable into a martingale is to transform its probability distribution. In this method you look for a probability distribution which is 'equivalent' to the one generating the conditional expectations. This type of distribution is called an equivalent martingale distribution.

It is important to distinguish between martingales and random walks. Financial theory ends in statements about the expected mean of a variable with respect to a given information set. A random walk is defined in terms of its own past only. Thus, saying that a variable is a random walk does not exclude the case that there exists an information set for which the variable is not a martingale. Furthermore, the residuals in a random walk model are by definition independent, if we assume them to be normally distributed white noise. A martingale, by contrast, describes the behavior of the first moment of a random variable. It does not imply independence between the higher moments of the series.
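The point that a martingale difference restricts only the first moment can be illustrated with shocks whose conditional variance depends on their own past. The sketch below is an assumption-laden example, not the author's model: it generates shocks as $\varepsilon_t = \sqrt{h_t}\, z_t$ with $h_t = \omega + \alpha \varepsilon_{t-1}^2$ and standard normal $z_t$, where the names `simulate_arch1` and `autocorr1` and the parameter values $\omega = 1$, $\alpha = 0.3$ are hypothetical. The levels should be close to uncorrelated while the squares are clearly autocorrelated.

```python
import math
import random

def simulate_arch1(n, omega, alpha, seed=7):
    """Shocks eps_t = sqrt(h_t) * z_t with h_t = omega + alpha * eps_{t-1}^2."""
    rng = random.Random(seed)
    eps, prev = [], 0.0
    for _ in range(n):
        h = omega + alpha * prev * prev          # conditional variance
        prev = math.sqrt(h) * rng.gauss(0.0, 1.0)
        eps.append(prev)
    return eps

def autocorr1(series):
    """First-order sample autocorrelation."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    return sum(a * b for a, b in zip(dev[1:], dev[:-1])) / sum(d * d for d in dev)

eps = simulate_arch1(20000, omega=1.0, alpha=0.3)
rho_levels = autocorr1(eps)                      # unpredictable in the mean
rho_squares = autocorr1([e * e for e in eps])    # dependent in the variance
print(rho_levels, rho_squares)
```

A test for autocorrelation in the levels alone would not detect this kind of dependence, which is why the residuals should also be examined in their squares.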
If we model a martingale by a first order autoregressive process, we might find that the errors are dependent through higher moments. The variance of $\varepsilon_t$ is then not $\sigma^2$, but a function of its own past, like

$$\varepsilon_t^2 = \omega + \alpha \varepsilon_{t-1}^2 + \nu_t, \tag{6.23}$$

where $\nu_t$ is a white noise process. This is a first order ARCH model, ARCH(1) (Auto Regressive Conditional Heteroscedasticity), which implies that a large shock to the series is likely to be followed by another large shock. In addition, it implies that the residuals are not independent of each other.

The conclusion is that we must be careful when reading articles which claim that the exchange rate, or some other variable, should be, or is, a random walk. Often what the authors really mean is that the variable is a martingale, conditional on some information. The martingale property is directly related to the efficient market hypothesis (EMH), which sets out the conditions under which changes in asset prices become unpredictable given different types of information.

6.7 Markov Processes

Markov (footnote 3) processes represent a general type of series with the property that the value at time t contains all information necessary to form probability assessments of all future values of the variable. Compared with the martingale property above, this property is more far reaching. The martingale property is concerned with the conditional expectation of a variable, and not with the actual distribution function and the higher moments of the variable. Markov processes and the associated Markov property are important because they help us to form stochastic time series processes. In economics and finance we like to explain how expectations are generated and how expectations affect the outcome of observed prices and quantities on various markets.
In particular, in financial economics and the pricing of derivatives, we like to model asset prices as continuous stochastic processes. Once we can trace the price of an asset continuously over time into the future, we can also determine the price of derivatives through replication and arbitrage. In addition, we learn how to use derivatives to continuously hedge risky positions (footnote 4).

Footnote 3: Markov is known for a number of results, including the Gauss-Markov theorem on the optimality of OLS among linear unbiased estimators.

Footnote 4: Recall the definition of a derivative asset: it is a financial contract that (1) 'derives' its value from some underlying asset, and (2) at the time of expiration has exactly the same price as the underlying asset.

To predict or generate future possible paths of a Markov variable, we only need to know the most recent value, or the few most recent values, of the variable. This is, in many modeling situations, a very practical assumption. We do not need to know the history of the variable to learn how it 'behaves', nor do we need to know actual values/observations of the future. The future of the series can be generated conditional on its present.

Let $F(x_1, x_2, \dots, x_t)$ be the distribution function of the random variable $\tilde{X}_t$. There are $1, 2, \dots, t$ observations of the series, where t might be equal to infinity. For each observation ($x_i$) there is a probability statement, $F(x_1, x_2, \dots, x_t) = \text{Prob}(\tilde{X}_1 \leq x_1, \tilde{X}_2 \leq x_2, \dots, \tilde{X}_t \leq x_t)$. A discrete time Markov process is characterized by the following property,

$$\text{Prob}(\tilde{X}_{t+s} \leq x_{t+s} \mid x_1, x_2, \dots, x_t) = \text{Prob}(\tilde{X}_{t+s} \leq x_{t+s} \mid x_t), \tag{6.24}$$

where $s > 0$. The expression says that all probability statements about future values of the random variable $\tilde{X}_{t+s}$ depend only on the value the variable takes at time t, and do not depend on earlier realizations. By stating that a variable is a Markov process we put a restriction on the memory of the process.
The AR(1) model, and the random walk, are first-order Markov processes,

$$x_t = a_1 x_{t-1} + \varepsilon_t \quad \text{where } \varepsilon_t \sim NID(0, \sigma^2). \tag{6.25}$$

Given that we know that $\varepsilon_t$ is a white noise process [$NID(0, \sigma^2)$], and can observe $x_t$, we know all there is to know about $x_{t+1} = a_1 x_t + \varepsilon_{t+1}$, since $x_t$ contains all information about the future. In practical terms, it is not necessary to work with the whole series, only a limited 'present'. We can also say that the 'future' of the process, given the 'present', is independent of the 'past'. For a first order Markov process, the expected value of $\tilde{X}_{t+1}$, given all its possible present and historical values $\tilde{X}_t, \tilde{X}_{t-1}, \tilde{X}_{t-2}, \dots$, can be expressed as

$$E\left[\tilde{X}_{t+1} \mid \tilde{X}_t, \tilde{X}_{t-1}, \tilde{X}_{t-2}, \dots\right] = E\left[\tilde{X}_{t+1} \mid \tilde{X}_t\right]. \tag{6.26}$$

Thus, for a first order Markov process the conditional expectation depends only on the present value. When, in addition, $E[\tilde{X}_{t+1} \mid \tilde{X}_t] = x_t$, as for the random walk, the process is also a martingale. Typically, the value of $\tilde{X}_t$ is known at time t as $x_t$. The Markov property is a very convenient property if we want to build theoretical models describing the continuous evolution of asset prices. We can focus on the value today, and generate future time series, irrespective of the past history of the process. Furthermore, at each period in the future we can easily determine an 'exact' future value, which is the 'equilibrium' price for that period.

The white noise process, as an example, is a Markov process. This follows from the fact that we assumed that each $\varepsilon_t$ was independent of its own past and future. One outcome of the assumption of a normal and independent process was that we could relatively easily form predictions and confidence intervals given only the value of $\varepsilon_t$ 'today'.

The definition of a Markov process can be extended to m:th order Markov processes, for which we have

$$E\left[\tilde{X}_{t+1} \mid \tilde{X}_t, \tilde{X}_{t-1}, \tilde{X}_{t-2}, \dots\right] = E\left[\tilde{X}_{t+1} \mid \tilde{X}_t, \tilde{X}_{t-1}, \dots, \tilde{X}_{t-m+1}\right], \tag{6.27}$$

where we need to condition on m historical (random) values (including the 'present' value $\tilde{X}_t$) to predict the future.

6.8 Brownian Motions

Consider the random walk model, $x_t = x_{t-1} + \varepsilon_t$, and assume that the distance between t and t-1 becomes smaller and smaller. As the distance between the observations gets smaller, the function will in the end get so close to a continuous function that it becomes indistinguishable from a function in continuous time, $x(t) = x(t-1) + \varepsilon(t)$. This takes us to the random walk in continuous time, known as a Brownian motion or Wiener process. This section introduces Brownian motions (Wiener processes), the geometric Brownian motion, jump diffusion models and the Ornstein-Uhlenbeck process.

There are (at least) two very important reasons for studying Wiener processes. The first is that the limiting distributions of most non-stationary variables in economics and finance are given as functions of a Brownian motion. It is this knowledge that helps us to understand the distribution of estimates based on non-stationary variables. The second reason for learning about Brownian motions is that they play an important role in modeling asset prices in finance. A word of warning: though Brownian motions have nice mathematical properties, it is not necessarily so that they also fit a given data series better. Ordinary discrete time empirical modelling will take you a long way.

The random walk is defined in discrete time. The intuition behind the random walk and the Brownian motion is as follows. If we let the steps between t and t-1 become infinitely small, the random walk can be said to converge to a Brownian motion (or Wiener process). As the distance between t and t-1, alternatively between t and t+1, shrinks, it becomes harder and harder to distinguish between a discrete time process and a continuous time process. In the end, the difference will be so small that it will not matter.

These processes have a long history.
The Brownian motion was named after a Scottish botanist, Robert Brown, who in 1827 observed that small particles immersed in a liquid exhibited ceaseless irregular motion. Brown himself, however, named a few persons who had observed this phenomenon before him. In 1900 a French mathematician named Bachelier described the random variation in stock prices when he wanted to explain option prices. In 1905 Einstein gave a theoretical explanation of this behavior in terms of the motion of molecules. Finally, Norbert Wiener gave the process a rigorous mathematical treatment in a series of papers between 1918 and 1923.

Is there a difference between what we call a Wiener process and a Brownian motion? In practice the answer is no. The two terms can be, and are, used interchangeably. If you look at the details you will find that the Brownian motion has normally distributed increments. The Wiener process, on the other hand, is explicitly assumed to be a martingale. No such statement is made for the Brownian motion (footnote 5). In practice, these differences mean nothing (for more information search for the Lévy theorem). In econometrics there is a tendency to use Wiener processes to represent univariate processes and Brownian motion for multivariate processes.

Footnote 5: See Neftci, Salih (2000), An Introduction to the Mathematics of Financial Derivatives, 2 ed. Academic Press, Amsterdam.

The most important characteristic of a Brownian motion is that all increments are independent, and not predictable from the past. Thus the Brownian motion can be said to be a martingale and it fulfills the Markov property. The latter means that the distribution of future values at (t + dt) depends only on the current value of x(t). This is a good characteristic of models describing uncertainty, in particular situations when nature is evolving as a function of random steps that we cannot predict.
The further we look into the future, the larger the number of random changes becomes, and probability statements about future events get harder and harder.

A generalized (arithmetic) Brownian motion is written as

$$dx_t = \mu\, dt + \sigma\, dW_t, \tag{6.28}$$

where d represents the continuous, or infinitesimally small, change in the variable x over the time interval dt. This can be written as $dx_t = x(t + dt) - x(t)$. The parameters $\mu$ and $\sigma$ are real numbers (constants), where $\sigma$ is strictly positive. As in a random walk, the term $\mu\, dt$ represents the drift and $\sigma\, dW$ can be said to add stochastic noise to the series. W represents a standardized Wiener process, or Brownian motion, such that dW represents the differential of the Brownian motion, and $dW_t = W(t + dt) - W(t)$ has a normal distribution with mean zero and variance equal to dt (standard deviation $\sqrt{dt}$).

It is easy to see that $\mu\, dt$ represents a drift term. Take the expected value of the process, $E(dx_t) = \mu\, dt + 0$. Both $\mu$ and dt are non-stochastic, and dW has an expected value of zero. It follows that

$$\mu = \frac{1}{dt} E(dx_t)$$

represents the average change in x per unit of time. Of course, if $\mu = 0$ we have a driftless random walk in continuous time, $E(dx_t) = E(\sigma\, dW_t) = 0$. The variance is $Var(dx_t) = \sigma^2 Var(dW) = \sigma^2 dt$. Not shown here is that the changes in x ($dx_t$) are independent and stationary.

6.9 Brownian motions and the sum of white noise

In terms of the change over a specific (possibly) observable time period we need to introduce the notation $\Delta t$ to represent the change over some fraction of time t. By using this notation we can let t be a year or a month, and then by changing $\Delta$ we can let the length of the period become smaller and smaller. The change due to the deterministic trend is written, per unit of time, as $\mu \Delta t$. The stochastic noise that we add to dx over a given interval is written as $\sigma \varepsilon_t \sqrt{\Delta t}$, where $\varepsilon_t \sim NID(0, 1)$. In the limit, as $\Delta \to 0$, we have that $\Delta x_t \to dx_t$ and $\Delta t \to dt$. In terms of small intervals the Brownian motion becomes

$$\Delta x_t = \mu \Delta t + \sigma \varepsilon_t \sqrt{\Delta t}. \tag{6.29}$$

To understand the asymptotic properties of the Brownian motion we could let the time horizon go to infinity, but there is a better way to see what happens. As we study a standardized Brownian motion/Wiener process W(t) over the interval [0, T], we will find that we can divide this interval into segments $t_i - t_{i-1}$,

$$0 = t_0 < t_1 < t_2 < \dots < t_i < \dots < t_n = T. \tag{6.30}$$

Let the length of each segment be $\Delta = t_i - t_{i-1}$, and assume that there is a random variable $\Delta W_t$ that takes on either the value $+\sqrt{\Delta}$ or $-\sqrt{\Delta}$, each with probability one half. Furthermore, assume that $\Delta W_{t_i}$ is independent of $\Delta W_{t_j}$ for $i \neq j$, so that each increment is uncorrelated with other increments. The Wiener process is now defined as the sum of $\Delta W_{t_i}$ as $n \to \infty$ (that is, $\Delta \to 0$), which is the same as saying that as the interval [0, T] is divided into finer and finer segments, we have

$$W(t) = \sum_{i=1}^{n} \Delta w_{t_i} \quad \text{as } n \to \infty. \tag{6.31}$$

An extension of this, if $\varepsilon_t \sim NID(0, \sigma^2)$, is that the sum of the terms $\Delta W_t = \varepsilon_t / (\sigma\sqrt{T})$ will also converge to a Wiener process. Thus, the sum of a standardized white noise will also converge to a standardized Wiener process. This result is crucial for the understanding of the distribution of a random walk and other 'unit root' variables.

6.9.1 The geometric Brownian motion

The arithmetic Brownian motion is not well suited for asset prices, as their changes seldom display a normal distribution. The log of asset prices, and returns, are better described with a normal distribution. This takes us to the geometric Brownian motion,

$$\frac{dx_t}{x_t} = \mu\, dt + \sigma\, dW_t.$$

What happens here is that we assume that $\ln x_t$ has a normal distribution, meaning that $x_t$ follows a log normal distribution, and that $\mu\, dt + \sigma\, dW_t$ is a normal variable. Ito's lemma can be used to show that

$$d \ln x_t = \left(\mu - \frac{\sigma^2}{2}\right) dt + \sigma\, dW_t.$$

The expected value of the geometric Brownian motion is $E(dx_t / x_t) = \mu\, dt$, and the variance is $Var(dx_t / x_t) = \sigma^2 dt$.

There are several ways in which the model can be modified to better suit real world asset prices.
One way is to introduce jumps in the process, so-called "jump diffusion models". This is done by adding a Poisson process to the geometric Brownian motion,

$$\frac{dx_t}{x_t} = \mu\, dt + \sigma\, dW_t + U_t\, dN(\lambda),$$

where $U_t$ is a normally distributed random variable and $N_t$ represents a Poisson process with intensity $\lambda$, which accounts for jumps in the price process.

The random walk model is good for asset prices, but not for interest rates. The movements of interest rates are more bounded than those of asset prices. In this case the so-called Ornstein-Uhlenbeck process provides a more realistic description of the dynamics,

$$dr_t = \alpha (b - r_t)\, dt + \sigma\, dW_t,$$

where $\alpha > 0$ governs the speed of adjustment. Thus the idea behind the Ornstein-Uhlenbeck process is that it restricts the movements of the variable (r) to be mean reverting, or to stay in a band, around b, where b can be zero.

6.9.2 A more formal definition

Let $\tilde{X}(t)$ be a Wiener process, $0 \leq t < \infty$. The series always starts in zero, $\tilde{X}(0) = 0$, and if $t_0 \leq t_1 \leq t_2 \leq \dots \leq t_n$, then all increments of $\tilde{X}(t_i)$ are independent. In terms of the density function we have

$$D[x(t_1) - x(t_0), x(t_2) - x(t_1), \dots, x(t_n) - x(t_{n-1}) \mid t_0, t_1, \dots, t_n] \tag{6.32}$$

$$= \prod_{i=1}^{n} D[x(t_i) - x(t_{i-1}) \mid t_0, t_1, \dots, t_n]. \tag{6.33}$$

The expected value of each increment is zero,

$$E\left[\tilde{X}(t_n) - \tilde{X}(t_{n-1})\right] = 0, \tag{6.34}$$

with a variance

$$var\left[\tilde{X}(t) - \tilde{X}(s)\right] = \sigma^2 (t - s), \tag{6.35}$$

where $0 \leq s < t$. Finally, since the increments are a martingale difference process, we can assume that these increments follow a normal distribution, so $\tilde{X}(t) - \tilde{X}(s) \sim N[0, \sigma^2 (t - s)]$. These assumptions lead to the density function

$$D[x(t)] = \frac{1}{\sigma\sqrt{2\pi t_1}} \exp\left[-\frac{x_1^2}{2\sigma^2 t_1}\right] \prod_{i=2}^{n} \frac{1}{\sigma\sqrt{2\pi (t_i - t_{i-1})}} \exp\left[-\frac{(x_i - x_{i-1})^2}{2\sigma^2 (t_i - t_{i-1})}\right]. \tag{6.36}$$

When $\sigma = 1$, the process is called a standard Wiener process or standard Brownian motion. That the Brownian motion is quite special can be seen from this density function. The sample path is continuous, but is not differentiable. [In physics this is explained as the motion of a particle which at no time has a velocity.]
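The construction of a Wiener process as a sum of small independent steps, used in section 6.9, can be sketched numerically. The example below is a rough illustration under stated assumptions: the helper `binomial_wiener`, the horizon T = 2, the number of segments n = 200 and the number of replications are hypothetical choices. It checks that the simulated W(T) has mean close to zero and variance close to T, and that the increment over the second half of the interval is uncorrelated with the level reached at the midpoint.

```python
import math
import random

def binomial_wiener(T, n, rng):
    """Sum n independent steps of +sqrt(delta) or -sqrt(delta), delta = T / n."""
    step = math.sqrt(T / n)
    w, path = 0.0, [0.0]          # the path starts at W(0) = 0
    for _ in range(n):
        w += step if rng.random() < 0.5 else -step
        path.append(w)
    return path

rng = random.Random(2011)
T, n, reps = 2.0, 200, 3000
end_vals, half_vals = [], []
for _ in range(reps):
    path = binomial_wiener(T, n, rng)
    half_vals.append(path[n // 2])   # W(T/2)
    end_vals.append(path[-1])        # W(T)

mean_T = sum(end_vals) / reps
var_T = sum(v * v for v in end_vals) / reps - mean_T ** 2

# Increment over (T/2, T] against the level at T/2: should be uncorrelated.
incr = [e - h for e, h in zip(end_vals, half_vals)]
cross = sum(i * h for i, h in zip(incr, half_vals)) / reps
print(mean_T, var_T, cross)
```

Increasing `n` for a fixed `T` makes each path finer without changing the variance of W(T), which is the sense in which the sum converges as the interval is divided into finer and finer segments.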
Wiener processes are of interest in economics for many reasons. First, they offer a way of modeling uncertainty, especially in financial markets, where we sometimes have an almost continuous stream of observations. Secondly, many macroeconomic variables appear to be integrated or near integrated. The limiting distributions of such variables are known to be best described as functions of Wiener processes. In general we must assume that these distributions are non-standard.

To sum up, there are five important things to remember about the Brownian motion/Wiener process:

It represents the continuous time (asymptotic) counterpart of random walks.

It always starts at zero and is defined over $0 \leq t < \infty$.

The increments, meaning any change between two points, regardless of the length of the interval, are not predictable, are independent, and are distributed as $N(0, (t - s)\sigma^2)$, for $0 \leq s < t$.

It is continuous over $0 \leq t < \infty$, but nowhere differentiable. The intuition behind this result is that the differential implies predictability, which would go against the previous condition.

Finally, a function of a Brownian motion/Wiener process will behave like a Brownian motion/Wiener process.

The last characteristic is important, because most economic time series variables can be classified as random walks, integrated or near-integrated processes. In practice this means that their variances, covariances etc. have distributions that are functionals of Brownian motions. Even in small samples, functionals of Brownian motions will better describe the distributions associated with economic variables that display tendencies of stochastic growth.

7. INTRODUCTION TO TIME SERIES MODELING

"Time is a great teacher, but unfortunately it kills all its pupils" Louis Hector Berlioz

A time series is simply data ordered by time. And time series analysis is simply the set of approaches that look for regularities in these data ordered by time.
Stochastic time series play an important part in economics and finance. To forecast and analyse these series it is necessary to take into account not only their stochastic nature but also the fact that they are non-stationary, dependent over time and by nature correlated with each other. In theoretical models, the emphasis on intertemporal decision making highlights the role expectations play in a world where decisions must be made from information sets made up of stochastic processes.

All time series techniques aim at making the series more understandable by decomposing them into different parts. This can be done in several ways. The aim of this introduction is to give a general overview of the subject. A time series is any sequence ordered by time. The sequence can be either deterministic or stochastic. The primary interest in economics is in stochastic time series, where the sequence is made up of random variables. A sequence of stochastic variables ordered by time is called a stochastic time series process.

The random variables making up the process can either be discrete, taking on a given set of integer numbers, or be continuous random variables taking on any real number between $\pm\infty$. While discrete random variables are possible, they are not common.

Stochastic time series can be analysed in the time domain or in the frequency domain. The former approach analyses stochastic processes in given time periods like days, weeks, years etc. The frequency approach aims at decomposing the process into frequencies by using trigonometric functions like sines, etc. Spectral analysis is an example of analysis that uses the frequency domain to identify regularities like seasonal factors, trends, and systematic lags in adjustment. In economics and finance, where we are faced with given observations and we study the behavior of agents operating in real time, the time domain is the most interesting road ahead.
There are relatively few problems that are interesting to analyze in the frequency domain.

Another dimension in modeling is whether processes are in discrete time or in continuous time. The principal difference here is that the stochastic variables in a continuous time process can be measured at any time t, and that they can take different values at any time. In a discrete time process, the variables are observed at fixed intervals of time (t), and they do not change between these observation points. Discrete time variables are not common in finance and economics. There are few, if any, variables that remain fixed between their points of observation.

The distinction between continuous time and discrete time is not a matter of measurability alone. A common mistake is to be confused by the fact that economic variables are measured at discrete time intervals. The money stock is generally measured and recorded as an end-of-month value. The way of measuring the stock of money does not imply that it remains unchanged between the observation points. Instead it changes whenever the money market is open. The same holds for variables like production and consumption. These activities take place 24 hours a day, during the whole year. They are measured as the flow of income and consumption over a period, typically a quarter, representing the integral sum of these activities.

Usually, a discrete time variable is written with a time subscript ($x_t$) while continuous time variables are written as x(t). The continuous time approach has a number of benefits, but the cost and quality of the empirical results seldom motivate the continuous time approach. It is better to use discrete time approaches as an approximation to the underlying continuous time system. The cost of this simplification is small compared with the complexity of continuous time analysis. This should not be understood as a rejection of continuous time approaches.
Continuous time is good for analyzing a number of well defined problems like aggregation over time and individuals. In the end it should lead to a better understanding of adjustment speeds, stability conditions and interactions among economic time series, see Sjöö (1990, 1995).$^{1}$

Thus, our interest is in analysing discrete time stochastic processes in the time domain. A time series process is generally indicated with braces, like $\{y_t\}$. In some situations it will be necessary to be more precise about the length of the process. Writing $\{y_t\}_{1}^{\infty}$ indicates that the process starts at period one and continues infinitely. The process consists of random variables because we can view each element in $\{y_t\}$ as a random variable. Let the process go from the integer value 1 up to $T$. If necessary, to be exact, the first variable in the process can be written as $y_{t_1}$, the second variable as $y_{t_2}$, and so on up to $y_{t_T}$. The distribution function of the process can then be written as $F(y_{t_1}, y_{t_2}, \dots, y_{t_T})$.

In some situations it is necessary to start from the very beginning. A time series is data ordered by time. A stochastic time series is a set of random variables ordered by time. Let $\tilde{Y}_{it}$ represent the stochastic variable $\tilde{Y}_i$ at time $t$. Observations on this random variable are often indicated as $y_{it}$. A series starting at time $t = 1$ and ending at time $t = T$, consisting of $T$ different random variables, is written as $\{\tilde{Y}_{1,1}, \tilde{Y}_{2,2}, \dots, \tilde{Y}_{T,T}\}$. Of course, assuming that the series is built up of individual random variables, each with its own independent probability distribution, is a complex thought. But nothing in our definition of a stochastic time series rules out that the data are made up of completely different random variables. Sometimes, to understand and find solutions to practical problems, it will be necessary to go all the way back to the most basic assumptions.
Suppose we are given a time series consisting of yearly observations of interest rates, $\{6.6, 7.5, 5.9, 5.4, 5.5, 4.5, 4.3, 4.8\}$. The first question to ask is whether this is a stochastic series in the sense that these numbers were generated by one stochastic process, or perhaps by several different stochastic processes. Further questions are whether the process or processes are best represented as continuous or discrete, and whether the observations are independent or dependent. Quite often we will assume that the series is generated by one and the same stochastic process in discrete time. Based on these assumptions, the modelling process tries to find systematic historical patterns and cross-correlations with other variables in the data.

All time series methods aim at decomposing the series into separate parts in some way. The standard approach in time series analysis is to decompose the series as

$$ y_t = T_{t,d} + S_{t,d} + C_{t,d} + I_t, $$

where $T_d$ and $S_d$ represent (deterministic) trend and seasonal components, $C_{t,d}$ is a deterministic cyclical component and $I$ is a process representing irregular factors.$^{2}$ For time series econometrics this definition is limited.

$^{1}$ We can also mention the different types of series that are used: stocks, flows and price variables. Stocks are variables that can be observed at a point in time, like the money stock or inventories. Flows are variables that can only be observed over some period, like consumption or GDP. In this context price variables include prices, interest rates and similar variables which can be observed at a market at a given point in time. Combining these variables into multivariate processes and constructing econometric models from observed variables in discrete time produces further problems, and in general they are quite difficult to solve without using continuous time methods. Usually, careful discrete time models will reduce the problems to a large extent.
Instead, let $\{y_t\}$ be a stochastic time series process, composed as

$$ y_t = \text{systematic components} + \text{unsystematic component} = T_d + T_s + S_d + S_s + \{y_t^{*}\} + e_t, \qquad (7.1) $$

where the systematic components include a deterministic trend $T_d$, a stochastic trend $T_s$, deterministic seasonals $S_d$, stochastic seasonals $S_s$, a stationary process (the short-run dynamics) $y_t^{*}$, and finally a white noise innovation term $e_t$. The modeling problem can be described as the problem of identifying the systematic components such that the residual becomes a white noise process. For all series, remember that any inference is potentially wrong if not all components have been modeled correctly. This is so regardless of whether we model a simple univariate series with time series techniques, a reduced system, or a structural model. Inference is only valid for a correctly specified model.

Present:
ARIMA: a class of models, ARIMA(p,d,q) and ARFIMA(p,d,q) models; operators; Box-Jenkins; identification tools: ACF, PACF, Q-test. Deal with: non-stationarity, dynamics, trend, seasonal effects, deterministic variables. Theory.
After ARIMA: ARIMAX, transfer function, RDL, ARCH/GARCH.
Structural: single equation, ADL, error correction models (older stuff).
Multivariate: VAR, VECM, SVAR.
Add for VAR: how to build VARs; lags, white noise; lags, dummies, white noise; information criteria + minimum number of equations with AR.
Add for rational expectations: GMM, GARCH.

$^{2}$ For simplicity we assume a linear process. An alternative is to assume that the components are multiplicative, $x_t = T_{t,d} \times S_{t,d} \times C_{t,d} \times I_t$.

7.1 Descriptive Tools for Time Series

Random variables are described by their moments. Stochastic time series can be described by their means, variances and autocovariances.
Given a random variable $\tilde{Y}_t$ which generates an observed process $\{y_t\}$, the mean and the variance are $E\{\tilde{Y}_t\} = \mu$ and $\operatorname{var}\{\tilde{Y}_t\}$. The autocovariance at lag $k$ is

$$ \gamma_k = \operatorname{cov}(\tilde{Y}_t, \tilde{Y}_{t-k}) = E\{[\tilde{Y}_t - E(\tilde{Y}_t)][\tilde{Y}_{t-k} - E(\tilde{Y}_{t-k})]\}. $$

The dimension of a covariance measure makes it difficult to judge the strength of the relation. For practical work, a more useful measure is provided by the autocorrelation,

$$ \rho_k = \frac{\operatorname{cov}(\tilde{Y}_t, \tilde{Y}_{t-k})}{\sqrt{\operatorname{var}(\tilde{Y}_t)\operatorname{var}(\tilde{Y}_{t-k})}} = \frac{\gamma_k}{\gamma_0}, $$

where $\rho_k$ is the autocorrelation between a realisation of the series at time $t$ and at time $t-k$. Since the autocovariance is scaled by the variance, the autocorrelation comes out as a number between $-1$ and $1$. The autocorrelation can be calculated for any lag $k \geq 1$, and the resulting sequence is therefore generally referred to as the autocorrelation function (ACF). Furthermore, if the series has a stationary mean and variance, it does not matter if we calculate the correlation function (or the autocovariances) backwards or forwards, $\rho_k = \rho_{-k}$.

The ACF tells us the following: the higher the value of $\rho_k$, the stronger is the memory of the series. By studying how the autocorrelation changes as the distance between $t$ and $t-k$ changes, we can see if the autocorrelations tend to die out slowly or quickly, or remain constant for a given number of $k$.$^{3}$ If the ACF is close to unity and dies out slowly, this is a sign of a non-stationary variable. On the other hand, if the ACF is zero, it is a sign of a white noise process where no historical values can predict coming observations of the same series.

For an observed time series, the sample autocorrelation function becomes

$$ \hat{\rho}_k = \frac{\frac{1}{T-k}\sum_{t=k+1}^{T}(y_t - \bar{y})(y_{t-k} - \bar{y})}{\frac{1}{T}\sum_{t=1}^{T}(y_t - \bar{y})^2}, \qquad k = 0, 1, 2, 3, \dots, $$

where $T$ is the number of observations and $\bar{y}$ is the sample mean, $\bar{y} = (1/T)\sum_{i=1}^{T} y_i$. In practical work, the standard assumption is a constant variance over the sample, so that $\operatorname{var}(y_t) = \operatorname{var}(y_{t-k})$. The sample autocorrelations are estimates and therefore random variables; they are associated with variances.
Bartlett (1946) shows that the variance of the $k$:th sample autocorrelation is

$$ \operatorname{var}(\hat{\rho}_k) = \frac{1}{T}\left[1 + 2\sum_{j=1}^{k-1}\hat{\rho}_j^{2}\right]. $$

Given the variance, and thereby the standard deviation, of the estimate, it becomes possible to set up a significance test. Asymptotically, this t-test has a normal distribution, with an expected value of zero under the null of no autocorrelation (no memory in the series). For a limited sample, a value of $\hat{\rho}_k$ larger than two times its standard error is considered significant.

The next question is how much autocorrelation is left between the observations at $t$ and $t-k$ ($\tilde{Y}_t$ and $\tilde{Y}_{t-k}$) after we remove (condition on) the autocorrelation at the lags in between. Removing this autocorrelation means that we first calculate the mean of $\tilde{Y}_t$ conditional on all observations from $\tilde{Y}_{t-1}$ to $\tilde{Y}_{t-k+1}$. Another way of expressing this is to say that we filter $\tilde{Y}_t$ from the influence of all lags of $\tilde{Y}_t$ between $t-1$ and $t-k+1$. Using the expectations operator, we define the conditional mean as $E\{\tilde{Y}_t \mid y_{t-1}, y_{t-2}, \dots, y_{t-k+1}\} = \tilde{Y}_t^{*}$. The partial autocorrelation is then the slope coefficient in a regression between $\tilde{Y}_t^{*}$ and $\tilde{Y}_{t-k}$. This leads to the following definition of the partial autocorrelation function (PACF),

$$ \phi_k = \frac{\operatorname{cov}(\tilde{Y}_t, \tilde{Y}_{t-k} \mid y_{t-1}, \dots, y_{t-k+1})}{\operatorname{var}(\tilde{Y}_{t-k})}. \qquad (7.2) $$

The partial autocorrelation at lag $k$ can be recognised as the coefficient on the lag $y_{t-k}$ in the autoregressive regression

$$ y_t = a_0 + a_1 y_{t-1} + \dots + \phi_k y_{t-k} + e_t. \qquad (7.3) $$

$^{3}$ Standard practice is to calculate the first $K \leq T/4$ sample autocorrelations.

Notice the difference: the partial autocorrelation $\phi_k$ is a definition, not an estimate.
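Returning to the sample autocorrelation function and Bartlett's variance formula, the calculations can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `sample_acf` and `bartlett_se` are our own, the numerator sum is assumed to run over $t = k+1, \dots, T$ with the $1/(T-k)$ scaling, and the data are the eight interest rate observations used as an example earlier in the chapter.

```python
import math

def sample_acf(y, max_lag):
    """Sample autocorrelations rho_hat[0..max_lag].

    Numerator: 1/(T-k) times the sum over t = k+1..T of the cross products.
    Denominator: the sample variance with divisor T.
    """
    T = len(y)
    mean = sum(y) / T
    denom = sum((v - mean) ** 2 for v in y) / T
    acf = []
    for k in range(max_lag + 1):
        num = sum((y[t] - mean) * (y[t - k] - mean) for t in range(k, T)) / (T - k)
        acf.append(num / denom)
    return acf

def bartlett_se(acf, T):
    """Bartlett standard errors: var(rho_hat_k) = (1 + 2*sum_{j<k} rho_hat_j^2) / T."""
    se = [0.0]  # lag 0 is identically one, so it has no sampling error
    for k in range(1, len(acf)):
        var_k = (1.0 + 2.0 * sum(r * r for r in acf[1:k])) / T
        se.append(math.sqrt(var_k))
    return se

# Yearly interest rates from the example in the text.
rates = [6.6, 7.5, 5.9, 5.4, 5.5, 4.5, 4.3, 4.8]
acf = sample_acf(rates, 2)   # K = 2 respects the rule of thumb K <= T/4
se = bartlett_se(acf, len(rates))
for k in range(1, len(acf)):
    flag = "significant" if abs(acf[k]) > 2 * se[k] else "not significant"
    print(f"lag {k}: rho = {acf[k]:.3f}, se = {se[k]:.3f}, {flag}")
```

With only eight observations the standard errors are large, so the example is meant to show the mechanics rather than to support any conclusion about the series.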
The first partial autocorrelation is estimated by regressing $y_t$ on $y_{t-1}$, the second partial autocorrelation is estimated by regressing $y_t$ on $y_{t-1}$ and $y_{t-2}$, and so on.$^{4}$ The partial autocorrelation function can be estimated through regression techniques, by the so-called Yule-Walker estimator, or alternatively by recursive techniques (Durbin 1961). The recursive technique utilises the fact that the first autocorrelation is equal to the first partial autocorrelation, $\hat{\rho}_1 = \hat{\phi}_1$; given $\hat{\phi}_1$, the higher order $\hat{\phi}_i$ are then solved step by step in a recursive equation system.

The complicating factor is to estimate the variance of the partial autocorrelation function. If a regression technique is used, the estimated regression variance of $\hat{\phi}_k$ is not a correct estimate of the variance, because until the residual process is white noise, or at least free from autocorrelation, the estimated variance is inefficient. Furthermore, the other (older) techniques of estimating the PACF do not involve a variance estimate in the same way as the OLS estimator of $\phi_k$. The solution, therefore, is to assume that the estimated $\hat{\phi}_k$:s are a white noise process. Anderson (1944) shows that the asymptotic variance of a white noise series is $1/T$. This leads to the (asymptotic) significance test $\hat{\phi}_k/(1/\sqrt{T})$. As a practical rule of thumb, in a limited sample, a test statistic greater than 2 is considered significant and leads to a rejection of the null of $\phi_k = 0$.

The PACF informs about the length of an autoregressive process. The necessary number of lags to describe an autoregressive process of order $p$ ends at $\phi_p$.

A closer look at these measures, and the way they are calculated, reveals that they are only interesting for stationary series. The same holds for the mean, the variance and other moments. The two measures, the ACF and the PACF, are complementary to other descriptive devices such as the mean, the variance and the kurtosis. The ACF and the PACF describe the memory of a process.
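The recursive route from autocorrelations to partial autocorrelations can be sketched as follows. The code uses the Durbin-Levinson recursion, which is one standard way of solving the Yule-Walker equations step by step; we assume this is the kind of recursion the text refers to, and the function name `pacf_from_acf` is hypothetical.

```python
def pacf_from_acf(rho):
    """Partial autocorrelations phi_1..phi_K from autocorrelations rho[0..K].

    rho[0] must equal 1. Uses the Durbin-Levinson recursion:
      phi_kk = (rho_k - sum_j phi_{k-1,j} rho_{k-j}) / (1 - sum_j phi_{k-1,j} rho_j)
      phi_{k,j} = phi_{k-1,j} - phi_kk * phi_{k-1,k-j},  j = 1..k-1
    """
    K = len(rho) - 1
    pacf = []
    prev = []  # coefficients phi_{k-1,1}, ..., phi_{k-1,k-1}
    for k in range(1, K + 1):
        if k == 1:
            phi_kk = rho[1]  # first PACF equals first ACF
            cur = [phi_kk]
        else:
            num = rho[k] - sum(prev[j] * rho[k - 1 - j] for j in range(k - 1))
            den = 1.0 - sum(prev[j] * rho[j + 1] for j in range(k - 1))
            phi_kk = num / den
            cur = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)]
            cur.append(phi_kk)
        pacf.append(phi_kk)
        prev = cur
    return pacf

# Theoretical ACF of an AR(1) with coefficient 0.6: rho_k = 0.6**k.
ar1_rho = [0.6 ** k for k in range(5)]
print(pacf_from_acf(ar1_rho))  # first value 0.6, the rest (numerically) zero
```

Fed with the theoretical ACF of an AR(1), the recursion returns a PACF that cuts off after lag one, which is the pattern used for identification. In applied work the input would be the sample autocorrelations, and each $\hat{\phi}_k$ would be compared with $2/\sqrt{T}$.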
The ACF and the PACF explain if and how a series can be predicted from its own past. They help us to identify which type of process we are studying: a white noise process, an integrated process, an AR process, an MA process, or an ARMA process. A white noise series is recognized by its lack of significant ACF and PACF coefficients. Integrated variables are identified by the fact that their ACF dies out very slowly, in combination with at least one PACF coefficient close to unity. Stationary ARMA models are identified with the following identification scheme:

          AR(p)                MA(q)                ARMA
ACF       Tails off            Cuts off at lag q    Tails off
PACF      Cuts off at lag p    Tails off            Tails off

$^{4}$ Notice that in the regression, the parameters $a_1, a_2, \dots, a_{k-1}$ are not identical to $\phi_1, \phi_2, \dots, \phi_{k-1}$, due to the (possible) correlation between $y_{t-1}$ and other lags like $y_{t-2}$. The regression formula only identifies the "last" coefficient, at lag $k$, as the PACF $\phi_k$.

The identification scheme above is a direct consequence of the properties of each type of model, and the properties of each model can be calculated theoretically. These calculations are an important part of time series analysis and we will come back to them below.

The idea behind ARIMA modeling is to first calculate the ACF and the PACF and use these to form an idea about the order of integration $d$ and the orders $p$ and $q$. The second step, given what we know about the orders $d$, $p$ and $q$, is to estimate an ARIMA model. The third step is to test the estimated model for autocorrelation in the residual. The fourth step is to reestimate models to find the best model according to three criteria: (i) no autocorrelation, (ii) the lowest possible residual variance and (iii) not so many parameters that the model becomes too complex.

7.1.1 Weak and Strong Stationarity

A fundamental issue when analyzing time series processes is whether they are stationary or not.
As a first, general definition, we can say that a non-stationary series changes its behavior over time such that the mean is changing over time. Many economic time series are non-stationary in the sense that they are growing over time, their estimated variances are also growing, and the covariance function never dies out. In other words, the calculated mean, autocovariances and so on depend on the time period we study, and inference becomes impossible. A stationary series, on the other hand, displays a behavior which is independent of the time period, and it becomes possible to test for significance.

Non-stationarity must either be removed before modeling or included in the model. This requires that we know what type of non-stationarity we are dealing with. The problem with non-stationarity is that a series can be non-stationary in an infinite number of ways. To make the problem even more complex, some types of non-stationarity will skew the distributions of the estimates such that inference based on standard distributions such as the $t$, the $F$ or the $\chi^2$ distributions is not only wrong but completely misleading. In order to model time series, we need to understand what non-stationarity is, how to estimate it and how to deal with it.

7.1.2 Weak Stationarity, Covariance Stationary and Ergodic Processes

Of the two concepts, weak stationarity is the practical one. Weak stationarity is defined in terms of the first two moments of the process, the mean and the variance. A process $\{x_t\}$ is (weakly) stationary if (1) the mean is independent of time $t$,

$$ E\{x_t\} = \mu, $$

(2) the variance exists and is less than infinity,

$$ \operatorname{var}\{x_t\} = \sigma^2 < \infty, $$

and (3) the autocovariance is

$$ \operatorname{cov}\{x_t, x_{t-k}\} = \gamma_k. $$

Thus, the mean and the variance are constant over time, and the covariance between two values of the process is only a function of the distance between the two points.
A related concept is that of covariance stationarity: if the autocovariances go to zero as the distance between the two points increases, the series is said to be covariance stationary (or ergodic),

$$ \operatorname{cov}(x_t, x_{t-k}) \to 0 \quad \text{as } k \to \infty. $$

This definition brings us to the concept of ergodicity, which can be understood as a weak form of average asymptotic independence. The most important condition, necessary but not sufficient, for a series to be ergodic is

$$ \lim_{T \to \infty}\left(T^{-1}\sum_{k=1}^{T}\operatorname{cov}(x_t, x_{t-k})\right) = 0. $$

Compared with the former concept, $\operatorname{cov}(x_t, x_{t-k}) \to 0$, ergodicity implies a restriction on the strength of the covariance structure. As more and more autocovariances are calculated, their mean should go to zero. The term ergodic is used in connection with stationarity conditions.

7.1.3 Strong Stationarity

Strong stationarity is defined in terms of the distribution function of $\{x_t\}$. Suppose a process is ordered from observation 1 up to observation $T$. Each observation up to $T$ can be thought of as a random variable. Hence we can write the first variable in the process as $x_{t_1}$, the second variable as $x_{t_2}$, and so on up to $x_{t_T}$. The distribution function for this process is $F(x_{t_1}, x_{t_2}, \dots, x_{t_T})$. Next, define the distribution function of $\{x_t\}$ for another time interval, shifted by $j$ periods, where $j = 1, 2, \dots, T$. This leads to the distribution function $F_j(x_{t_1+j}, x_{t_2+j}, \dots, x_{t_T+j})$. Strong stationarity requires that the two distribution functions are identical,

$$ F(x_{t_1}, x_{t_2}, \dots, x_{t_T}) = F_j(x_{t_1+j}, x_{t_2+j}, \dots, x_{t_T+j}), $$

meaning that the characteristics of the process are independent of time. We will get the same means, variances and so on independently of the time period we choose for our calculations. By letting $j$ take different integer values we get the $j$:th order strong stationarity. Thus, $j = 1$ leads to first order (strong) stationarity, etc. Strong stationarity incorporates the definition of weak stationarity.
But the practical problem is that it is difficult to work with distribution functions for continuous random variables, so strong stationarity is mainly a theoretical concept.

In this chapter we deal with a very broad class of models named ARMA models, autoregressive moving average models. These are a set of models that describe the process $\{x_t\}$ as a function of its own lags and a white noise process. The autoregressive model of order $p$, AR($p$), is

$$ x_t = a_0 + a_1 x_{t-1} + \dots + a_p x_{t-p} + e_t, $$

where $e_t$ is a white noise process. A moving average model of order $q$, MA($q$), is defined as

$$ x_t = a_0 + e_t - b_1 e_{t-1} - \dots - b_q e_{t-q}, $$

where $e_t$ is a white noise process. The combination of autoregressive and moving average processes gives the ARMA($p$,$q$) model

$$ x_t = a_0 + a_1 x_{t-1} + \dots + a_p x_{t-p} + e_t - b_1 e_{t-1} - \dots - b_q e_{t-q}. $$

In addition we have integrated processes. An integrated process is defined as follows: a process $x_t$ is said to be integrated of order $d$, written $I(d)$, if it contains no deterministic components, is non-stationary in levels, but becomes stationary after differencing $d$ times. Thus, a stationary series is denoted $x_t \sim I(0)$, a first order integrated series is denoted $I(1)$, etc.

To analyse time series it is necessary to introduce additional descriptive statistical tools besides means and variances. Then, to handle the equations in an efficient way, we need a set of operators. We also need to classify time series as stationary or non-stationary. The descriptive devices are autocovariances, autocorrelations and partial autocorrelations. For the classification we need the concepts of weak and strong stationarity, and ergodic processes. The operators needed are the sum operator, the lag operator and the difference operator.

7.1.4 Finding the Optimal Lag Length and Information Criteria

In empirical work, the question is to find the correct lag length.
If we choose too few lags the model will by definition be misspecified, and the assumption of normally distributed white noise residuals will be wrong. On the other hand, adding more lags to the AR or MA process will make the model capture more of the possible memory of the process, but the estimates will be inefficient. We need to add as few lags as possible, without rejecting the assumption of white noise residuals. The Box-Jenkins method suggests that we start with a relatively large number of lags and test for autocorrelation. Among those models which have no significant autocorrelation, we then pick the model with the lowest possible information criterion.

In the Box-Jenkins approach, testing for white noise is equal to testing for autocorrelation. The typical test for autocorrelation is the Box-Pierce test, also known as the portmanteau test or the Q-test, with the Ljung-Box test as a common variant. To test for $p$:th order autocorrelation in a mean adjusted series $\hat{\varepsilon}_t$, calculate the $k$:th order autocorrelation coefficient,

$$ \hat{\rho}_k = \frac{\sum_{t=k+1}^{T}\hat{\varepsilon}_t\hat{\varepsilon}_{t-k}}{\sum_{t=1}^{T}\hat{\varepsilon}_t^{2}}, $$

for $k = 1, 2, \dots, p$. The Box-Pierce test statistic is then given by

$$ BP = T\sum_{k=1}^{p}\hat{\rho}_k^{2}. $$

Under the null of no autocorrelation this test statistic has a $\chi^2(p)$ distribution. The Box-Pierce statistic is best suited for testing the residual in an AR model. A modification, for ARMA and more general regression models, is the so-called Box-Ljung statistic,

$$ BL = T(T+2)\sum_{r=1}^{p}\frac{\hat{\rho}_r^{2}}{(T-r)}, $$

which is also distributed as $\chi^2(p)$.

Given that the residuals of the estimated ARMA model do not display autocorrelation, we can turn to the optimal lag length. An information criterion is simply a version of the adjusted $R^2$ value. In an ordinary linear regression, as more explanatory variables are added to the model, the $R^2$ value will go up, and the efficiency of the estimated parameters will go down.
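The two portmanteau statistics defined above are easy to compute directly from a residual series. The sketch below is only an illustration under stated assumptions: the function names are ours, the input is assumed to be mean adjusted already, and the alternating series is an artificial example chosen so that the arithmetic can be checked by hand. The statistic would be compared with a $\chi^2(p)$ critical value taken from a table.

```python
def residual_acf(e, p):
    """Autocorrelations rho_1..rho_p of a mean adjusted residual series e."""
    denom = sum(v * v for v in e)
    return [sum(e[t] * e[t - k] for t in range(k, len(e))) / denom
            for k in range(1, p + 1)]

def box_pierce(e, p):
    """BP = T * sum_{k=1}^{p} rho_k^2."""
    T = len(e)
    return T * sum(r * r for r in residual_acf(e, p))

def ljung_box(e, p):
    """BL = T (T + 2) * sum_{r=1}^{p} rho_r^2 / (T - r)."""
    T = len(e)
    rho = residual_acf(e, p)
    return T * (T + 2) * sum(rho[r - 1] ** 2 / (T - r) for r in range(1, p + 1))

# Artificial residuals with strong negative first order autocorrelation.
resid = [1.0, -1.0] * 10      # T = 20, mean zero
print(box_pierce(resid, 1))   # rho_1 = -19/20, so BP = 20 * 0.9025
print(ljung_box(resid, 1))    # the small sample correction makes BL larger than BP
```

Because $(T+2)/(T-r) > 1$ for every $r$, the Box-Ljung statistic is always at least as large as the Box-Pierce statistic computed on the same residuals.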
To compare the $R^2$ values of the same model, estimated with more or fewer explanatory variables, it is necessary to look at the so-called adjusted $R^2$ values. The principle behind an information criterion is to create a measure that rewards us in the modelling process for reducing the residual variance, but punishes us for adding so many lags that the estimates become inefficient and the prediction intervals too wide.

There are several information criteria. They are developed for special situations. In practice, however, they often tend to give the same answer in the end. The most well known criterion is Akaike's Information Criterion (AIC). If we estimate an autoregressive model with $k$ lags from a sample of $T$ observations, Akaike's information criterion is

$$ AIC = \log\hat{\sigma}_{\varepsilon}^{2} + 2k/T, $$

where $\hat{\sigma}_{\varepsilon}^{2}$ is the estimated residual variance. Since an estimated residual variance gets smaller the more lags there are in the model, the last term ($2k/T$) compensates for the number of estimated parameters in the model. The smaller the value of the information criterion the better is the model, as long as there is no autocorrelation.

For models with both AR and MA components Hannan and Rissanen suggested a different criterion,

$$ \log\hat{\sigma}_{\varepsilon}^{2} + (p+q)(\log T)/T, $$

where $p$ and $q$ are the lag orders of the autoregressive and the moving average parts of the model. As for Akaike's criterion, the smaller the value the better the model. From these two 'original' criteria a number of different criteria have been developed, such as the Schwarz information criterion (SIC), the Bayesian information criterion (BIC) and Hatami's information criterion (HIC).

7.1.5 The Lag Operator

When dealing with time series and dynamic econometric models, the expressions are easier to handle with the backward shift operator ($B$) or the lag operator ($L$).$^{5}$ The backward shift operator is the symbol most often used in statistical textbooks. Econometricians tend to use the lag operator more often.
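Returning briefly to the two information criteria of the previous section, their use in lag selection can be sketched as follows. The function names are our own and the residual variances in the example are hypothetical numbers chosen only to show the arithmetic; in practice they would come from estimated models that have already passed the autocorrelation test.

```python
import math

def aic(sigma2, k, T):
    """Akaike's criterion as written in the text: log(sigma2) + 2k/T."""
    return math.log(sigma2) + 2.0 * k / T

def hannan_rissanen(sigma2, p, q, T):
    """Criterion for ARMA(p, q) as written in the text: log(sigma2) + (p+q) log(T)/T."""
    return math.log(sigma2) + (p + q) * math.log(T) / T

def best_ar_order(variances, T):
    """Pick the AR order with the smallest AIC.

    variances: dict mapping lag order k -> estimated residual variance.
    """
    return min(variances, key=lambda k: aic(variances[k], k, T))

# Hypothetical residual variances for AR(1), AR(2) and AR(3), T = 100.
candidate_variances = {1: 1.00, 2: 0.99, 3: 0.985}
for k, s2 in candidate_variances.items():
    print(k, round(aic(s2, k, 100), 5))
print("selected order:", best_ar_order(candidate_variances, 100))
```

In this made-up case the extra lags reduce the residual variance by so little that the penalty term dominates, and the criterion favours the most parsimonious model.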
The first order lag operator is defined as

$$ L x_t = x_{t-1}, \qquad (7.4) $$

or more generally as the $n$:th order lag operator,

$$ L^{n} x_t = x_{t-n}. \qquad (7.5) $$

$^{5}$ The practical difference between using the lag operator or the backward shift operator is that the lag operator also affects the conditional expectations generator $E_t$, which is of interest when working with economic theories dealing with expectations.

The lag operator is an expression such that when it is multiplied with an observation at any given time, it shifts the observation one period backwards in time. In other words, the lag operator can be viewed as a time traveling device, which makes it possible to travel both forwards and backwards in time. A forward shift operator can be constructed along the same lines. Thus, moving forward $n$ observations in the series from an observation at time $t$ is done by $L^{-n} x_t = x_{t+n}$.

The properties of the lag operator imply that we can write an autoregressive expression of order $p$, AR($p$), as

$$ a_0 x_t + a_1 x_{t-1} + a_2 x_{t-2} + \dots + a_p x_{t-p} = a_0 x_t + a_1 L x_t + a_2 L^{2} x_t + \dots + a_p L^{p} x_t = (a_0 + a_1 L + a_2 L^{2} + \dots + a_p L^{p}) x_t = A(L) x_t. \qquad (7.6) $$

Notice that the lag operator can be moved across the equal sign. The AR(1) model, $x_t = a_1 x_{t-1} + \varepsilon_t$, can be written as $(1 - a_1 L) x_t = \varepsilon_t$, or $A(L) x_t = \varepsilon_t$, or $x_t = [A(L)]^{-1}\varepsilon_t$. If necessary, the lag length of the process can be indicated as $A_p(L)$. An ARMA($p$,$q$) process can be written compactly as

$$ A_p(L) x_t = B_q(L)\varepsilon_t. \qquad (7.7) $$

Skipping the indication of lag lengths for convenience, the ARMA model can be written as $x_t = [A(L)]^{-1} B(L)\varepsilon_t$, or alternatively, depending on the context, as $[B(L)]^{-1} A(L) x_t = \varepsilon_t$. Thus, the lag operator works as any mathematical expression. However, whether or not moving the lag operator around results in a meaningful expression is associated with the principles of stationarity and invertibility, known as duality.

7.1.6 Generating Functions

The function $A(L)$ is a convenient way of writing the sequence.
More generally we can refer to any expression of the type $A(L)$ as a generating function. This includes the mean operator, the variance and covariance operators, etc. Generating functions summarize a lot of information about sequences in a compact way and are an important tool in time series analysis. Their main advantage is that they save time and make the expressions much simpler, since a number of mathematical operations can be applied to generating functions. As an example, given certain conditions concerning the sum $\sum a_i$, we can invert $A(L)$, so that $A(L)^{-1}A(L) = 1$. The generating function for the lag operator is

$$ D(z) = \sum_{i=0}^{k} d_i z^{i}, \qquad (7.8) $$

evaluated at $z = L$, where $d_i$ is generated by some other function. The point here is that it is often easier to do manipulations on $D(L)$ directly than on each individual element in the expression. In the example above, we would refer to $A(L)x_t$ as the generating function of $x_t$.

A property of generating functions is that they are additive. If we have two series, $a_i$ and $b_i$ with $i = 0, 1, 2, \dots$, and define a third series as $c_i = a_i + b_i$, it then follows that

$$ C(L) = A(L) + B(L). \qquad (7.9) $$

Another property is that of convolution. Take the series $a_i$ and $b_i$ from above; a new series $d_i$ can then be defined by

$$ d_i = a_0 b_i + a_1 b_{i-1} + a_2 b_{i-2} + \dots + a_i b_0 = \sum_{h=0}^{i} a_h b_{i-h}. \qquad (7.10) $$

In this case we write $D(L)$ as

$$ D(L) = A(L)B(L). \qquad (7.11) $$

The results stated in this section should be compared with chapter 19, below, which shows how long-run multipliers, etc. can be derived from the lag operator.

7.1.7 The Difference Operator

Given the definition of the lag operator (or the backward shift operator) the difference operator ($\Delta$) is defined as

$$ \Delta = 1 - L, \qquad (7.12) $$

which for a variable $x_t$ leads to $\Delta x_t = (1 - L)x_t = x_t - x_{t-1}$. Notice that in time series statistics the difference operator is usually denoted with $\nabla$. In practice the $\Delta$-symbol denotes taking first differences of a discrete variable.
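The additivity and convolution properties (7.9) to (7.11) translate directly into operations on coefficient lists. The sketch below is a minimal illustration with hypothetical helper names; a lag polynomial $a_0 + a_1 L + \dots + a_p L^p$ is stored as the list `[a0, a1, ..., ap]`.

```python
def poly_add(a, b):
    """Coefficients of C(L) = A(L) + B(L), as in (7.9)."""
    n = max(len(a), len(b))
    a = list(a) + [0] * (n - len(a))
    b = list(b) + [0] * (n - len(b))
    return [ai + bi for ai, bi in zip(a, b)]

def poly_mul(a, b):
    """Coefficients of D(L) = A(L)B(L): d_i = sum_h a_h b_{i-h}, as in (7.10)."""
    d = [0] * (len(a) + len(b) - 1)
    for h, ah in enumerate(a):
        for j, bj in enumerate(b):
            d[h + j] += ah * bj
    return d

def apply_lag_poly(coeffs, x):
    """Return A(L)x_t for every t at which all required lags are available."""
    p = len(coeffs) - 1
    return [sum(c * x[t - i] for i, c in enumerate(coeffs))
            for t in range(p, len(x))]

delta = [1, -1]                    # the difference operator 1 - L
delta2 = poly_mul(delta, delta)    # (1 - L)(1 - L)
print(delta2)

x = [1, 4, 9, 16, 25]              # x_t = t^2 for t = 1..5
print(apply_lag_poly(delta2, x))   # same as differencing twice
print(apply_lag_poly(delta, apply_lag_poly(delta, x)))
```

Applying the product polynomial once gives the same series as applying the two factors one after the other, which is what makes it convenient to manipulate $D(L)$ directly rather than the individual elements.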
For a continuous variable, taking first differences implies taking the derivative with respect to time. If $x(t)$ is a continuous time stochastic variable,

$$ Dx = dx/dt, \qquad (7.13) $$

where $D = d/dt$. Differences of higher order are denoted in the same way as for the lag operator. Thus for the second difference of $x_t$ we write

$$ \Delta^{2} x_t = (1 - L)^{2} x_t = (1 - 2L + L^{2}) x_t = x_t - 2x_{t-1} + x_{t-2}. \qquad (7.14) $$

Higher order differences are given as $\Delta^{d} x_t = (1 - L)^{d} x_t$. Notice the difference between the operators $\Delta^{d} x_t$ and $\Delta_{s} x_t$. The first is the conventional difference operator, the second is the seasonal difference operator, such that

$$ \Delta_{s} x_t = (1 - L^{s}) x_t = x_t - x_{t-s}. $$

The subscript $s$ indicates the interval over which we take the (seasonal) difference. If $x_t$ is quarterly, setting $s = 4$ leads to the yearly changes in the series. This new series can then be differenced by using the difference operator,

$$ \Delta^{d}\Delta_{s} x_t = \Delta^{d}(1 - L^{s}) x_t. $$

7.1.8 Filters

The generating functions take us to the concept of filters. If $x_t$ is an AR($p$) process, then the autoregressive part of this model can be thought of as a filter such that if we multiply $x_t$ with $A_p(L)$ the result is a white noise process. In the same way, given a white noise series $e_t$ and some filter $B(L)$, $B(L)e_t = y_t$ generates the series $y_t$. Alternatively, think of $S(t)$ as the seasonal component of the series $x_t$, or in other words the seasonal filter. Multiplying $x_t$ by $S(t)$, or in a linear relation subtracting $S(t)x_t$ from $x_t$, gives a deseasonalised variable. Thus, in this context the term filter is a broad concept, which indicates that we can transform series in different ways. From white noise we can produce ARIMA processes, or we can extract certain components out of a series.

7.1.9 Dynamics and Stability

Given the parameters of an autoregressive process we may ask if the process is stationary or not.
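Before turning to stability, the ordinary and seasonal difference operators of section 7.1.7, which are simple examples of filters, can be sketched in code. The function name `difference` and the artificial quarterly series are assumptions made for illustration only.

```python
def difference(x, d=1, s=1):
    """Apply (1 - L^s) to the series x, d times in succession.

    s = 1 gives the ordinary difference, s = 4 the seasonal difference
    for quarterly data. Each pass shortens the series by s observations.
    """
    out = list(x)
    for _ in range(d):
        out = [out[t] - out[t - s] for t in range(s, len(out))]
    return out

# Artificial quarterly series: linear trend plus a fixed seasonal pattern.
season = [10, 0, -5, 3]
x = [2 * t + season[t % 4] for t in range(12)]

yearly_change = difference(x, d=1, s=4)   # seasonal difference removes the seasonal pattern
print(yearly_change)                      # constant: the trend adds 2 per quarter, 8 per year
print(difference(yearly_change))          # differencing once more leaves zeros
```

In this deterministic example the seasonal filter removes the seasonal component and turns the linear trend into a constant, and a further first difference removes the constant as well.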
Starting from a steady state solution, will a shock to the process, given its parameters, result in an explosion of the series, in infinite growth or in a temporary deviation from the steady state? The answers to these questions are given by analysing the roots of the polynomial given by the autoregressive process $A(L)$. An autoregressive process can always be expressed as a stochastic difference equation, and we can deal with it in the same way as with a normal difference equation. Starting from $A(L)y_t = \varepsilon_t$, subtracting $y_{t-1}$ from both sides leads to the difference equation $\Delta y_t = A^{*}(L)y_{t-1} + \varepsilon_t$. The solution of this equation is

$$ y_t = y_p + y_c, \qquad (7.15) $$

where $y_p$ represents the particular solution, the long-run steady state equilibrium or the stationary long-run mean of $y_t$, and $y_c$ represents the complementary solution, the deviation from the long-run steady state. Dynamic stability requires that $y_c$ vanishes as $T \to \infty$. The roots of the polynomial $A(L)$ tell us if this occurs. Given a change in $\varepsilon_t$, what will happen to $y_{t+1}, y_{t+2}, \dots, y_{t+\infty}$? Will $y_{t+\infty}$ explode, continue to grow for ever, or change temporarily until it returns to the steady state equilibrium described by $y_p$?

The roots are given by solving for the $r$:s in the following equation,

$$ r^{p} + a_1 r^{p-1} + a_2 r^{p-2} + \dots + a_p = 0. \qquad (7.16) $$

This equation leads to the latent roots of the polynomial. The condition for stability, when using the latent roots, is that the roots should be less than unity in absolute value, or "that the roots should be inside the unit circle". Roots equal to unity, so called unit roots, imply an evergrowing series (stochastic trend), and roots greater than unity imply an explosive process. Complex roots suggest that the adjustment is cyclical. Though not very likely, the process could follow an explosive cyclical path or display cyclical permanent shocks. If the process is stationary, $y_t$ will return to its stationary long-run mean following a shock. The roots can be complex, indicating cyclical behavior.
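For a second order process the latent roots can be found with the quadratic formula, which gives a compact way of illustrating the cases described above. The sketch assumes the process is written as $y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \varepsilon_t$, so that in the notation of (7.16) $a_1 = -\phi_1$ and $a_2 = -\phi_2$; the function names and the numerical tolerance are our own choices.

```python
import cmath

def latent_roots_ar2(phi1, phi2):
    """Latent roots of y_t = phi1*y_{t-1} + phi2*y_{t-2} + e_t.

    Solves r^2 - phi1*r - phi2 = 0, i.e. (7.16) with a1 = -phi1, a2 = -phi2.
    """
    disc = cmath.sqrt(phi1 * phi1 + 4.0 * phi2)
    return ((phi1 + disc) / 2.0, (phi1 - disc) / 2.0)

def classify(roots, tol=1e-9):
    """Classify the process by its largest latent root in absolute value."""
    largest = max(abs(r) for r in roots)
    if largest < 1.0 - tol:
        kind = "stationary"
    elif largest <= 1.0 + tol:
        kind = "unit root"
    else:
        kind = "explosive"
    cyclical = any(abs(r.imag) > tol for r in roots)
    return kind + (", cyclical adjustment" if cyclical else "")

for phi1, phi2 in [(0.5, 0.3), (1.5, -0.5), (1.0, -0.5), (1.2, 0.1)]:
    roots = latent_roots_ar2(phi1, phi2)
    print((phi1, phi2), classify(roots))
```

The four parameter pairs are chosen to give, in turn, two real roots inside the unit circle, one root exactly equal to unity, a pair of complex roots inside the unit circle, and a real root outside it.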
The case with one or several "unit roots" is of particular interest because it represents stochastic growth in a non-stationary variable. Series with one or more unit roots are also called integrated series. Many economic time series appear to have a unit root, or roots close to unity.

Using latent roots to define stability is common, but it is not the only way to define stability. Latent roots, or eigenvalues, are motivated by the fact that they are easier to work with when matrix algebra is used. An alternative way of defining stability is to solve for the roots ($\lambda$) in the following equation,

$$ 1 + a_1\lambda + a_2\lambda^{2} + \dots + a_p\lambda^{p} = 0, \qquad (7.17) $$

where $\lambda$ denotes a root. If the roots are greater than unity in absolute value, $|\lambda| > 1$ ("lie outside the unit circle"), the process is stationary; if the roots are less than unity the process is explosive. The historical literature on time series uses both definitions; however, latent roots, or eigenvalues, are now the established "standard".

7.1.10 Fractional Integration

7.1.11 Building an ARIMA Model. The Box-Jenkins Approach

The Box-Jenkins approach is a practical way of finding a suitable ARMA representation of a given time series. The steps are:

1) Identification. Determine (i) if seasonal differencing is necessary to remove seasonal factors, (ii) the number of times the series needs to be differenced to achieve stationarity, and (iii) study the ACF and the PACF to determine a suitable order of the ARMA process.

2) Estimation. The identification step (1) leads to a stationary series and (2) narrows down the possible ARMA($p$,$q$) processes of interest to estimate. Methods of estimation? Remember problems with t-values?!

3) Testing. Test the estimated model(s) for white noise residuals, using the Box-Pierce test for autocorrelation. Among models with white noise residuals, pick the one with the smallest information criterion (AIC, BIC). Differences among information criteria?
This leads quickly to a forecast model, or a representation of an expectations-generating mechanism that can be used in simple (rational) expectations modeling. Univariate ARIMA models have limitations: most economic problems are multivariate, and variables depend on each other. Furthermore, the test procedure is only aimed at finding a forecast model. To build an econometric model that can be used for inference, the demands on testing are higher.

7.1.12 Is the ARMA model identified?

The parameters of an ARMA model might not be unique. To see the conditions for uniqueness, decompose the polynomials of the ARMA process $A(L)y_t = B(L)\varepsilon_t$ into their factors [6] as
\[ A(L) = \prod_{i=1}^{p} (1 - \lambda_i L), \tag{7.18} \]
and
\[ B(L) = \prod_{j=1}^{q} (1 - \mu_j L). \tag{7.19} \]
[6] If $A(L)$ contains the polynomial $1 - L$, the process is said to have a unit root.

For a unique representation of the ARMA process there should be no "common factors", that is, no factor $(1 - \lambda_m L)$ of $A(L)$ that equals a factor $(1 - \mu_k L)$ of $B(L)$. If there is a common factor, it is possible to take any other polynomial $C(L)$ of finite order ($< p$) and multiply both sides of the ARMA process such that
\[ C(L)A(L)y_t = C(L)B(L)\varepsilon_t \tag{7.20} \]
leads to
\[ A^{*}(L)y_t = B^{*}(L)\varepsilon_t. \tag{7.21} \]
Thus, in the case of a common factor there is no unique representation of the parameters in $A(L)$ and $B(L)$.

7.2 Theoretical Properties of Time Series Models

7.2.1 The Principle of Duality

There is a link between AR and MA models, as the presentation of the lag operator indicated. An AR process with an infinite number of lags can under certain conditions be rewritten as a finite MA process. In a similar way, an infinite moving average process can be inverted to an autoregressive process of finite order. These results have two practical implications. The first is that in practical modelling a long MA process can often be rewritten as a shorter AR process instead, and the other way around. The second implication is that the two processes are complementary to each other.
The combination of AR and MA into ARMA will lead to relatively parsimonious models, meaning models with quite few parameters. In fact, it is quite uncommon to find ARMA models above the order $p = 2$ and $q = 2$.

The AR(1) process, $y_t = a_1 y_{t-1} + \varepsilon_t$, can be written as $(1 - a_1L)y_t = \varepsilon_t$, and in the next step as $y_t = (1 - a_1L)^{-1}\varepsilon_t$. The term $(1 - a_1L)^{-1}$ represents the sum of an infinite moving average process,
\[ y_t = (1 - a_1L)^{-1}\varepsilon_t = \sum_{i=0}^{\infty} b_i\varepsilon_{t-i} = B(L)\varepsilon_t, \]
where $b_0 = 1$ and $B(L)$ is of infinite order; for the AR(1) case the weights are $b_i = a_1^{i}$. In the same way, an MA(1) process, $y_t = \varepsilon_t - b_1\varepsilon_{t-1}$, can be written as an infinite autoregressive process, $\sum_{i=0}^{\infty} a_i y_{t-i} = A(L)y_t = \varepsilon_t$, with $A(L)$ of infinite order. These transformations can be generalized to AR(p) and MA(q) processes, as well as to vector processes.

The question is, when are these transformations meaningful? An AR process can always be inverted, but it will only have a (meaningful) summable MA representation if it is stationary. Another way to state this condition is to say that the (latent) roots of $A(L) = 0$ should be less than unity in absolute value (inside the unit circle). An MA process, on the other hand, is always stationary, since $\varepsilon_t$ is by definition a stationary process. However, an MA process can only be inverted if the latent roots of the polynomial $B(L) = 0$ are less than unity in absolute value, that is, inside the unit circle. (Notice that we refer to the latent roots; if we switch to the "ordinary" roots the requirement is that they should be outside the unit circle, larger than one in absolute value. See Section 7.1.9 for the definitions of inside and outside the unit circle.) Thus, an MA process is always stationary, but only invertible if the latent roots of $B(L)$ are inside the unit circle. An AR process is always invertible to an infinite MA process, but only stationary if the latent roots of $A(L)$ are inside the unit circle.
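The AR(1) inversion above can be checked numerically with a minimal sketch. It assumes a zero pre-sample value, so the infinite sum truncates exactly at the start of the sample; the helper names `ar1_path` and `ma_representation` are hypothetical.

```python
import random

def ar1_path(a1, shocks):
    """Generate y_t = a1*y_{t-1} + e_t recursively, with a zero pre-sample value."""
    y, prev = [], 0.0
    for e in shocks:
        prev = a1 * prev + e
        y.append(prev)
    return y

def ma_representation(a1, shocks):
    """Build the same path as y_t = sum_i b_i * e_{t-i} with weights b_i = a1**i."""
    return [sum(a1 ** i * shocks[t - i] for i in range(t + 1))
            for t in range(len(shocks))]

rng = random.Random(0)
shocks = [rng.gauss(0.0, 1.0) for _ in range(50)]
recursive = ar1_path(0.8, shocks)
moving_average = ma_representation(0.8, shocks)
max_gap = max(abs(u - v) for u, v in zip(recursive, moving_average))
print(max_gap)        # numerically zero: the two representations coincide
print(0.8 ** 50)      # weight on a shock 50 periods back: close to zero
```

With $|a_1| < 1$ the weights $a_1^{i}$ shrink geometrically, which is what makes the infinite moving average summable; with $a_1 = 1$ every past shock keeps a weight of one.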
The invertibility of an AR process has one interesting implication: it is often convenient to rewrite an AR or a VAR in moving average form and investigate the properties and consequences of non-stationarity from the MA representation. The conditions are similar, and actually more general, for multivariate processes, such that $\mathrm{VAR}(p) \Longleftrightarrow \mathrm{MA}(q)$.

7.2.2 Wold's decomposition theorem

Linear ARIMA models are reasonably good approximations to many empirical time series processes. A theoretical result which suggests why ARIMA models are useful approximations is offered by Wold's decomposition theorem, Wold (1954). The theorem says that any covariance stationary process can be uniquely represented as the sum of two uncorrelated processes, $x_t = d_t + y_t$, where $d_t$ is a linearly deterministic process and $y_t$ is an infinite moving average process, MA($\infty$). Thus, we can write $x_t$ as
\[ x_t = d_t + \sum_{j=0}^{\infty} b_j e_{t-j}, \]
where $b_0 = 1$, $\sum_{j=0}^{\infty} b_j^{2} < \infty$, and $e_t$ is stationary (white noise) such that $E(e_t) = 0$, $E(e_t^{2}) = \sigma^{2}$ and $E(e_t e_{t-j}) = 0$ for $j \neq 0$.

The theorem has two implications. The first is that any series which appears to be covariance stationary can be modeled as an infinite MA process. Given the principle of duality, we can expect to find a finite autoregressive process as well. Since many economic time series are covariance stationary after first differencing, we expect ARMA models, as well as linear autoregressive distributed lag models, to work quite well for these series. The second implication is that we should be able to extract a white noise process out of any covariance stationary process. This leads to the conclusion that finding (or constructing) a white noise process in an empirical model is a basic necessity in the modeling process, because most economic time series are covariance stationary after differencing.

The presentation above has focused on the practical side of time series modelling. Time series can also be described and analysed theoretically.
Consider the AR(1) model $y_t = a_1 y_{t-1} + \varepsilon_t$. The series $y_t$ is generated by the parameter $a_1$, the white noise process $\varepsilon_t$ and some initial value $y_0$ at the beginning of time, say $t = 0$. Thus, given an initial value, a parameter $a_1$ and a random number generator that generates $\varepsilon_t \sim N(0, \sigma^{2})$, where for simplicity we can set $\sigma^{2} = 1$, it becomes possible to generate possible series of $y_t$ using the Monte Carlo technique. The different outcomes of the series $y_t$ can then be used to estimate the distribution of $\hat{a}_1$, to learn how to do inference in small and medium-sized samples, and to understand the distributions as $a_1 \to 1.0$.

We can also calculate the mean and the variance of $y_t$. The observations of $y_t$ are not independent, since the series is autoregressive. Therefore, the sample mean and variance of the observed $y_t$ are not informative for describing the series. Instead, look at the mean of the zero-mean (no constant) AR(1) process in the form of the expected value, $E(y_t) = E(a_1 y_{t-1}) + E(\varepsilon_t)$. The left-hand side tells us that the right-hand side represents the mean of $y_t$. The expected value of a white noise process is by definition zero, so $E(\varepsilon_t) = 0$. Since $a_1$ is a given constant, we have for the other term $E(a_1 y_{t-1}) = a_1 E(y_{t-1})$. To find an answer we need to substitute for the lags $y_{t-1}, y_{t-2}$, etc. For the first lag, by substitution, we get $a_1 E(y_{t-1}) = a_1 E(a_1 y_{t-2} + \varepsilon_{t-1}) = a_1^{2} E(y_{t-2})$. Substitute one more time, $a_1^{2} E(y_{t-2}) = a_1^{2} E(a_1 y_{t-3} + \varepsilon_{t-2}) = a_1^{3} E(y_{t-3})$. As we continue substituting backwards we will end up with the initial value. Later we will examine the case of minus infinity. Since the initial value can be seen as a constant, we get as the final product $a_1^{t} E(y_0) = a_1^{t} y_0$. (Recall that the expected value of a constant is equal to the constant.) If the initial value is set to zero, it follows that $a_1^{t} y_0 = 0$, and that the mean of $y_t$ is $E(y_t) = a_1^{t} E(y_0) = 0$.
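The Monte Carlo idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not a full experiment: the sample size, the number of replications and the seed are arbitrary choices, the shocks are standard normal, the initial value is zero, and $a_1$ is estimated by OLS without a constant. All function names are hypothetical.

```python
import random

def simulate_ar1(a1, T, rng, y0=0.0):
    """One realisation y_0, ..., y_T of y_t = a1*y_{t-1} + e_t with e_t ~ N(0, 1)."""
    y = [y0]
    for _ in range(T):
        y.append(a1 * y[-1] + rng.gauss(0.0, 1.0))
    return y

def ols_ar1(y):
    """OLS estimate of a1 in y_t = a1*y_{t-1} + e_t (no constant)."""
    num = sum(y[t] * y[t - 1] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    return num / den

def monte_carlo(a1, T=100, reps=500, seed=12345):
    """Collect the OLS estimates of a1 across independent replications."""
    rng = random.Random(seed)
    return [ols_ar1(simulate_ar1(a1, T, rng)) for _ in range(reps)]

estimates = monte_carlo(0.5)
mean_hat = sum(estimates) / len(estimates)
spread = (sum((e - mean_hat) ** 2 for e in estimates) / len(estimates)) ** 0.5
print(mean_hat, spread)   # centre and dispersion of the simulated estimates
```

Rerunning `monte_carlo` with values of $a_1$ closer to one, or with a smaller $T$, shows how the simulated distribution of the estimate changes with the persistence of the process and the sample size.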
It is standard to assume that the initial value is zero in this type of analysis. What happens if $y_t$ has a mean different from zero, and if the initial value is different from zero? The answer is simple: as long as we can assume that the AR process is stationary, so that the initial value is a constant, there are no problems. Under these conditions, a non-zero mean can be represented by a constant parameter in the AR process, such as $y_t = \alpha_0 + a_1 y_{t-1} + \varepsilon_t$. The expected value of $y_t$ is $E(y_t) = E(\alpha_0) + a_1 E(y_{t-1}) + E(\varepsilon_t)$, which means that the right-hand side is $\alpha_0 + a_1 E(y_{t-1})$. Again we need to substitute backwards, leading to $\alpha_0 + a_1 E(\alpha_0 + a_1 y_{t-2} + \varepsilon_{t-1}) = \alpha_0 + a_1\alpha_0 + a_1^{2} E(y_{t-2})$. The next substitution gives $\alpha_0 + a_1\alpha_0 + a_1^{2}\alpha_0 + a_1^{3} E(y_{t-3})$. If we continue substituting back to minus infinity, and set the initial value to zero, we get
\[ E(y_t) = \alpha_0(1 + a_1 + a_1^{2} + a_1^{3} + \ldots) = \alpha_0\sum_{i=0}^{\infty} a_1^{i} = \frac{\alpha_0}{1 - a_1}. \]
The last step is simply an application of the solution to an infinite geometric series, which works in this case as long as the AR process is stationary, $|a_1| < 1$. It is important that you understand the use of the expectations operator in this example, because the technique is frequently used to derive a number of results.

We could have reached the result in a simpler way if we had used the lag operator. Take the expectation of both sides of $(1 - a_1L)y_t = \alpha_0 + \varepsilon_t$. The lag operator is a deterministic factor, so the result is $E(y_t) = (1 - a_1L)^{-1}\alpha_0 = \alpha_0/(1 - a_1)$, since the lag of a constant is the constant itself. Again, the right-hand side is the sum of an infinite series. If there is no constant, $\alpha_0 = 0$, it follows immediately that $E(y_t) = 0$.

What is the variance of the process $y_t$? The answer is given by understanding that, for the zero-mean process, $E(y_t y_t) = Var(y_t) = \sigma_y^{2}$. Thus, start from the AR(1) process and multiply both sides with $y_t$ to get $y_t y_t = a_1 y_t y_{t-1} + y_t\varepsilon_t$. Next, take expectations of both sides, $E(y_t y_t) = a_1 E(y_t y_{t-1}) + E(y_t\varepsilon_t)$, and substitute $y_t y_{t-1} = (a_1 y_{t-1} + \varepsilon_t)y_{t-1} = a_1 y_{t-1}^{2} + \varepsilon_t y_{t-1}$ and $y_t\varepsilon_t = (a_1 y_{t-1} + \varepsilon_t)\varepsilon_t$.
From this we have $a_1^{2} E(y_{t-1}^{2})$ and $a_1 E(\varepsilon_t y_{t-1}) + E(\varepsilon_t^{2})$. In the latter expression we have by definition that $E(\varepsilon_t y_{t-1}) = 0$ (recall the basic assumptions of OLS) and that $E(\varepsilon_t^{2}) = \sigma_{\varepsilon}^{2}$. Putting the results together, and using that $E(y_{t-1}^{2}) = \sigma_y^{2}$ for a stationary process,
\[ E(y_t y_t) = a_1^{2} E(y_{t-1}^{2}) + \sigma_{\varepsilon}^{2}, \]
\[ \sigma_y^{2} = a_1^{2}\sigma_y^{2} + \sigma_{\varepsilon}^{2}, \]
\[ \sigma_y^{2}(1 - a_1^{2}) = \sigma_{\varepsilon}^{2}, \]
\[ \sigma_y^{2} = \frac{\sigma_{\varepsilon}^{2}}{1 - a_1^{2}}. \]
The technique is the same for any AR(p) process. From the calculation of the variance we can also see the value of the autocovariance and the autocorrelation coefficients, say at lag $k$. Multiply both sides of the process with $y_{t-k}$ and solve $E(y_t y_{t-k}) = a_1 E(y_{t-1} y_{t-k}) + E(\varepsilon_t y_{t-k})$. From this it follows that the autocovariance is
\[ \gamma_k = a_1^{k}\,\frac{\sigma_{\varepsilon}^{2}}{1 - a_1^{2}}. \]
The autocorrelation is simply $\rho_k = \gamma_k/\gamma_0 = a_1^{k}$. From this expression it is obvious that the autocorrelation function for the AR(1) process dies out geometrically as the lag length $k$ increases, and slowly when $a_1$ is close to one. Calculating the mean, variance, autocovariances and autocorrelations for AR(1), AR(2), MA(1) and MA(2) processes are standard exercises in time series courses, followed by an investigation of the unit root case $a_1 = 1$. To be completed...

7.3 Additional Topics

7.3.1 Seasonality

Seasonality is an inherent characteristic of most time series data. Seasonality can be dealt with in three ways. The first is to use seasonal dummy variables. The second method is to use seasonal differencing. The third method is to use a program called X12 (previously X11). All methods suffer from the fact that efficient estimation of seasonal effects requires a lot of observations, which is rare in most applied econometric time series work. Econometricians tend to use seasonal dummies, since they are easy to use and lead to transparency in the model. Seasonal differencing is the standard method in the Box-Jenkins approach. For a quarterly series, seasonal differencing implies differencing in the following way: $(1 - L^{4})y_t = y_t - y_{t-4} = \Delta_4 y_t$. The corresponding operator for monthly data is $(1 - L^{12})$.
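The seasonal difference operator can be sketched as a one-line function. The name `seasonal_difference` and the toy quarterly series below are assumptions made for illustration only.

```python
def seasonal_difference(y, s):
    """Apply (1 - L**s): return y_t - y_{t-s}; the first s observations are lost."""
    return [y[t] - y[t - s] for t in range(s, len(y))]

# Toy quarterly series: a linear trend plus a fixed seasonal pattern.
season = [10, 0, -5, 3]
y = [t + season[t % 4] for t in range(12)]

print(seasonal_difference(y, 4))   # -> [4, 4, 4, 4, 4, 4, 4, 4]
```

The fixed seasonal pattern cancels because each observation is compared with the same quarter one year earlier; what remains is the growth over four quarters. For monthly data the same function would be called with `s=12`.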
In econometrics, the assumption of seasonal unit roots is difficult to test. There are few clear-cut examples of such processes in the literature, and the tests for seasonal unit roots are quite complex, especially given the limited samples in econometrics. Thus, econometricians tend to use seasonal differencing when dummy variables do not work. Otherwise, including lags at seasonal frequencies will usually take care of seasonal effects.

Finally, X12 can be described as a state-of-the-art tool, or as a black box, where you send in seasonal data and out comes a deseasonalised series. X12 is a respected method, and X12 or some similar program is frequently used to deseasonalise public statistics. Removing seasonality by seasonal differencing, seasonal dummies or X12 does not affect the presence of one or more unit roots in the series. The Dickey-Fuller test and other tests for unit roots work as before. X12 is a program designed for univariate analysis, meaning that if seasonality is removed from single series by X12 prior to modeling a system, seasonality can still be left in a multivariate single equation model or in a system of equations. The problem with X12 is its black box nature: the econometrician loses some control over the modeling process. Some care in the use of X12 is recommended.

7.3.2 Non-stationarity

(To be completed) Differencing until stationarity is the standard Box-Jenkins approach, though it is a bit ad hoc. In econometrics the approach is to test first, but only reject the null of integration of order one in the case of strong evidence against it. Alternatives include linear deterministic trends, polynomial trends, etc. These are dangerous, since they risk spurious detrending under the maintained hypothesis of integrated variables.

7.4 Aggregation

The following section offers a brief discussion of the problems of aggregation.
The interested reader is referred to the literature to learn more [Wei (1990) is a good textbook with many references on the subject, see also Sjöö (1990, ch. 4)]. Aggregation of series means aggregation over agents and markets, or aggregation over time. The stock of money, measured by M3 at the end of the month, represents an aggregation over individuals. A series like aggregate consumption in the national accounts represents an aggregation over both individuals and time. Aggregation over time is usually referred to as temporal aggregation.

Money holdings is a stock variable, which can be measured at any point in time. Temporal aggregation of a stock variable implies picking observations at larger intervals, using say a money series measured at the end of each quarter instead of at the end of each month. Consumption, on the other hand, is a flow variable: it cannot be measured at a point in time, only as the sum of consumption over a given period. Temporal aggregation in this case implies taking the sum of consumption over intervals. The distinction is of importance because the effects of temporal aggregation are different for stock and flow variables.

Aggregation, both over time and over individuals, can change the functional form of the distribution of the variables, and it can affect the residual variance and t-values. Exactly how aggregation changes a model varies from situation to situation. There are, however, some general conclusions regarding temporal aggregation which we will repeat in this section. In many situations there is little we can do about these problems, except working with continuous time models and/or selecting series with a low degree of temporal aggregation. That the problems are hard to deal with is no excuse for forgetting or hiding them, as is done in many textbooks in econometrics. The area of aggregation is an interesting challenge for econometricians, since it has not been explored as much as it deserves.
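The difference between the two kinds of temporal aggregation can be made concrete with a minimal sketch. The function names are hypothetical, and the convention of keeping the end-of-period value for a stock variable follows the end-of-quarter example in the text.

```python
def aggregate_stock(y, m):
    """Systematic sampling: keep the last observation in each block of m periods."""
    return [y[i + m - 1] for i in range(0, len(y) - m + 1, m)]

def aggregate_flow(y, m):
    """Flow aggregation: sum the observations within each block of m periods."""
    return [sum(y[i:i + m]) for i in range(0, len(y) - m + 1, m)]

monthly = list(range(1, 13))          # twelve monthly observations: 1, 2, ..., 12

print(aggregate_stock(monthly, 3))    # -> [3, 6, 9, 12]   (end-of-quarter values)
print(aggregate_flow(monthly, 3))     # -> [6, 15, 24, 33] (quarterly sums)
```

Here `m = 3` is the degree of temporal aggregation, that is, the number of original periods per aggregated period. Any incomplete block at the end of the sample is dropped.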
An interesting example of the consequences of aggregation is given in Christiano and Eichenbaum (1987). They show how one can get extremely different results by using discrete time models with yearly, quarterly and monthly data compared with a continuous time model. They tried to estimate the speed of adjustment in the stock of inventories in the U.S. national accounts. Using a continuous time model, they estimated the average time for closing 95% of the gap between the desired and the actual stock of inventories to be 17 days. The discrete models predicted much slower adjustment. Using monthly data the result was 46 days, with quarterly data 7 months, and with yearly data 5 1/2 years!

Aggregation also becomes an important problem if we have a theory that describes the stochastic behavior of a variable which we would like to test with empirical data. There are many results, in macroeconomics and finance, that predict that series should follow a random walk, or be the outcome of a martingale process. There are several factors to consider if we would like to estimate a process suggested by theory. An example is Hall (1978) who, from a life cycle hypothesis, derived that private consumption should follow an AR(1) process, and be a random walk under the assumption of rational expectations. The first factor is that of temporal aggregation. An additional complication is adjustment costs, which will also affect the "original" model. If private consumption, as an example, is defined as an AR(1) model, temporal aggregation changes it to an ARMA(1,1) model, and the existence of adjustment costs will then transform it to an ARMA(2,1) model. Temporal aggregation, adjustment costs and measurement errors are factors which can affect the structure of the model and the size of the estimated parameters. To this list one could also add problems of seasonal factors, trends and "hidden periodicity".
Hidden periodicity is a problem because the larger the temporal aggregation, the more difficult it is to get a correct estimate of parameters that reflect cycles which are not timed with the sampling interval. Therefore, one should be critical of papers which try to prove that some empirical series behaves like a theoretical process. Is it possible for the author to control all of these factors?

For a flow variable with an ARIMA representation, the outcome of temporal aggregation depends on hidden periodicity, which, if it exists, can affect both the AR and the MA process. In general, aggregation will complicate the structure of the ARIMA model. A simple AR model becomes an ARMA model. But as aggregation becomes larger, the structure of the model becomes simpler.

For a stock variable the consequences are clearer. An ARIMA($p, d, q$) process of a stock variable becomes, after temporal aggregation, an ARIMA($p, d, s$) process, where $s \leq \mathrm{integer}[(p + d) + (q - p - d)/m]$, and where $m$ is the degree of temporal aggregation, or in other words the systematic sampling interval. As a rule of thumb it can be assumed that temporal aggregation adds one to the order of the MA process. Since differencing is a form of temporal aggregation, taking higher and higher differences of a series will create an MA process. This can be seen in any time series program that produces ACFs and PACFs. The more one differences a series, the more clearly will the series look like an MA process. Thus, it follows that observing an MA process in the identification step of the Box-Jenkins approach can be a sign of over-differencing.

The expression holds even for an ARMA model, where $d = 0$. For an ARIMA model, as $m$ gets larger, the model turns towards an IMA($d$, $d - 1$) process. Thus, for $d = 1$, we end up with a random walk model. This is an interesting result for two reasons. First, since the random walk model often seems to fit macroeconomic and especially financial time series quite well, could that be the outcome of having too large sampling intervals?
Second, the result explains the findings in Christiano and Eichenbaum (1987), that larger sampling intervals lead to slower and slower adjustment speeds in inventories. The larger the sampling interval, the more inventories seemed to behave like a random walk. As a consequence, historical shocks further and further back in history seemed more and more important. In the end, in the random walk model, all historical shocks have the same importance and there would be no adjustment at all.

Temporal aggregation will also affect prediction. The general result is that aggregation reduces the efficiency of the forecasts, and that the relative loss of efficiency is larger for a non-stationary series than for a stationary one. (Remember that most macroeconomic series are non-stationary.)

It is also worth mentioning some conclusions concerning causality. When dealing with stock variables, aggregation will not affect the direction of causality if there is a clear causality from one variable to another. It will, however, weaken the estimated strength of the relationship and can therefore lead to wrong conclusions from Granger non-causality tests. For flow variables, on the other hand, temporal aggregation turns a one-directional causality into what will appear to be a two-sided causality. In this situation a clear warning is in order.

Finally, we also look at the aggregation of two random variables, $\tilde{X}_t$ and $\tilde{Y}_t$. Suppose that they are two independent stationary processes with mean zero,
\[ E[\tilde{X}_t \mid y_t] = E[\tilde{Y}_t \mid x_t] = 0. \tag{7.22} \]
The autocovariances of $\tilde{X}_t$ and $\tilde{Y}_t$ are
\[ cov(x_t, x_{t-k}) = \gamma_{x,k}, \tag{7.23} \]
\[ cov(y_t, y_{t-k}) = \gamma_{y,k}. \tag{7.24} \]
The sum of $\tilde{X}_t$ and $\tilde{Y}_t$ is
\[ \tilde{Z}_t = \tilde{X}_t + \tilde{Y}_t, \tag{7.25} \]
which will have an autocovariance equal to
\[ \gamma_{z,k} = \gamma_{x,k} + \gamma_{y,k}. \tag{7.26} \]
In general, we can write this in the following way: if
\[ \tilde{X}_t \sim ARMA(p, m), \tag{7.27} \]
and
\[ \tilde{Y}_t \sim ARMA(q, n), \tag{7.28} \]
and
\[ \tilde{Z}_t = \tilde{X}_t + \tilde{Y}_t, \tag{7.29} \]
then
\[ \tilde{Z}_t \sim ARMA(x_1, x_2), \tag{7.30} \]
where $x_1 \leq p + q$ and $x_2 \leq \max(p + n, q + m)$.
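The order bounds in (7.30) and the additivity of autocovariances in (7.26) can be sketched as follows. The function names are hypothetical. The second function assumes a stationary AR(1) process observed with an independent white noise error, and uses the AR(1) autocovariance $\gamma_k = a_1^{k}\sigma_{\varepsilon}^{2}/(1 - a_1^{2})$.

```python
def arma_sum_order_bounds(p, m, q, n):
    """Upper bounds (x1, x2) for Z = X + Y, with X ~ ARMA(p, m) and Y ~ ARMA(q, n)."""
    return p + q, max(p + n, q + m)

def acov_ar1_plus_noise(a1, k, var_shock=1.0, var_noise=1.0):
    """Autocovariance at lag k of an AR(1) plus independent white noise, by (7.26)."""
    gamma_x = a1 ** k * var_shock / (1.0 - a1 ** 2)   # AR(1) part
    gamma_y = var_noise if k == 0 else 0.0            # white noise part
    return gamma_x + gamma_y

print(arma_sum_order_bounds(2, 0, 0, 0))   # AR(2) plus white noise -> (2, 2)

g0, g1, g2 = (acov_ar1_plus_noise(0.5, k) for k in range(3))
print(g1 / g0)   # first autocorrelation: below a1 = 0.5 because of the noise
print(g2 / g1)   # ratio of successive autocovariances: back to a1 = 0.5
```

The first autocorrelation is damped by the noise while later autocovariances still decay at the rate $a_1$. This is the pattern of an ARMA(1,1) process rather than a pure AR(1), in line with the bound returned by the first function.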
As an example, think of a series which is measured with a white noise error. That is, the true series is added to a white noise series. If the true series is AR($p$), then the result of this aggregation will be an ARMA($p, p$) process.

We can conclude this section by stating that aggregation leads to loss of information, which, if the aggregation is large, might fool us into assuming that the random walk is the appropriate model. The extent to which aggregation leads us to wrong conclusions has not been established yet. This is partly because we need better data on shorter time intervals than what is available. Remember that ignoring problems is not a way of solving them. One way of dealing with the problems of aggregation is to use continuous time econometric techniques instead; see Sjöö (1993) for a discussion and further references.

7.5 Overview of Single Equation Dynamic Models

The autoregressive process represents a basic way of modeling time series. As complexity and multivariate processes are introduced, the AR model transforms into a system of equations, where it becomes possible to give the parameters a structural (economic) interpretation. In principle, we have the following types of equation models, where $\epsilon_t \sim NID(0, \sigma^{2})$.
1. Autoregressive models, AR($p$): $A(L)y_t = \epsilon_t$;
2. Moving average models, MA($q$): $y_t = \theta(L)\epsilon_t$;
3. ARMA($p, q$) models: $A(L)y_t = \theta(L)\epsilon_t$ (+ ARIMA);
4. Distributed lag models, DL($p$): $y_t = B(L)x_t + \epsilon_t$;
5. Autoregressive distributed lag models, ADL($p$): $A(L)y_t = B(L)x_t + \epsilon_t$;
6. ARMA models with exogenous explanatory variables, ARMAX (ARIMAX): $A(L)y_t = B(L)x_t + \theta(L)\epsilon_t$;
7. Rational distributed lag models, RDL: $y_t = \frac{B(L)}{A(L)}x_t + \theta(L)\epsilon_t$;
8. Transfer functions: $y_t = \frac{B(L)}{A(L)}x_t + \frac{\theta(L)}{\phi(L)}\epsilon_t$.
Notice that the transfer function is also a rational distributed lag, since it contains a ratio of two lag structures. Also, (7) and (8) can be viewed as distributed lag models, since $D(L) = [B(L)/A(L)]$.
Notice that rational distributed lag models require some information about $B(L)$ to be workable. Imposing restrictions on the lag structure $B(L)$ in distributed lag models leads to further models:
9. Geometric lag structure (= Koyck), where $B(L)$ is assumed to decline according to some exponential function.
10. Polynomial distributed lag (PDL) models (= Almon lags), where $B(L)$ declines according to some polynomial function, decided a priori.
11. All other types of a priori restrictions on $B(L)$ not covered by (9) and (10). [7]
12. The error correction model. This model embraces all of the above models as special cases. The following explains why this is so.

[7] Restrictions are put on the lag process to make the estimation more efficient. A priori restrictions can be motivated by a limited sample and by multicollinearity that affects the estimated standard errors of the individual lags. These types of restrictions are not used anymore. Today, it is recognized that it is more important to focus on information criteria, white noise residuals and building a well-defined statistical model, instead of imposing restrictions that might not be valid.

Introduction to Error Correction Models

Economic time series are often non-stationary: their means and variances change over time. The trend component in the data can be either deterministic or stochastic, or a combination of both. Fitting a deterministic trend assumes that the data series grows at a fixed rate each period. This is seldom a good way of characterizing trends in economic time series. Instead they are better described as containing stochastic trends with a drift. The series might be growing over time, but it is not possible to predict whether it grows or declines in the next period. Variables with stochastic trends can be made stationary by taking first differences. This type of variable is called integrated of order 1, where the order of integration is determined by the number of times the variable needs to be differenced before it becomes stationary.

A necessary condition for fitting trending data in an econometric model is that the variables share the same trend, otherwise there is no meaningful long-run relationship between them. [8] Testing for co-integration is a way of testing if the data have a common trend, or if they tend to drift apart as time increases. The simplest way to test for cointegration is the so-called Engle and Granger two-step procedure. The test implies determining whether the data contain stochastic trends and, if so, testing if there are common trends. If $x_t$ and $y_t$ are two variables with stochastic trends that become stationary after first differencing, cointegration can be tested by running the following co-integrating regression,
\[ y_t = \alpha + \beta x_t + \epsilon_t. \tag{7.31} \]
If both $y_t$ and $x_t$ are integrated variables of the same order, a necessary condition for a statistically meaningful long-run relationship is that the residual term ($\epsilon_t$) is stationary. If that is the case, the error term from the regression can be seen as temporary deviations from the long run, and $\alpha$ and $\beta$ can be viewed as estimates of the long-run steady-state relation between $x$ and $y$.

[8] The exception is tests of the efficient market hypothesis, and related tests of rational expectations. See Appendix A in Sjöö and Sweeney (1998) and Sjöö (1998).

A general way of building a model of time series, without imposing ad hoc a priori restrictions, is the autoregressive distributed lag model. For two variables we have
\[ A(L)y_t = B(L)x_t + \epsilon_t, \tag{7.32} \]
where the lags are given by $A(L) = \sum_{i=0}^{k} a_i L^{i}$ and $B(L) = \sum_{i=0}^{k} b_i L^{i}$. The first coefficient in $A(L)$ is set to unity, $a_0 = 1$. The lag length is chosen such that the error term becomes a white noise process, $\epsilon_t \sim NID(0, \sigma^{2})$. The long-run solution of this model is given by
\[ y_t = \beta x_t + \epsilon_t, \tag{7.33} \]
where $\beta = B(1)/A(1)$, that is, $B(L)/A(L)$ evaluated at $L = 1$.
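The Engle and Granger two-step procedure described above can be sketched in pure Python under simplifying assumptions: the data are simulated (a random walk $x_t$ and a $y_t$ that is cointegrated with it by construction), the second step uses a Dickey-Fuller regression on the residuals without a constant and without lagged differences, and all names, parameter values and the seed are hypothetical. In applied work the second step is normally augmented with lagged differences, and the resulting t-ratio must be compared with the special critical values tabulated for residual-based cointegration tests, not with standard t or Dickey-Fuller tables.

```python
import random

def ols_line(x, y):
    """Step 1: OLS of y on a constant and x; returns (alpha, beta, residuals)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = my - beta * mx
    resid = [yi - alpha - beta * xi for xi, yi in zip(x, y)]
    return alpha, beta, resid

def df_t_stat(u):
    """Step 2: t-ratio of rho in  Delta u_t = rho * u_{t-1} + v_t."""
    lag = u[:-1]
    diff = [u[t] - u[t - 1] for t in range(1, len(u))]
    sll = sum(l * l for l in lag)
    rho = sum(l * d for l, d in zip(lag, diff)) / sll
    ssr = sum((d - rho * l) ** 2 for l, d in zip(lag, diff))
    s2 = ssr / (len(diff) - 1)            # residual variance, one parameter
    return rho / (s2 / sll) ** 0.5

rng = random.Random(1)
x = [0.0]
for _ in range(499):
    x.append(x[-1] + rng.gauss(0.0, 1.0))                 # random walk, I(1)
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]    # cointegrated with x

alpha_hat, beta_hat, resid = ols_line(x, y)
t_stat = df_t_stat(resid)
print(beta_hat)   # close to the long-run coefficient 2.0 used in the simulation
print(t_stat)     # strongly negative: the residuals look stationary
```

The residual series `resid` from the first step is the candidate error correction term for a dynamic model of the two variables.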
Without loss of generality we can use the difference operator, $\Delta x_t = x_t - x_{t-1}$, to rewrite the autoregressive distributed lag model as an error correction model,
\[ \Delta y_t = \sum_{i=0}^{k} \pi_i \Delta x_{t-i} + \sum_{i=1}^{k} \delta_i \Delta y_{t-i} + \gamma\, ECM_{t-1} + \epsilon_t, \tag{7.34} \]
where the error correction mechanism is given by $ECM_{t-1} = (\beta x_{t-1} - y_{t-1})$. The latter term can be said to represent the deviation from the long-run steady-state relation between the two variables. It is convenient to think of the $ECM_t$ variable at the first lag, controlling the long-run path of the dependent variable. Asymptotically it will not matter at which lag the $ECM_t$ is placed. In a multivariate model and for a finite sample, though, it might make a difference, and a seasonal lag might work better. [9] Furthermore, for an ECM to work well in a model it should not display any signs of seasonal effects or extreme outliers. These effects should be removed when the $ECM_t$ is constructed. The $\gamma$-parameter of the error correction term indicates how changes in $y_t$ react to deviations from the long-run equilibrium.

[9] For comparison see the discussion of seasonality, earlier in this text.

When modeling integrated variables, rewriting the system as a (vector) error correction model is a natural step. However, error correction models work with stationary data series. Assuming costly adjustment generally leads to partial adjustment models, which are better written in the less restrictive error correction form. Optimal control theory, approximations to structural systems in continuous time, etc. will also lead to error correction models; see Hendry, Pagan and Wickens (1982), Hendry (1995), or ch. 2 in Banerjee et al. (1993).

If $x_t$ and $y_t$ contain stochastic trends, it is necessary that they are co-integrated for the ADL model to make sense in the long run. For instance, if the variables are co-integrated, the error term from the co-integrating regression ($\epsilon_t$ above) can be used as the error correction mechanism. This was shown in Engle and Granger (1987). If there is cointegration there is an ECM formulation, the reason being that cointegration implies Granger causality in at least one direction.

The advantage of the error correction model is that it does not put a priori restrictions on the model and that it separates long-run and short-run effects. It has proven to be a very efficient way to model various economic relations, like money demand, consumption, etc. It should be recognized that the early literature on EC models tended to overlook the problem of weak exogeneity. With the developments in the field of multivariate cointegration, it has been shown that when the same EC expression determines more than one variable, there are cross-equation restrictions between the co-integrating parameters. These restrictions imply that error correction expressions have to be estimated within complete systems, not by single-equation OLS.

Multivariate Model Survey

Multivariate models are introduced later. For the time being we can conclude our listing of models with the following:
Vector autoregressive models, VAR.
Vector autoregressive moving average processes, VARMA. The VAR and the VARMA represent multivariate ARIMA models.
Vector error correction models, VECMs.
Structural vector autoregressive models, SVAR.
Systems of structural equations estimated using system estimators.
Structural vector error correction models.
The latter represent the final step, where a complete system of interacting variables is modeled and given an (economic) structural interpretation.

8. MULTIPLIERS AND LONG-RUN SOLUTIONS OF DYNAMIC MODELS.

Given an autoregressive or distributed lag structure $A(L)$, $B(L)$ or $D(L)$, the long-run static solution of the model is found by setting $L = 1$. The intuition is that in the long run there will be no changes in the explanatory variables, and it will not matter if we explain $y_t$ by, say, $x_t$ and/or $x_{t-i}$.
The conditional mean of $y_t$ in an ADL model, for example, is
\[ E_t\{y_t\} = \bar{y}_t = A(L)^{-1}B(L)x_t. \tag{8.1} \]
The mean path of $y_t$ is therefore
\[ y = \frac{B(1)}{A(1)}\,x, \tag{8.2} \]
where $A(1) = (1 - a_1 - a_2 - \ldots - a_r)$ and $B(1) = (b_0 + b_1 + b_2 + \ldots + b_j)$. In a distributed lag model we would have
\[ y = D(1)x. \tag{8.3} \]
Now a unit change (easier if in percent) in $x$ leads to a new equilibrium,
\[ y = D(1)(x + 1). \tag{8.4} \]
The total effect of a change in $x_t$ is given by the sum of the coefficients in $D(L)$ when $L = 1$. If there are $m$ lags in $D(L)$, the total multiplier is
\[ D(1) = (\delta_0 + \delta_1 + \delta_2 + \ldots + \delta_m) = \sum_{j=0}^{m} \delta_j. \tag{8.5} \]
It is also possible to think of the total multiplier as an infinite sum of coefficients which die out slowly in the long run. The impact multiplier is associated with the first parameter in $D(L)$, which is $\delta_0$. Thus, taking $\delta_0 x_t$ gives you the impact multiplier, the first period's effect following a change (a shock) in $x_t$. The $J$:th interim multiplier ($\tau_J$) is the sum of the coefficients up to and including the $J$:th lag,
\[ \tau_J = \sum_{j=0}^{J} \delta_j. \tag{8.6} \]
It is common to standardize the $J$:th interim multiplier in the following way,
\[ \tau_J^{*} = \Big[\sum_{j=0}^{J} \delta_j\Big] / D(1), \tag{8.7} \]
such that it represents the share of the total multiplier up until the $J$:th lag. The mean lag is given as
\[ \mu = \Big[\sum_{j=0}^{m} j\,\delta_j\Big] / \Big[\sum_{j=0}^{m} \delta_j\Big]. \tag{8.8} \]
Notice that $m$ could be equal to infinity if we have a stable model with stationary variables, such that the infinite sum of $\delta_j$ converges to a constant in the long run. The mean lag can be derived in a more sophisticated way, by differentiating $D(L)$ with respect to $L$ and then dividing by $D(1)$. That is,
\[ D(L) = \delta_0 + \delta_1 L + \delta_2 L^{2} + \ldots + \delta_s L^{s}, \tag{8.9} \]
and
\[ D'(L) = \delta_1 + 2\delta_2 L + 3\delta_3 L^{2} + \ldots + s\,\delta_s L^{s-1}. \tag{8.10} \]
By dividing $D'(1)$ by $D(1)$ we get, as a general result for ADL models,
\[ \mu = \frac{D'(1)}{D(1)} = \frac{B'(1)}{B(1)} - \frac{A'(1)}{A(1)}. \tag{8.11} \]
Finally we have the median lag, representing the number of periods required for 50% of the total effect to be achieved.
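The multipliers defined above can be computed from a list of distributed lag coefficients, as in the following minimal sketch. The function name, the dictionary keys and the example coefficients are hypothetical. The median lag is taken here as the first lag at which the standardized interim multiplier reaches 50%, a discrete reading of the definition that assumes all coefficients have the same sign.

```python
def multipliers(delta):
    """Summary measures for D(L) with coefficients delta[0], ..., delta[m]."""
    total = sum(delta)                          # total multiplier D(1), eq. (8.5)
    interim, running = [], 0.0
    for d in delta:
        running += d
        interim.append(running)                 # interim multipliers, eq. (8.6)
    shares = [v / total for v in interim]       # standardized version, eq. (8.7)
    mean_lag = sum(j * d for j, d in enumerate(delta)) / total   # eq. (8.8)
    median_lag = next(j for j, s in enumerate(shares) if s >= 0.5)
    return {"impact": delta[0], "total": total, "interim": interim,
            "shares": shares, "mean_lag": mean_lag, "median_lag": median_lag}

# Hypothetical lag coefficients delta_0, ..., delta_3.
result = multipliers([2, 4, 3, 1])
print(result["total"])        # total multiplier: 10
print(result["shares"])       # cumulative shares of the total effect
print(result["mean_lag"])     # (0*2 + 1*4 + 2*3 + 3*1) / 10 = 1.3
print(result["median_lag"])   # half of the total effect is reached at lag 1
```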
The median lag is obtained by solving

$$\frac{\sum_{i=0}^{j}\delta_i}{D(1)} = 0.50 \tag{8.12}$$

for $j$. Sometimes the median lag is approximated by choosing the $j$:th interim multiplier in the middle of the lag structure.

9. VECTOR AUTOREGRESSIVE MODELS

The extension of ARIMA modeling into a multivariate framework leads to Vector Autoregressive (VAR) models, Vector Moving Average (VMA) models and Vector Autoregressive Moving Average (VARMA) models. In economics, since most variables display autocorrelation and are cross-correlated, VAR models are an interesting choice for modeling economic systems.

Vector models can be constructed using techniques similar to those for single-variable ARIMA models. The autocorrelation and partial autocorrelation functions can be extended to display cross-correlations among the variables in the system. However, when modelling more than two variables, these cross-autocorrelation and cross-partial-autocorrelation functions quickly turn into complex matrix expressions for each lag.[1] Thus, the cross-correlation functions are not practical tools to work with.

The advantage of using VARs is that they represent a statistical description of the economy. When using ARIMA on univariate series, in many situations the combination of AR and MA processes turns out to be an efficient way of finding a stochastic representation of a process. VAR models are usually effective in modeling multivariate systems, and can be used to make forecasts and dynamic simulations of different shocks to the system. These shocks can come from policy, from productivity or from basically anywhere in the economy, and the shocks can be assumed to be transitory or permanent.

The main complicating factor is that in order to understand what shocks and simulations actually mean it is necessary to identify the underlying economic relations among the variables.
To make VAR models work for economic analysis it is necessary to impose some restrictions on the residual covariance matrix of the VAR. Thus, there is no free lunch here in terms of avoiding a discussion of causality and simultaneity problems. It is necessary to point out the latter because in the early history of VAR models it seemed as if VAR models could be used without economic theory, but that was built on a misunderstanding.

In econometrics the focus is on finding a parsimonious VAR representation with NID residuals. Let $x_t$ be a $p$-dimensional vector of stochastic time series variables, represented as the $k$:th order VAR model,

$$x_t = \sum_{i=1}^{k} A_i x_{t-i} + e_t, \quad \text{or} \quad A(L)x_t = e_t, \tag{9.1}$$

where $A_i$ is the matrix of coefficients of lag number $i$, so that $A = \sum_{i=0}^{p} A_i$, where $A_0$ is a diagonal matrix, and $e_t$ is a vector of white noise residual terms.

Notice that all variables across all equations have the same lag length ($k$). This is so because it makes it possible to estimate the system with OLS. If the lag order is allowed to vary, the VAR must be estimated with the seemingly unrelated regressions (SUR) method.

A VAR model can be inverted into its vector moving average (VMA) form as

$$x_t = \sum_{i=0}^{\infty} C_i e_{t-i} = C(L)e_t.$$

The MA form is convenient for analysing the properties of a VAR and for investigating the consequences of shocks to the system. Estimation, however, is usually done in the VAR form, and is straightforward since each equation can be estimated individually with OLS. The lag length ($k$) of the autoregressive process is chosen such that the estimated residual process, in combination with constants, trend, dummy variables and seasonals, becomes a white noise process in each equation. The idea is that the lag length is equal for all variables in all equations.

[1] See Wei (1989) for a presentation of the Box-Jenkins technique in a multivariate framework.
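The point that a VAR with the same regressors in every equation can be estimated equation by equation with OLS can be illustrated with a small simulation. This is only a sketch under stated assumptions: a bivariate VAR(1) with made-up coefficients `A1`, standard normal errors, a constant added to the regression, and hypothetical variable names throughout.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed (illustrative) VAR(1) coefficient matrix; both eigenvalues lie inside the unit circle.
A1 = np.array([[0.5, 0.1],
               [0.2, 0.3]])
T, p = 2000, 2

# Simulate x_t = A1 x_{t-1} + e_t with standard normal errors.
x = np.zeros((T, p))
for t in range(1, T):
    x[t] = A1 @ x[t - 1] + rng.standard_normal(p)

Y = x[1:]                                          # left-hand side, t = 1..T-1
Z = np.hstack([np.ones((T - 1, 1)), x[:-1]])       # constant and first lag of every variable

# OLS on the whole system at once ...
coef_joint, *_ = np.linalg.lstsq(Z, Y, rcond=None)
# ... and OLS equation by equation give the same numbers,
# because every equation has the same regressors.
coef_by_eq = np.column_stack(
    [np.linalg.lstsq(Z, Y[:, j], rcond=None)[0] for j in range(p)]
)

A1_hat = coef_joint[1:].T                          # rows = equations, columns = lagged variables
resid = Y - Z @ coef_joint
Sigma_hat = resid.T @ resid / (len(Y) - Z.shape[1])  # residual covariance matrix
print(A1_hat)
print(Sigma_hat)
```

In applied work one would add further lags, deterministic terms and residual diagnostics; the sketch only shows the mechanics of the estimator.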
A second order VAR of dimension $p$ with a constant is

$$\begin{bmatrix} x_{1t} \\ x_{2t} \\ \vdots \\ x_{pt} \end{bmatrix} = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix} + A_1 \begin{bmatrix} x_{1,t-1} \\ x_{2,t-1} \\ \vdots \\ x_{p,t-1} \end{bmatrix} + A_2 \begin{bmatrix} x_{1,t-2} \\ x_{2,t-2} \\ \vdots \\ x_{p,t-2} \end{bmatrix} + \begin{bmatrix} e_{1t} \\ e_{2t} \\ \vdots \\ e_{pt} \end{bmatrix},$$

where each $A_i$ is a $p \times p$ coefficient matrix with rows $(a_{11}, a_{12}, \dots, a_{1p})$, $(a_{21}, a_{22}, \dots, a_{2p})$, and so on.

VAR models were strongly advocated by Sims (1980) as a response to what he described as "incredible restrictions" imposed on standard structural econometric models. Up until the mid-1980s, empirical time series econometrics was dominated by the estimation of "text-book equations". Researchers simply took an equation from theory, estimated it, and did not pay much attention to whether the model and the data actually fitted each other. Typically, dynamic lag structures were treated in a very ad hoc way. Sims argued that it would be better to find a statistical model which described the data series and their interaction as well as possible. Once the statistical model was there, it could be used to forecast and simulate the economy. In particular, it would according to Sims be possible to analyse the effects of various policy changes.

Sims' critique is related to the "Lucas critique". Lucas showed how, in a world of rational expectations, it was not possible to understand estimated parameters in structural econometric models as (deep) structural behavioral or policy parameters. Since agents form their behavior on plans building on forecasts of variables, not on historical outcomes of variables, the estimated parameters based on historical observations become a mixture of behavioral parameters and forecast-generating parameters. Further, under rational expectations, econometric models could not be used to analyze policy changes, because a change in policy would by definition lead to a change in the parameters of the system. Sims therefore argued for VAR models as a statistical description of the economy, under given policy rules. The effects of surprise changes in policy variables could then be analysed in the reduced form.
VAR models represent the reduced form of an underlying structural model. This can be seen by starting from a general (but not necessarily identified) structural model, and rewriting it in reduced form. As an example, start from the bivariate model,

$$y_t = \alpha_1 + a_{11}x_t + b_{11}y_{t-1} + b_{12}x_{t-1} + \varepsilon_{1t} \tag{9.2}$$

$$x_t = \alpha_2 + a_{21}y_t + b_{21}y_{t-1} + b_{22}x_{t-1} + \varepsilon_{2t} \tag{9.3}$$

This system can be rewritten in reduced form by substituting for $x_t$ and $y_t$ on the RHS of the equations,

$$y_t = \pi_1 + \pi_{11}y_{t-1} + \pi_{12}x_{t-1} + e_{1t} \tag{9.4}$$

$$x_t = \pi_2 + \pi_{21}y_{t-1} + \pi_{22}x_{t-1} + e_{2t}. \tag{9.5}$$

The equations form a bivariate VAR model of order one. The residuals of the VAR model (the reduced form) contain the residuals and the parameters ($a_{11}$ and $a_{21}$) of the structural model. The reduced system can be estimated by applying OLS to each equation.[2]

The parameters of the VAR relate to the structural model as

$$\pi_{11} = \frac{b_{11} + a_{11}b_{21}}{1 - a_{11}a_{21}}, \quad \text{etc.}$$

Thus, the parameters of the VAR are complex functions of some underlying structural model, and as such they are on their own quite uninteresting for economic analysis. It is the lag structure, and sometimes its signs, that are more interesting. The two residuals in this VAR are

$$e_{1t} = \frac{\varepsilon_{1t} + a_{11}\varepsilon_{2t}}{1 - a_{11}a_{21}} \tag{9.6}$$

and

$$e_{2t} = \frac{a_{21}\varepsilon_{1t} + \varepsilon_{2t}}{1 - a_{21}a_{11}}. \tag{9.7}$$

These residuals are both white noise terms, but they are correlated with each other whenever the coefficients $a_{11}$ or $a_{21}$ are different from zero.

The generalization of the structural system above, setting $z_t = \{y_t, x_t\}$, is

$$Bz_t = \Gamma_0 + \Gamma_1 z_{t-1} + \varepsilon_t, \tag{9.8}$$

where, in terms of the bivariate model,

$$B = \begin{bmatrix} 1 & -a_{11} \\ -a_{21} & 1 \end{bmatrix}, \quad \Gamma_0 = \begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix} \quad \text{and} \quad \Gamma_1 = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}.$$

If both sides of (9.8) are multiplied by $B^{-1}$ the result is

$$z_t = B^{-1}\Gamma_0 + B^{-1}\Gamma_1 z_{t-1} + B^{-1}\varepsilon_t = \Pi_0 + \Pi_1 z_{t-1} + e_t, \tag{9.9}$$

where $\Pi_0 = B^{-1}\Gamma_0$, $\Pi_1 = B^{-1}\Gamma_1$ and $e_t = B^{-1}\varepsilon_t$. This shows that the VAR model is a reduced form of an underlying structural model, where the structural dependence is 'hidden' in the covariance matrix of the error terms.

VAR models are estimated in their AR form, $A(L)y_t = e_t$.
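The mapping from the structural to the reduced form can be checked numerically. The sketch below uses made-up values for $a_{11}$, $a_{21}$ and the lag coefficients, and assumes uncorrelated structural errors with unit variance; all names are hypothetical. It verifies the expression for $\pi_{11}$ and shows that the reduced-form residuals are correlated even though the structural ones are not.

```python
import numpy as np

# Illustrative structural parameters (assumed values, not estimates).
a11, a21 = 0.4, 0.3
b11, b12, b21, b22 = 0.5, 0.1, 0.2, 0.3

B = np.array([[1.0, -a11],
              [-a21, 1.0]])
Gamma1 = np.array([[b11, b12],
                   [b21, b22]])

B_inv = np.linalg.inv(B)
Pi1 = B_inv @ Gamma1                       # reduced-form lag coefficients

# Closed-form expression for pi_11 obtained by substitution.
pi11_formula = (b11 + a11 * b21) / (1.0 - a11 * a21)

# Structural shocks assumed uncorrelated with unit variances.
Sigma_eps = np.eye(2)
Sigma_e = B_inv @ Sigma_eps @ B_inv.T      # covariance of e_t = B^{-1} eps_t

print(Pi1[0, 0], pi11_formula)
print(Sigma_e)                             # off-diagonal element is non-zero
```

Setting both `a11` and `a21` to zero in this sketch makes `Sigma_e` diagonal, which is the case in which the reduced-form residuals are uncorrelated.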
VAR models can be inverted and analysed in their MA form, $y_t = C(L)e_t$. Besides predictions, VAR models are used for three types of analysis: Granger non-causality tests, forecast error variance decomposition and impulse response analysis. Granger non-causality tests deserve a chapter of their own and are therefore discussed in a following chapter. The other two techniques are typical VAR methods that make use of the MA form.

Forecast Error Variance Decomposition. The forecast error variances are explained in terms of the history of each variable. This analysis tells us how strong the influence is among the variables of the system. It gives the proportion of movements in a sequence (of $y_i$) that is due to 'own' shocks and the proportion due to shocks in other variables. If these other variables have little influence on the investigated variable, they will contribute little to the forecast error variance. Variables that are exogenous will be little affected by the other variables.

Impulse response analysis. This is a graphic or numerical presentation of a simulation of the system's response to an unexpected shock in one variable in the system. A typical example is to study how the economy, and real GDP, reacts to an unexpected change in the money supply under the assumption of rational expectations. Typical questions to ask are: how long does it take for a shock in $y_t$ or $x_t$ to die out, will there be an effect at all, will it be positive or negative, will it die out smoothly or through fluctuations? We can ask if shocks in $y_t$ affect $x_t$, etc. Let the MA form be $y_t = C(L)e_t = \sum_{i=0}^{t} C_i e_{t-i}$, where $C_i$ is the matrix of coefficients for lag $i$.

[2] OLS is as efficient as the seemingly unrelated regressions (SUR) estimator in this case, because the equations contain the same explanatory variables. However, if we set some lags to zero and have a system with different lags in different equations, SUR will be a more efficient estimator than OLS.
In matrix form, for a two-dimensional system,

$$\begin{bmatrix} y_{1t} \\ y_{2t} \end{bmatrix} = \sum_{i=0}^{t} \begin{bmatrix} c_{11,i} & c_{12,i} \\ c_{21,i} & c_{22,i} \end{bmatrix} \begin{bmatrix} e_{1,t-i} \\ e_{2,t-i} \end{bmatrix}. \tag{9.10}$$

Setting $i = 0$ gives the impact multiplier, $C_0$, the initial effect of a shock. The matrix of total, or long-run, multipliers is given by $\sum_{i=0}^{\infty} C_i$. The impulse response functions are given by $C_j$, where $j = 0, \dots, t$.

Both the variance decomposition and the impulse response analysis require that the residual covariance matrix of the VAR is orthogonalized. This is so because the errors $e_t$ are dependent on each other through the $B^{-1}$ matrix. Unless the residuals of the VAR are orthogonalized it will not be possible to identify a shock as a unique shock coming from one specific variable.[3] There are several ways of performing the orthogonalization of the residuals. (In the following we assume that the VAR is made up of stationary variables.) The idea is that restrictions must be put on the covariance matrix of the VAR.

Cholesky decomposition. The Cholesky decomposition represents a purely mathematical way to orthogonalize the residuals, and the result will depend on the ordering of the variables. It is customary to do several different decompositions, by changing the order of the equations in the model, to show the sensitivity to creating the orthogonalization in different ways. In terms of the residual covariance matrix, what the Cholesky decomposition achieves is to make the elements above the diagonal zero. Assume a three-dimensional VAR, $p = 3$, and therefore a $3 \times 3$ covariance matrix,

$$\Sigma = \begin{bmatrix} \sigma_{11}^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_{22}^2 & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_{33}^2 \end{bmatrix}.$$

The outcome of the decomposition is to create the following matrix,

$$\begin{bmatrix} \sigma_{11}^2 & 0 & 0 \\ \sigma_{21} & \sigma_{22}^2 & 0 \\ \sigma_{31} & \sigma_{32} & \sigma_{33}^2 \end{bmatrix}.$$

The problem for identifying the VAR and doing the impulse responses is that the covariance matrix is not diagonal.
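A minimal sketch of orthogonalized impulse responses may help fix ideas. It assumes a bivariate VAR(1), for which the MA matrices are simply powers of the coefficient matrix, and uses NumPy's Cholesky routine to obtain a lower triangular factor $P$ with $PP' = \Sigma$. The coefficient matrix `A1`, the covariance matrix `Sigma` and the function names are illustrative assumptions.

```python
import numpy as np

# Assumed (illustrative) VAR(1) coefficients and residual covariance matrix.
A1 = np.array([[0.5, 0.1],
               [0.2, 0.3]])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])

def ma_coefficients(A1, horizon):
    """MA matrices C_0, ..., C_horizon of a VAR(1), where C_i is the i-th power of A1."""
    C = [np.eye(A1.shape[0])]
    for _ in range(horizon):
        C.append(A1 @ C[-1])
    return np.array(C)

def orthogonalized_irf(A1, Sigma, horizon):
    """Responses to one-standard-deviation orthogonal shocks, C_i @ P with P P' = Sigma."""
    P = np.linalg.cholesky(Sigma)          # lower triangular factor
    return ma_coefficients(A1, horizon) @ P, P

irf, P = orthogonalized_irf(A1, Sigma, horizon=10)

# Same system with the ordering of the two variables reversed.
perm = [1, 0]
irf_rev, P_rev = orthogonalized_irf(A1[np.ix_(perm, perm)], Sigma[np.ix_(perm, perm)], horizon=10)

# Long-run multipliers: the sum of all C_i equals (I - A1)^{-1} for a stable VAR(1).
long_run = np.linalg.inv(np.eye(2) - A1)

print(irf[0])      # impact responses: the first variable does not react to the second shock
print(irf_rev[0])  # reversed ordering: now the other variable is restricted on impact
```

Comparing `irf` and `irf_rev` shows how the zero restriction on impact moves with the ordering, which is why results are usually reported for more than one ordering.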
The Cholesky decomposition builds on the fact that any matrix $P$ with the property that $PP' = \Sigma$ defines orthogonalized residuals $e_t = P^{-1}\varepsilon_t$ whose covariance matrix is diagonal, $e_t \sim (0, I_N)$. The ordering of the equations determines the outcome, and the causal ordering of the residual shocks. With $N = 3$, there are six ($3!$) possible orderings and outcomes, which can be more or less different.

Set up a recursive system. Instead of letting the computer do all of the job, you can set up the matrix $B^{-1}$ so that the residuals form a recursive system, by deciding on an ordering of the equations that corresponds to the ordering and residual correlations created by the Cholesky decomposition. Thus, the residual in equation one is not affected by the other two. (Meaning that $x_{1t}$ is not explained by $x_{2t}$ or $x_{3t}$.) The second residual is only affected by the first residual. And finally, the last (third) residual is affected by residuals one and two.

Econometric programs often include Cholesky decomposition routines in combination with the analysis of VAR models. By changing the ordering of the equations it becomes possible to compare the effects of different recursive orderings of the variables. The problem is that we drown in output as the dimension of the VAR increases.

Structural Autoregressive models (SVAR). If economic theory does not suggest a recursive ordering, use economic theory to impose restrictions on the $B^{-1}$ matrix.

[3] Early VAR modelers did not recognize the need for orthogonalization. Thus papers from the first part of the 1980s must be read with some care.
Imposing theory-based restrictions on $B^{-1}$ in this way leads to Structural Vector Autoregressive (SVAR) models.[4] In practice the approach implies formulating a small structural (static) economic system for the residual process $e_t$. If $y_t$ is a $p$-dimensional system, the error covariance matrix contains a total of $p^2$ parameters, leading to the estimation of $p(p+1)/2$, or $(p^2+p)/2$, parameters, equal to the number of restrictions necessary for the matrix $B^{-1}$. As an example for a three-variable system, the error process could be set up as

$$\begin{aligned} e_{1t} &= \varepsilon_{1t} \\ e_{2t} &= c_{21}\varepsilon_{1t} + \varepsilon_{2t} \\ e_{3t} &= c_{31}\varepsilon_{1t} + c_{32}\varepsilon_{2t} + \varepsilon_{3t}, \end{aligned} \tag{9.11}$$

which happens to be a recursive ordering. Alternatively, the system could look like

$$\begin{aligned} e_{1t} &= \varepsilon_{1t} + c_{13}\varepsilon_{3t} \\ e_{2t} &= c_{21}\varepsilon_{1t} + \varepsilon_{2t} \\ e_{3t} &= c_{31}\varepsilon_{2t} + \varepsilon_{3t}. \end{aligned} \tag{9.12}$$

In both examples the number of restrictions imposed is equal to $(3^2 + 3)/2 = 6$. Behind each equation is some reasoning about the plausible correlation among the variables at time $t$. In each equation there is one white noise residual term with an implicit parameter of unity, leaving three possible parameters ($c_1$, $c_2$, $c_3$) to describe how the shocks in the errors are related.

A more general framework for identifying the VAR is

$$Az_t = A_0 + A_1 z_{t-1} + B\varepsilon_t,$$

where contemporaneous correlations among the variables are captured by $A$, and $B$ takes care of correlations in the residuals such that the covariance matrix of $B\varepsilon_t$ becomes diagonal.

Once the error process is set up in such a way that the errors are orthogonal, it becomes possible to analyze the effects of one specific shock on the system and argue that the shock is unique, coming only from that particular variable. Without orthogonalization the shock can be a mixture of effects from different variables, and not 'a clean' shock.

[4] A fourth approach is offered by Blanchard and Quah (1989), and builds on classifying shocks as temporary or permanent. This approach can be seen as an extension of the SVAR approach to processes including integrated variables with common trends.
One controversy here is that it is up to the econometrician to identify and label the shocks as, for instance, demand or supply shocks. The basis for such labeling might not be strong. Further, by definition, the errors include not only structural relations, but also everything that we do not know or understand about the system. For that reason it might be better to use economic theory to identify structural relations and build conventional econometric models instead, rather than trying to analyse what we do not understand. On the other hand, in a world of rational expectations where the expectations-generating mechanism is unknown, or cannot be modelled, VAR models are the best we can do.

9.0.1 How to estimate a VAR?

First, you think about your system. What is it that you want to explain? How could it be modelled as a recursive system?
Second, you estimate the equations by OLS, with the same lag lengths on all variables across the equations to avoid having to use the SUR estimation technique.
Third, you investigate outliers and shifts and put in the appropriate dummy variables.
Fourth, you try to find a short lag structure and white noise residuals.
Fifth, if you cannot fulfil the fourth step, you minimize an information criterion. In this case AIC is not the best choice; use BIC or something else.

9.0.2 Impulse responses in a VAR with non-stationary variables and cointegration.

The orthogonalization of the residuals can offer some interesting intellectual challenges, especially in the SVAR approach. If the variables in the VAR are integrated variables which are also co-integrating, we are faced with some interesting problems. In the co-integrating VAR model there will be both stationary shocks and permanent shocks, and identifying these two types in the system is not always easy. If the VAR is of dimension $p$, there can be at most $r$ co-integrating vectors, $0 \le r \le p$, and $p - r$ common stochastic trends.
Juselius (2006) ("The Co-integrated VAR Model", Oxford University Press) shows how an identification of the structural MA model, and orthogonalization of the residuals, can be done in terms of both the short run and the long run of the system. The VAR(2), with no constants, trends or other deterministic variables, will have the following VECM representation after finding $r$ co-integrating vectors,

$$\Delta x_t = \Gamma_1 \Delta x_{t-1} + \alpha\beta' x_{t-1} + \varepsilon_t.$$

The MA version of this model is

$$x_t = C\sum_{i=1}^{t}\varepsilon_i + C^{*}(L)\varepsilon_t + x_0,$$

where the first term on the right-hand side represents the stochastic trends in the system and the second term represents the stationary part. The $C$ matrix will then represent all that is not the stationary vectors, and is related to the co-integrating vectors as

$$C = \beta_{\perp}(\alpha_{\perp}'\Gamma\beta_{\perp})^{-1}\alpha_{\perp}',$$

where $\Gamma = I - \Gamma_1$ and $\perp$ denotes the orthogonal complement.

9.1 BVAR, TVAR etc.

VAR models represent statistical descriptions of data series. As such they are a basis for reducing your model and going into more ordinary structural econometric models, such as Vector Error Correction Models (VECMs). Estimating a VAR is then a way of making sure that the final model is a well-defined statistical model, i.e. a model that is consistent with the data chosen.

We have talked about what you can do with the VAR in terms of forecasting, simulations, impulse responses, forecast error decomposition and Granger causality testing. In this context we meet the so-called SVAR, the Structural VAR. There are, however, a number of other VARs that one needs to know about.

The problems of working with VARs are obvious: there is a large number of parameters to be estimated, the estimated parameters might not be stable over time, and there are a number of variables that are not modelled in the VAR because the VAR would get too large to handle. If we want to use the VAR for forecasting we need to address these problems.

To handle the problem of time-varying parameters there are Time-Varying-Parameter VARs (TVP-VARs).
In addition there are various VAR modeling techniques that deal with regime changes: Markov switching VARs, threshold VARs, floor and ceiling VARs, and smooth transition VARs. To work with a large number of variables and reduce the model it is possible to use factor analysis, which takes us to Factor Augmented VARs (FA-VARs). Another approach is to use a priori information about parameters and their distribution, as in Bayesian VARs (BVARs). The latter is a popular approach in many central banks. We can illustrate the problem in the following way. Your model predicts that the inflation rate will vary around 10%, and at the same time you have additional information indicating that inflation will fluctuate around 5 per cent, say because there is a sudden drop in inflation. What do you do? One approach is simply to reduce the constant term and predict changes in inflation around 5 per cent instead. A more ambitious approach is to incorporate more information in your model, from more data, and to place more emphasis on recent observations etc.

Changing the constant is easy and quite normal. As you start walking along the path of making assumptions about the data and the parameters of the model, you might go too far in the other direction. As long as we talk about forecasting, the proof is in the pudding. The best forecast wins, but when we talk about the best policy to achieve goals in the future you have to be much more careful.

The type of VARs we have discussed so far are basically statistical representations of the data. Without further restrictions, and incorporation of long-run steady state relations in the form of co-integrating vectors, their relative predictive ability will be quite poor. Also, the economy is more complex, involving many more variables than the two to six variables that can be handled in a standard VAR. If your model contains fifty or one hundred variables there will be too many lags and coefficients to estimate.
One way of dealing with this problem is to use so-called Bayesian VARs (BVARs). In the BVAR you can use prior information to reduce the number of coefficients you need to estimate. BVARs are popular among many central banks, including both the ECB and the Fed, as a way to construct better and bigger VARs for forecasting.[5]

Finally, remember that the data is the real world, while economic theories are constructions of the human mind (quote from David Hendry). If you want to use a priori information of some kind you might miss what the data, the real world, is trying to tell you.

[5] Gary Koop at the University of Strathclyde has a home page with course material dealing with BVAR models.

Part III

Granger Non-causality Tests

Whether a variable is affected by another in such a way that it can be said to be caused by the other variable is a fundamental question in all sciences. However, to validate empirically that one variable is caused by another variable is problematic in economics, since it is often quite difficult to set up controlled experiments.

Granger (1969), building upon work done by Wiener, was the first to formalize an empirical concept of causality in economics. Granger's basic idea is that the future cannot predict the present or the past. It follows, as a necessary condition, that for one variable ($x_t$) to cause another variable ($y_t$), lagged values of $x_t$ must predict $y_t$. This can be tested with the following vector autoregressive model,

$$y_t = \sum_{i=1}^{k}\alpha_i y_{t-i} + \sum_{i=1}^{k}\beta_i x_{t-i} + e_t, \tag{9.13}$$

where $y_t$ is explained by lagged values of $y_t$ and $x_t$. The lag length ($k$) is determined such that $e_t$ is a white noise process, $e_t \sim NID(0, \sigma^2)$. Alternatively, if you cannot find white noise residuals, minimize an information criterion instead. If the parameters associated with the process $x_t$ are jointly significantly different from zero, so that the null hypothesis $\beta_1 = \dots = \beta_k = 0$ is rejected, then $x_t$ is predicting $y_t$, and $x_t$ can also be said to Granger cause the variable $y_t$.
If, on the other hand, all $\beta$-parameters are zero, $x_t$ cannot predict or cause $y_t$. An $F$-test on the joint significance of the parameters is sufficient in this case. (Alternatively, the test can be set up in the form of a chi-square test, depending mainly on the software you are using.) The $F$-test works by comparing the mean squared errors from the equation above with those from a regression where the $x$'s are excluded. If the inclusion of lagged $x$ variables leads to a significant reduction in the mean squared error, lagged values of $x_t$ are predicting $y_t$ and the variable $x_t$ can be said to Granger cause $y_t$.

Please notice the distinction between prediction and causality, which is important in a policy context. The fact that $\sum_{i=1}^{k}\beta_i$ is significantly different from zero, so that $x_t$ is predicting $y_t$, does not imply that $x_t$ causes $y_t$. It is easy to understand why from the following analogy. A weatherman who predicts rain tomorrow does not cause the rain that might fall tomorrow. This is so no matter how good this person is at predicting tomorrow's weather. This is the reason why the test should always be referred to as a Granger non-causality test and not a test of causality. Based on the assumption that the future cannot predict the present and the past, we can only test whether a variable is not causing another.

Of course, the outcome of the test might be affected by the number of lags chosen in the VAR, and by the variables chosen to be included in the VAR. Though two-variable VARs are common, this is often a crude simplification. The classical example is the effect of real money growth on real GDP growth. In one set-up you might find that monetary policy is effective, but add the interest rate to the VAR and you might find that monetary policy is ineffective.

Finding that $x_t$ Granger causes $y_t$ does not exclude that the reverse is true. Two variables can Granger cause each other.
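The $F$-test described above can be sketched with a small simulation. Everything here is an illustrative assumption: the data-generating process (in which lagged $x$ helps predict $y$ but not the other way round), the inclusion of a constant in the regressions, the lag length, and the helper names `lagmat` and `granger_f`. The statistic compares the residual sum of squares of the regression with and without the lags of the other variable.

```python
import numpy as np

def lagmat(v, k):
    """Matrix with columns v_{t-1}, ..., v_{t-k} for t = k, ..., T-1."""
    T = len(v)
    return np.column_stack([v[k - i:T - i] for i in range(1, k + 1)])

def granger_f(y, x, k):
    """F statistic for H0: the k lags of x do not help to predict y."""
    Y = y[k:]
    const = np.ones((len(Y), 1))
    X_r = np.hstack([const, lagmat(y, k)])   # restricted model: own lags only
    X_u = np.hstack([X_r, lagmat(x, k)])     # unrestricted model: add lags of x

    def rss(X):
        beta = np.linalg.lstsq(X, Y, rcond=None)[0]
        return np.sum((Y - X @ beta) ** 2)

    rss_r, rss_u = rss(X_r), rss(X_u)
    df = len(Y) - X_u.shape[1]               # residual degrees of freedom
    return ((rss_r - rss_u) / k) / (rss_u / df)

# Simulated data: x depends only on its own past, y depends on lagged x as well.
rng = np.random.default_rng(0)
T = 400
x = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()
    y[t] = 0.3 * y[t - 1] + 0.5 * x[t - 1] + rng.standard_normal()

f_xy = granger_f(y, x, k=2)   # tests whether x fails to Granger cause y
f_yx = granger_f(x, y, k=2)   # tests the reverse direction
print(f_xy, f_yx)
```

Each statistic is compared with a critical value from the $F(k, T - 3k - 1)$ distribution, where $T$ is the original sample size; a large value leads to rejection of non-causality in that direction.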
A test of whether $y_t$ Granger causes $x_t$ is performed with the following model,

$$x_t = \sum_{i=1}^{k}\gamma_i x_{t-i} + \sum_{i=1}^{k}\delta_i y_{t-i} + \eta_t, \tag{9.14}$$

where the lag length is the same as before, and $\eta_t \sim NID(0, \omega^2)$. If lagged values of $y_t$ predict $x_t$, $y_t$ is Granger causing $x_t$. In some situations testing the reverse relationship is of no interest. For instance, the inflation rate in a small open economy should not Granger cause the inflation rate of the world.

The main weakness of the Granger non-causality test is the assumption that the error process in the VAR is not only a white noise process, but also a white noise innovation process with respect to all relevant information for explaining the movements of $x_t$ and $y_t$. This is an important issue which is often forgotten in applied work, where bivariate systems are the rule rather than the exception.

Granger's basic definition of non-causality is based on the assumption that all factors relevant for predicting $y_t$ are known. Let $I_t$ represent all relevant information, both past and present, and let $X_t$ be present and past observations on $x_t$, such that $X_t = (x_t, x_{t-1}, x_{t-2}, \dots, x_0)$; $I_{t-1}$ and $X_{t-1}$ represent past observations only. The variable $x_t$ can therefore be said to Granger cause $y_t$ if the mean squared error (MSE) increases when $y_t$ is regressed against the information set from which $X_{t-1}$ is removed. In the bivariate case, this can be stated as

$$MSE(\hat{y}_t \mid I_{t-1}) < MSE[\hat{y}_t \mid (I_{t-1} - X_{t-1})], \tag{9.15}$$

where $\hat{y}_t$ is the predicted value of $y_t$. The problem is to know what should be included in $I_t$. If too many variables are included the degrees of freedom will diminish. If too few variables are included the test might lead to the wrong conclusions. The result of a unidirectional relation from $x_t$ to $y_t$ in a bivariate model might be reversed if a relevant third variable is included in the system. This is a serious limitation of the Granger causality test. A way of reducing the problem is to always perform the tests in a VAR system.
If some variable is to be treated as exogenous in the system, this must be based on strong a priori knowledge.

The Granger non-causality test is sensitive to the spurious regression problem. The $F$-test is unreliable when used on integrated or near-integrated variables, which is the standard situation in economics. However, using only first differences of the variables implies a loss of information. In this situation it is recommended to include error correction terms (or co-integrating vectors) in the VAR to increase the efficiency of the $F$-tests.

There is an interesting relation between cointegration and Granger causality, as shown by Engle and Granger (1987). If a co-integrating relationship is found, it follows that there must exist Granger causality in at least one direction. Tests of cointegration do not make causality tests redundant, since they cannot determine the direction of the causality. However, if no cointegration is found we can conclude that there is no Granger causality either.

10. INTRODUCTION TO EXOGENEITY AND MULTICOLLINEARITY

10.1 Exogeneity

Exogeneity assumptions are necessary in econometric model building. In many situations they are used in an ad hoc way, with variables described as "determined outside the system", or classified as endogenous and predetermined. Based on this classification of the variables in the system, the basic econometric textbook explains how to apply the rank and the order conditions to identify a simultaneous system, and whether it is possible to use OLS or whether a system estimator is necessary.

In this section we introduce three basic concepts of exogeneity that cover (1) estimation and inference, (2) conditional forecasting and simulations, and (3) policy conclusions. The three concepts that allow you to perform these tasks are weak exogeneity, strong exogeneity and super exogeneity.

Consider the following system, in which there is co-integration.
$$\begin{aligned} y_t &= \beta x_t + \varepsilon_{1t} \\ \Delta x_t &= \varepsilon_{2t} \end{aligned}$$

If $\varepsilon_{1t}$ and $\varepsilon_{2t}$ are both stationary it follows that $x_t$ is $I(1)$, and that $y_t$ is $I(0)$ if $\beta = 0$. On the other hand, if $\beta \neq 0$, it follows that $y_t$ is $I(1)$. To estimate $\beta$ it is required that $y_t$ does not simultaneously influence $x_t$. If $y_t$ or $\Delta y_t$ is part of the equation for $x_t$ (and thus embedded in $\varepsilon_{2t}$) the result is that $E(\varepsilon_{1t}\varepsilon_{2t}) \neq 0$, and we can write $\varepsilon_{1t} = \gamma\varepsilon_{2t} + u_t$, where for simplicity we assume that $u_t \sim N(0, \sigma^2)$. Now, if we estimate $\beta$ with OLS, the outcome would be a biased estimate of $\beta$, since $E(x_t\varepsilon_{1t}) = E(x_t(\gamma\varepsilon_{2t} + u_t)) \neq 0$, and we can no longer assume that $x_t$ and $\varepsilon_{1t}$ are independent. This is an example of lack of weak exogeneity. With the first model it is not possible to estimate the parameter of interest $\beta$; the outcome from OLS is a different and biased value.

10.1.1 Weak Exogeneity

Weak exogeneity spells out the conditions under which it is possible to obtain unbiased and efficient estimates. The definition is based on splitting the joint density function into a conditional density and a marginal density function,

$$D_1(y_t, z_t \mid Y_{t-1}, Z_{t-1}; \lambda_1) = D_2(y_t \mid z_t, Y_{t-1}, Z_{t-1}; \lambda_2)\,D_3(z_t \mid Y_{t-1}, Z_{t-1}; \lambda_3), \tag{10.1}$$

where the parameters of interest ($\psi$) are given as $\psi = f(\lambda_1)$, and $Y_{t-1}$ and $Z_{t-1}$ are matrices of the finite historical values of these variables. The conditions under which it is possible to estimate the parameters of interest by modeling only the conditional density are that $\lambda_2$ and $\lambda_3$ should be variation free, and that there are no cross restrictions between the parameters of $\lambda_2$ and $\lambda_3$. In practical situations, using stationary data, this comes down to judging whether the error terms of the marginal and conditional models are correlated.[1] (If the data series are integrated the question becomes one of long-run independence between the two residual processes.)

Three important conclusions follow from the definition above.
The first is that whether a variable is exogenous or not depends on the parameters of interest. An OLS regression will always lead to estimates of some kind, but what is their meaning? To understand the regression we identify parameters of interest that relate to other variables through the (not modelled) marginal density functions. Thus, exogeneity must be stated in terms of parameters of interest, i.e. a variable $z_t$ is weakly exogenous for the parameters of interest.

Second, it is difficult to test for weak exogeneity. Most existing tests fail, with the exception of Johansen's test for weak exogeneity of the variables in the co-integrating vectors.[2] The purpose of an exogeneity test is mainly to find an argument for not specifying the marginal model. However, the definition of weak exogeneity tells us that this is not possible. A test will need the estimated marginal model, otherwise it will not work. But when the marginal model is estimated (and tested for misspecification) the work is already done, so the only thing left is to compare the results.

The third conclusion is that it is not possible to state that a variable like US inflation is determined outside a model for inflation in Zambia, or that rainfall is determined outside an agricultural model. If these variables enter the system in terms of expectations, it might be necessary to specify the stochastic process that generates these expectations in the model to get unbiased and efficient estimates of the parameters of interest.

10.1.2 Strong Exogeneity

Strong exogeneity spells out the conditions for conditional forecasting and simulations of a model with variables that are not modelled. The condition is weak exogeneity and that the marginal model should not depend on the endogenous variable.
Thus the marginal process must be

$$ D_3(z_t \mid Y_{t-1}, Z_{t-1}; \lambda_3) = D_3(z_t \mid Z_{t-1}; \lambda_3), \tag{10.2} $$

meaning that the history of $y_t$ does not enter the marginal process, so $z_t$ can be forecast first and $y_t$ then forecast conditionally on it.

[1] The condition of no correlation between the error terms is easily understood if we assume that $\{y_t, z_t\}$ is a bivariate normal process. Set up the density function, and determine the condition under which it is possible to estimate the parameters of interest from the conditional model only.

[2] Regarding Johansen's test, it is important to remember that it is model dependent. The test is performed conditionally on the short-run dynamics of the variables included in the system, the dummy variables and the specification of the deterministic trend.

10.1.3 Super Exogeneity

Super exogeneity determines the conditions for using the estimated parameters for policy decisions. The condition is weak exogeneity plus the requirement that the parameters of the conditional model are stable with respect to changes in the marginal model. For instance, if the money supply rule changes, the parameters of the marginal process will also change. If this also leads to changes in the parameters of the conditional model, the conditional model cannot be used to analyse the implications of policy changes. Thus, super exogeneity defines the situations in which the Lucas critique is not valid.

10.2 Multicollinearity and understanding of multiple regression.

Multicollinearity has to do with how we understand the estimated parameters. Study the following model,

$$ y_t = \beta_0 + \beta_1 x_t + \beta_2 z_t + \varepsilon_t. $$

The estimated parameters of this model are analysed under the assumption that there is no correlation between the explanatory variables. The parameter $\beta_1$ is understood as the effect on $y_t$ of a unit change in $x_t$ while holding the other variables in the model ($z_t$) constant. In the same way $\beta_2$ measures the effect on $y_t$ of a unit change in $z_t$ while $x_t$ is held constant.
Another way of expressing this is the following: $E\{y_t \mid z_t\} = \beta_1 x_t$ and $E\{y_t \mid x_t\} = \beta_2 z_t$, which tells us that the effect of one parameter cannot be analysed in isolation from the rest of the model. The effect of $z_t$ in the model is not on $y_t$ in itself; it is on $y_t$ conditional on $x_t$. Holding, say, $x_t$ constant in the model while $z_t$ is free to vary implies that we study the effect on $y_t$ after "removing" the effects of $x_t$ on $y_t$. If $x_t$ and $z_t$ are correlated, it is not possible to keep one of them constant while the other is changing. This is the multicollinearity problem.

The statistical problem is best understood by looking at the OLS variance of $\hat{\beta}$. For the coefficient on $x_t$ the variance is

$$ Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum_t (x_t - \bar{x})^2 \,(1 - r_{xz}^2)}, $$

where $r_{xz}$ is the correlation between $x_t$ and $z_t$; the expression for $\hat{\beta}_2$ is the same with $z_t$ in place of $x_t$. If the correlation is perfect, $r_{xz} = 1$, the denominator becomes zero and the calculation of the variance breaks down. Perfect multicollinearity means that the covariance matrix $E(X'X)^{-1}$ does not exist, and there is no solution to $\hat{\beta} = (X'X)^{-1}X'Y$. This is seldom a practical problem, since the computer program that calculates the estimates will break down when it tries to invert the matrix.[3]

Near, but less than perfect, multicollinearity, meaning that $r_{xz}$ is between zero and unity, is more complex. However, the problem is limited to the understanding of the estimated parameters; it does not concern the understanding of the model. As the variance formula shows, less than perfect multicollinearity leaves the residual variance of the model ($\sigma^2$) as it is, but inflates the estimated variances of the individual parameters. Historically, a number of measurements, remedies and quick fixes for multicollinearity have been suggested. None of these actually works.

[3] If the inversion process does not break down completely, the estimated variances of one or more parameters will be incredibly large.

In cross section studies a typical problem is to explain household consumption.
If you use household income, the number of rooms that the household possesses, the number of children and the size of the car as explanatory variables, you would not be surprised to learn that these explanatory variables are highly correlated with each other. As a consequence it might be hard to understand what the parameters are estimating. This example shows that "throwing in" explanatory variables without a clear economic model in the background will lead to problems. There is no substitute for economic theory in this example.

In time series modelling multicollinearity is often, somewhat mistakenly, linked to the estimation of lag lengths. Take the following distributed lag model as an example,

$$ y_t = \beta_1 x_t + \beta_2 x_{t-1} + \varepsilon_t. $$

If $x_t$ is an AR(p) process, the $x_t$ variables in the equation are of course correlated, meaning that we cannot hold $x_{t-1}$ constant and at the same time analyse the effect of varying $x_t$ on its own. On the other hand, we are not interested in changing one lag while keeping the rest fixed. In a time series regression, estimation aims at finding a sufficient number of lags to describe the dynamic process. However, since the lags are correlated with each other, this will affect the estimated variance of each lag. This makes it more difficult to determine the correct number of lags in a model if we were to check the fit of the model by looking at the t-values of the parameters only. Since model building should be aimed at finding a white noise innovation term, t-values are seldom used to decide the overall fit of the model. Instead we focus on misspecification tests of the model.

We can summarize the facts about multicollinearity as follows. There is no way to accurately measure the degree of multicollinearity and there are no quick fixes. Never, under any circumstances, can you delete some variables to "solve" the problem, as is suggested in some textbooks. Deleting variables means that you change the specification and the fit of the model.
Leaving out a relevant explanatory variable leads to a misspecified model, which creates bias in the estimates and affects inference. As shown in Hendry (1990 Ch. 6), multicollinearity is not a model problem or a misspecification problem; it has to do with the interpretation of the estimated parameters only, and not with the fit of the model. It can be shown how the variables in a given model can be transformed such that they become orthogonal to each other, without affecting the fit of the model. Returning to the example above, the interpretation of the parameters can be made clearer if we use the transformation $\Delta = 1 - L$,

$$ y_t = \beta_1 \Delta x_t + \beta_3 x_{t-1} + \varepsilon_t. \tag{10.3} $$

The transformation is just a reparameterization and does not affect the residual term. The parameter $\beta_3 = \beta_1 + \beta_2$ is the long-run static solution of the model. Thus we get an estimate of the short-run effect on $y_t$ from $\beta_1$ and at the same time a direct estimate of the static long-run solution from $\beta_3$. If the collinearity between $x_t$ and $x_{t-1}$ is high, it can be assumed to be quite small when we look at $\Delta x_t$ and $x_{t-1}$.

Since our final interest in modelling economic time series is to find a well-defined statistical model which mimics the DGP of the variable(s), multicollinearity is not really a problem. We will therefore not deal with this topic any further.

11. UNIVARIATE TESTS OF THE ORDER OF INTEGRATION

This section looks at a number of unit root tests, which can be applied to determine the order of integration of a variable. The following tests are presented:

DF-test: Dickey-Fuller test
ADF-test: Augmented Dickey-Fuller test
Z-test: Phillips and Perron's Z-test (To be included)
LMSP-test: Schmidt and Phillips LM test
KPSS-test: Kwiatkowski, Phillips, Schmidt and Shin test
G(p, q)-test: Park's G-test

The alternative hypotheses to having an integrated series are discussed in a following section.

11.0.1 The DF-test

The Dickey-Fuller test is one of the oldest tests.
The test builds on the assumed DGP $y_t = \rho y_{t-1} + \varepsilon_t$ with $\varepsilon_t \sim NID(0, \sigma^2)$, where $\rho = 1$ under a unit root. Given this DGP, subtract $y_{t-1}$ from both sides, so that $\pi = \rho - 1$, and estimate the equation

a) $\Delta y_t = \pi y_{t-1} + \varepsilon_t$,

or, put a constant term in the regression, to allow for the alternative of a deterministic trend in $y_t$,[1]

b) $\Delta y_t = \alpha + \pi y_{t-1} + \varepsilon_t$,

or, put both a constant and a time trend in the estimated equation, to allow for both a linear deterministic trend and a quadratic deterministic trend in $y_t$,

c) $\Delta y_t = \alpha + \pi y_{t-1} + \beta t + \varepsilon_t$,

where $\pi = 0$ if $y_t$ is I(1). In this regression we know that $\hat{\pi}$ will be biased downwards in a limited sample. Thus, we can put all the risk on the negative side and perform a one-sided test instead of a standard two-sided t-test. The one-sided t-test is $H_0: \pi = 0$, $y_t \sim I(1)$, against $H_1: \pi < 0$, $y_t \sim I(0)$. The correct "t-statistic" for testing the significance of $\hat{\pi}$ is tabulated in Fuller (1976), under the assumption that $y_t$ is a random walk, $\Delta y_t \sim N(0, \sigma^2)$. The correct distribution for the "t-test" can also be calculated from MacKinnon (1991) for the exact sample size at hand. In practice the differences are small, though. The t-statistics for the constant term and the trend term are tabulated in Dickey and Fuller (1980).

Notice that the null hypothesis is that $\Delta y_t = \varepsilon_t$, where $\varepsilon_t$ is white noise. The econometrician, however, will not know this in advance. S/he must therefore set up the estimated model so that there is a meaningful alternative hypothesis to the stochastic trend (or unit root) hypothesis. A general alternative is to assume that $y_t$ is driven by a combination of $t$ and $t^2$. It is therefore advisable, if $\varepsilon_t$ is white noise, to start with model c. If the t-value on $\hat{\pi}$ is significant according to the table in Fuller (1976), the null hypothesis of a unit root process is rejected.

[1] To understand why the constant represents a linear deterministic trend, go back to the discussion about the properties of the random walk process.
It follows then that the t-statistics for testing the significance of $\alpha$ and $\beta$ follow standard distributions. But, as long as the unit root hypothesis ($\pi = 0$) cannot be rejected, the t-statistics for both $\alpha$ and $\beta$ must be assumed to follow non-standard distributions. Thus, under the hypothesis that $\pi = 0$, the appropriate distributions for $\alpha$ and $\beta$ are found in Dickey and Fuller (1980). In a limited sample it might be wise to compare the outcome of both model c and model a. The test is easily extended to higher order unit roots, simply by performing the test on differenced data series.

When will the test go wrong? First, if $\varepsilon_t$ is not white noise. In principle, $\varepsilon_t$ can be an ARIMA process. In the following, a number of models dealing with this situation are presented. Second, if there is more than one unit root, then testing for one unit root is likely to be misleading. Hence a good testing strategy is to start by testing for two unit roots, which is done by applying the DF-test to the first difference of the series ($\Delta y_t$). If a unit root in $\Delta y_t$ is rejected, one can continue with testing for one unit root, using the series in level form, $y_t$.

11.0.2 The ADF-test

The DF-test, like all tests of I(1) versus I(0), is sensitive to deviations from the assumption $\varepsilon_t \sim NID(0, \sigma^2)$. The assumption of NID errors is critical to the simulated distributions in Fuller (1976). If there is autocorrelation in the residual process, the OLS estimated residuals will be inappropriate, and the residual variance estimate will be biased and inconsistent. The ADF-test seeks to solve the problem by augmenting the equations with lagged $\Delta y_t$,

$$ \Delta y_t = \pi y_{t-1} + \sum_{i=1}^{k} \gamma_i \Delta y_{t-i} + \varepsilon_t, \tag{11.1} $$

or

$$ \Delta y_t = \alpha + \pi y_{t-1} + \sum_{i=1}^{k} \gamma_i \Delta y_{t-i} + \varepsilon_t, \tag{11.2} $$

or

$$ \Delta y_t = \alpha + \pi y_{t-1} + \beta t + \sum_{i=1}^{k} \gamma_i \Delta y_{t-i} + \varepsilon_t. \tag{11.3} $$

The asymptotic test statistic is distributed as the DF-test, and the same recommendation applies to these equations: make sure there is a meaningful alternative hypothesis. Therefore start with the model including both a constant and a trend.
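A regression of the form (11.3) can be set up with ordinary least squares in a few lines. The following is a minimal sketch, assuming NumPy is available; the function name `adf_tstat`, the simulated series and the choice k = 2 are illustrative assumptions and not part of the text. It only computes the t-ratio on the lagged level. The ratio must still be compared with Dickey-Fuller critical values (roughly -3.4 at the 5 per cent level in large samples for the model with constant and trend), not with the standard t-table.

```python
import numpy as np

def adf_tstat(y, k):
    """t-ratio on y_{t-1} in an ADF regression with constant, trend and k lagged differences."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)                              # dy[j] = y[j+1] - y[j]
    n = len(dy) - k                              # usable observations after losing k lags
    cols = [np.ones(n),                          # constant
            np.arange(1, n + 1, dtype=float),    # linear time trend
            y[k:-1]]                             # lagged level y_{t-1}
    for i in range(1, k + 1):                    # augmentation: lagged differences
        cols.append(dy[k - i:len(dy) - i])
    X = np.column_stack(cols)
    z = dy[k:]                                   # dependent variable, the first difference
    coef, *_ = np.linalg.lstsq(X, z, rcond=None)
    resid = z - X @ coef
    s2 = resid @ resid / (n - X.shape[1])        # residual variance
    cov = s2 * np.linalg.inv(X.T @ X)            # OLS covariance matrix
    return coef[2] / np.sqrt(cov[2, 2])

# Illustrative data (an assumption): one random walk and one stationary AR(1).
rng = np.random.default_rng(0)
e = rng.standard_normal(500)
walk = np.cumsum(e)
ar = np.zeros(500)
for t in range(1, 500):
    ar[t] = 0.5 * ar[t - 1] + e[t]

t_walk = adf_tstat(walk, k=2)
t_ar = adf_tstat(ar, k=2)
print("t-ratio, random walk:  ", round(float(t_walk), 2))
print("t-ratio, stationary AR:", round(float(t_ar), 2))
```

For the stationary series the t-ratio should be strongly negative, while for the random walk it will usually stay above the Dickey-Fuller critical value. The lag length k is the modeller's choice and should be set so that the residuals show no autocorrelation.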
The ADF test is better than the original DF-test, since the augmentation leads to empirical white noise residuals. As for the DF-test, the ADF test must be set up in such a way that it has a meaningful alternative hypothesis, and higher order integration must be tested before the case of only one unit root.[2]

[2] Sjöö (2000b) explains in some detail how the test is used in practice.

The critical factor is to choose the length of the augmentation. Because $\Delta y_t$ is stationary, the distribution of the estimated lag coefficients is normal, and standard tests, including Q-tests and LM tests for serial correlation in the residuals, can be used. In small samples the augmentation might play an important role for the outcome of the test. No general rule can be established, other than that the residuals should not display autocorrelation. It is therefore up to the modeller to convince the readers (the critics) that the final verdict regarding the significance, or non-significance, of $\hat{\pi}$ rests on solid ground.

An additional complication is how to treat outliers in the sample. Outliers will affect the estimation, in particular the significance of the constant and the trend variable. If trends are significant under the null of a unit root process, according to the tabulations in Dickey and Fuller (1979), the conclusion is that the estimated coefficient on $y_{t-1}$ follows a normal distribution. Finding significant time trends often implies the rejection of a unit root. But if this is caused by an outlier affecting the estimation of the trend, one has to be careful in rejecting the unit root. In the case of significant trend variables leading to the rejection of the unit root hypothesis, some careful investigation of outliers is called for, to be secure against spurious regressions.

The DF and ADF tests are the best known tests, and are easily understood by most people. However, in limited samples and with $\varepsilon_t$ not being white noise, they are often quite inconclusive.
The tests should therefore be accompanied by graphs and perhaps other tests.

11.0.3 The Phillips-Perron test

The ADF-test tries to solve the problem of non-white noise residuals by adding lags of the dependent variable. It should be stressed that the ADF-test is quite adequate as a data descriptive device under the maintained hypothesis that the variables in a sample are integrated of order one. There are, however, a number of tests which try to improve on some of the weaknesses of the ADF-test. Phillips and Perron (1988) suggest a non-parametric correction of the test statistic, so that the Dickey-Fuller distribution can be used even in cases when the residual in the DF-test is not white noise. (The KPSS-test below is a more recent modification of the same principle.) The method starts from the estimated t-value ($t_{\hat{\pi}}$) and the estimated residuals from the DF equation. The test statistic, the t-value, is modified with the following formula,

$$ Z(t_{\hat{\pi}}) = \frac{S_{\varepsilon}}{S_{Tl}}\, t_{\hat{\pi}} - \frac{T\,(S_{Tl}^2 - S_{\varepsilon}^2)\,[\mathrm{std.er}(\hat{\pi})/s]}{2 S_{Tl}}, \tag{11.4} $$

where $s$ is the standard error of the regression (the square root of the residual variance) from the DF regression,

$$ S_{\varepsilon}^2 = T^{-1} \sum_{t=1}^{T} \hat{\varepsilon}_t^2, \tag{11.5} $$

and

$$ S_{Tl}^2 = T^{-1} \sum_{t=1}^{T} \hat{\varepsilon}_t^2 + 2T^{-1} \sum_{j=1}^{l} \left[1 - j(l+1)^{-1}\right] \sum_{t=j+1}^{T} \hat{\varepsilon}_t \hat{\varepsilon}_{t-j}. \tag{11.6} $$

The last expression is a non-parametric estimate of the long-run residual variance, using Bartlett's triangular window. The critical factor is to determine the size of the lag window $l$.

11.0.4 The LMSP-test

Start with the following DGP, $y_t = \alpha + \beta t + x_t$ and $x_t = \rho x_{t-1} + \varepsilon_t$, where $\varepsilon_t \sim NID(0, \sigma^2)$. Under a unit root, $H_0: \rho = 1$. To test, run the following regression,

$$ \Delta y_t = \alpha + \phi \hat{S}_{t-1} + u_t, $$

where $\hat{S}_t = \sum_{j=2}^{t} \left[\Delta y_j - (y_T - y_1)/(T-1)\right]$. Schmidt and Phillips (1992) simulated the t-statistic for $\hat{\phi}$.

11.0.5 The KPSS-test

This test is calculated by RATS 4. The DGP is assumed to be

$$ y_t = \beta t + r_t + \varepsilon_t, $$

where $r_t = r_{t-1} + v_t$, $\varepsilon_t \sim NID(0, \sigma^2)$ and $v_t \sim NID(0, \sigma_v^2)$. The null hypothesis is that $y_t$ is stationary. The test is $H_0: \sigma_v^2 = 0$, against $H_1: \sigma_v^2 > 0$.
Start by estimating the following equation,

$$ y_t = \alpha + \beta t + e_t, \tag{11.7} $$

and use the estimated residuals to construct the following LM test statistic,

$$ \eta = T^{-2} \sum_{t=1}^{T} S_t^2 / s^2(k), \tag{11.8} $$

where

$$ S_t = \sum_{i=1}^{t} \hat{e}_i \tag{11.9} $$

and

$$ s^2(k) = T^{-1} \sum_{t=1}^{T} \hat{e}_t^2 + 2T^{-1} \sum_{s=1}^{k} w(s,k) \sum_{t=s+1}^{T} \hat{e}_t \hat{e}_{t-s}. \tag{11.10} $$

The critical values for the test are given in Kwiatkowski et al. (1992). A Bartlett type window, $w(s,k) = 1 - [s/(k+1)]$, is used to correct the estimated variance so that the sample test statistic corresponds to the simulated distribution, which is based on white noise residuals.

The KPSS test appears to be powerful against the alternative of a fractionally integrated series. That is, a rejection of I(0) does not lead to I(1), as in most unit root tests, but rather to an I(d) process where $0 < d < 1$. This type of series is called fractionally integrated. A high value of $d$ implies a long memory process. In contrast to an integrated series, I(1) or I(2) etc., a fractionally integrated series is mean reverting; see Baillie and Bollerslev (1994).

11.0.6 The G(p, q) test

This test builds on the conclusion that for a unit root variable, the estimated residuals are inappropriate and will indicate that unrelated variables are statistically significant (spurious regression). Therefore estimate

$$ 1: \quad y_t = \alpha + \beta t + \varepsilon_{1t}, \tag{11.11} $$

$$ 2: \quad y_t = \alpha + \beta_1 t + \beta_2 t^2 + \varepsilon_{2t}, \tag{11.12} $$

where $t^2$ is a superfluous variable. Calculate the following test statistic,

$$ G(1,2) = (RSS_1 - RSS_2)/s^2(k), \tag{11.13} $$

where $RSS_1$ and $RSS_2$ are the residual sums of squares from model 1 and 2 respectively, and $s^2(k)$ is as above.

We can conclude that among these tests, the ADF test is robust as long as the lag structure is correctly specified. The gains from correcting the estimated residual variance seem to be small.

11.1 The Alternative Hypothesis in I(1) Tests

Rejecting one unit root does not necessarily mean that one can accept the alternative of an I(0) series.
Sometimes a unit root test will reject the assumption of a unit root even though the series is clearly non-stationary. There are several alternatives to the I(1) hypothesis:

- The series is actually I(0).
- The series is driven by a deterministic rather than a stochastic trend.
- The series contains more than one unit root.[3]
- The series is driven by segmented trends, meaning that there are different deterministic trends for different sub-periods.
- The series contains fractionally integrated trends. It has an ARFIMA representation (AutoRegressive Fractionally Integrated Moving Average).
- The series is non-stationary, but driven by some (to us) unknown trend process.

Tests for deterministic trends and more than one unit root follow straightforwardly from the section above and are not discussed here.

The segmented trend approach was launched by Perron (1989). He argues that few series really are I(1). If we have detailed knowledge about the data generating process, we might establish that series have different deterministic trends for different time periods. The fact that these segmented trends shift over time implies that unit root tests cannot reject the hypothesis of an integrated variable. Thus, instead of detecting the correct deterministic trend(s), the test approximates the changing deterministic trend with a stochastic trend. Perron (1989) demonstrates this fact and derives a test for a known break date in the series. Banerjee et al. (1992) develop a test for an unknown break date. The problem with this approach is that we somehow have to estimate these segmented trends. Sometimes it will be possible to argue for segmented trends, like World War One and Two, etc., but in principle we are left more or less with ad hoc estimates of what might be segmented trends.
[3] Testing for integration should be done according to the Pantula Principle: since higher order integration dominates lower order integration, test from higher to lower order, and stop when it is not possible to reject the null. For instance, a test for I(1) vs. I(0) assumes that there are no I(2) processes. The presence of higher order integration might ruin the test for lower order integration; therefore start with I(2), and only if I(2) is rejected will it be meaningful to test for I(1), etc.

11.2 Fractional Integration

For the class of integrated series discussed above, the order of differencing was assumed to be $d = 1$. The choice between $d = 0$ and $d = 1$ might be too restrictive in some situations, especially if unit root tests reject I(0) in favour of the I(1) hypothesis when we have theoretical information suggesting that I(1) is implausible or highly unrealistic. For example, unit root tests might find that both the forward and the spot foreign exchange rates are I(1), and that the forward premium $(f - s)$, the log difference, is also I(1), indicating no mean reversion in this difference series, and that the forward and the spot rates are not co-integrating. The expectations part of the forward rate would therefore be extremely small or irrational in some sense, so that the risk premiums are causing the I(1) behavior.

Autoregressive Fractionally Integrated Moving Average (ARFIMA) models represent a more general class of models than ARMA and ARIMA models, see Granger and Joyeux (1980) and Granger (1980). The ARFIMA$(p, d, q)$ model is defined as

$$ \phi(L)(1 - L)^d y_t = \mu + \theta(L)\varepsilon_t, \tag{11.14} $$

where $d$ is the fractional differencing parameter. The difference operator $(1 - L)^d$ is defined in terms of its Maclaurin series expansion. The difference operator works in the same way as for ARIMA models: applying the operator to $y_t$ results in $(1 - L)^d y_t = z_t$, where $z_t$ has an ARMA representation.
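The Maclaurin expansion of the fractional difference operator can be written as $(1 - L)^d = \sum_{j \ge 0} w_j L^j$ with $w_0 = 1$ and $w_j = w_{j-1}(j - 1 - d)/j$, which gives $w_1 = -d$, $w_2 = d(d-1)/2$, and so on. The sketch below, assuming NumPy, computes these weights and applies them to a series. The function names are illustrative, and the infinite sum is simply truncated at the start of the sample, which is an approximation.

```python
import numpy as np

def frac_diff_weights(d, n):
    """First n coefficients of the expansion of (1 - L)^d."""
    w = np.empty(n)
    w[0] = 1.0
    for j in range(1, n):
        w[j] = w[j - 1] * (j - 1 - d) / j       # recursion from the binomial series
    return w

def frac_diff(y, d):
    """Apply (1 - L)^d to y, truncating the expansion at the first observation."""
    y = np.asarray(y, dtype=float)
    w = frac_diff_weights(d, len(y))
    # z_t = w_0 y_t + w_1 y_{t-1} + ... + w_t y_0
    return np.array([w[:t + 1] @ y[t::-1] for t in range(len(y))])

print(frac_diff_weights(1.0, 4))   # d = 1 reduces to the ordinary first difference
print(frac_diff_weights(0.4, 4))   # 0 < d < 1: weights decay slowly instead of cutting off
```

With $d = 1$ only the first two weights are non-zero, so the operator collapses to $y_t - y_{t-1}$. For fractional $d$ every past observation receives some weight, which is the long-memory property discussed in this section.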
The FI operator transforms the original series into a series which has an ARMA representation. Once the long-run memory is removed, the standard techniques for identifying the ARMA process can be applied.

The difference between ARIMA and ARFIMA models is that the latter allow for a more complex memory process. The Wold theorem says that any non-deterministic series has an infinite MA representation like

$$ y_t = \sum_{i=0}^{\infty} \psi_i \varepsilon_{t-i}, \tag{11.15} $$

where $\varepsilon_t \sim iid(0, \sigma^2)$ and $\sum_{i=0}^{\infty} \psi_i^2 < \infty$. If this series also belongs to the class of series which have an ARMA representation, the autocorrelation function will die out exponentially. For an I(1) series the autocorrelation function will display complete persistence; the theoretical autocorrelation function is unity for all lags. Because the autocorrelation function of an ARMA process dies out exponentially, it can be said to have a relatively short memory compared to series whose autocorrelation functions do not die out as quickly. ARFIMA series therefore represent long memory time series. The ARFIMA model allows the autocorrelation coefficients to exhibit hyperbolic patterns. For $d < 1$ the series is mean reverting; for $-0.5 < d < 0.5$ the ARFIMA series is covariance stationary.

For a statistician who is describing the behavior of a time series, an ARFIMA model might offer a better representation than the more traditional ARMA model, see Diebold and Rudebusch (1989) and Sowell (1992). For an econometrician, however, the economic understanding is of equal importance. The standard question in most economic work is whether to use levels or percentage growth rates of the data in order to construct models with known distributions. That means deciding whether series are I(0) or I(1). Fractional integration does not affect these problems. It becomes important when we ask specific questions about the type of long-run memory we are dealing with, such as whether there is mean reversion in the forward premium, the real exchange rate, asset prices, etc.
Thus, only when economic theory gives us a reason for testing something other than I(0) and I(1) is fractional integration of interest. For applications of long-memory tests in general see Lo (1991) and Cheung and Lai (1995).

12. NON-STATIONARITY AND CO-INTEGRATION

Most macroeconomic and finance variables are non-stationary. This has enormous consequences for the use of statistical methods in economics research. Statistical theory assumes that variables are stationary; if they are not stationary, statistical inference is generally not possible. It doesn't matter that numerous old textbooks in econometrics and research papers have ignored the problem. The problems associated with non-stationary variables in econometrics have been known since the 1920s, but didn't get a solution until the end of the 1980s. In principle there are two ways of dealing with non-stationarity: you must either remove the non-stationarity before setting up the econometric model, or set up a model of non-stationary variables that form a stationary relation. Typically, in neither of these cases can you use standard inference based on t-, chi-square or F-distributions.

Now, variables can be non-stationary in an infinite number of ways. In practice, there are broadly two types of non-stationary variables of interest in econometrics. The first type are variables stationary around a deterministic trend. The second type are variables stationary around a stochastic trend. Stochastic trend variables are also known as integrated variables. Most variables in economics and finance seem to be driven by stochastic trends. The problem with stochastic trend variables (integrated variables) is not only that they do not follow standard distributions; if you try to use standard distributions, you will most likely be fooled into thinking there are significant relations when in fact there is no relation.
This is known as the spurious regression problem in the literature.

Historically, trends were dealt with by removing what people assumed was a linear deterministic trend. This was done in the following way. The non-stationary variable was regressed against a constant and a linear trend variable,

$$ y_t = \alpha + \beta t + \tilde{y}_t, \tag{12.1} $$

where $t$ is a deterministic time trend, defined as $t = 1, 2, \ldots, T$. The residual $\tilde{y}_t$ in this regression represents the de-trended $y_t$ series, which was then used in regression models with other stationary or detrended variables. In the equation above $\alpha$ becomes a combination of the sample mean of $y_t$ and the average of the time variable. In general, deterministic trend removal can be done with models including polynomial deterministic trends, such as

$$ y_t = \alpha + \beta_1 t + \beta_2 t^2 + \ldots + \beta_n t^n + \tilde{y}_t. \tag{12.2} $$

This approach of fitting deterministic trends can be extended to cyclical trends, using trigonometric functions in combination with the time trend. In the literature there are various deterministic filters that aim at removing long-run (supposedly deterministic) trends, such as the so-called Hodrick-Prescott filter. However, if the series is driven by a stochastic trend, the estimated parameters of these models will not follow standard distributions, and the regression will impose a spurious autocorrelation pattern on the spuriously detrended variable $\tilde{y}_t$. Thus, until you have investigated the non-stationary properties of the series and tested for stochastic trends (order of integration), it is not possible to do any econometric modelling.

Deterministic trends are seldom the best choice for economic time series. Instead the non-stationary behaviour is often better described with stochastic trends, which have no fixed trend that can be predicted from period to period. A random walk serves as the simplest example of a stochastic trend.
Starting from the model

$$ y_t = y_{t-1} + v_t, \quad \text{where } v_t \sim NID(0, \sigma^2), \tag{12.3} $$

repeated substitution backwards leads to

$$ y_t = y_0 + \sum_{i=1}^{t} v_i. \tag{12.4} $$

The expression shows how the random walk variable is made up of the sum of all historical white noise shocks to the series. The sum represents the stochastic trend. The variable is non-stationary, but we cannot predict how it changes, at least not by looking at the history of the series. (See also the discussion above concerning random walks in the section about different stochastic processes.)

The stochastic trend term is removed by taking the first difference of the series. In the random walk case this implies that $\Delta y_t = v_t$ is a stationary variable with constant mean and variance. Variables driven by stochastic trends are also called integrated variables, because the sum process represents the integrated property of these variables.

A generic representation is the combination of deterministic and stochastic trends,

$$ y_t = \alpha + \beta t + \mu_t + \tilde{y}_t, \tag{12.5} $$

where $\mu_t = \mu_{t-1} + v_t$, $v_t$ is $NID(0, \sigma^2)$, $\beta t$ is the deterministic trend and $\tilde{y}_t$ is a stationary process representing the stationary part of $y_t$. In this model, the stochastic trend is represented by $\sum_{i=1}^{t} v_i$.

An alternative trend representation is segmented deterministic trends, illustrated by the model

$$ y_t = \alpha + \beta_1 t_1 + \beta_2 t_2 + \ldots + \beta_k t_k + \tilde{y}_t, \tag{12.6} $$

where $t_1$, $t_2$, etc. are deterministic trends for different periods, such as wars, or policy regimes such as exchange rate regimes, monetary policy, etc. Segmented trends are an alternative to stochastic trends, see Perron (1989), but the problem is that the identification of these different trends might be ad hoc. Given a suitable choice of trends, almost any empirical series can be made stationary, but are the different trends really picking up anything interesting that is not embraced by the assumption of stochastic trends, arising from innovations with permanent effects on the economy?
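The link between the recursion (12.3), the summed form (12.4) and the first difference can be checked numerically. The sketch below assumes NumPy; the starting value, sample size and number of simulated paths are arbitrary illustrative choices. It also illustrates why the level is non-stationary: with unit shock variance, the variance of $y_t$ across many simulated paths grows roughly in proportion to $t$.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200
y0 = 5.0                                   # arbitrary starting value (assumption)
v = rng.standard_normal(T)                 # shocks v_1, ..., v_T

# Summed form: y_t = y_0 + sum of all shocks up to t
y = y0 + np.cumsum(v)

# Recursive form: y_t = y_{t-1} + v_t
y_rec = np.empty(T)
prev = y0
for t in range(T):
    prev = prev + v[t]
    y_rec[t] = prev

# First differencing removes the stochastic trend and returns the shocks
dy = np.diff(np.concatenate(([y0], y)))

print("recursion equals cumulative sum:", np.allclose(y, y_rec))
print("first difference equals shocks: ", np.allclose(dy, v))

# Variance of the level across many independent paths grows with t
paths = np.cumsum(rng.standard_normal((5000, T)), axis=1)
var_t = paths.var(axis=0)
print("variance at t=50 and t=200:", var_t[49].round(1), var_t[199].round(1))
```

The first two checks are exact identities. The last line is a Monte Carlo estimate, so the printed variances will only be approximately 50 and 200.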
12.0.1 The Spurious Regression Problem

Most macroeconomic time series display non-stationarity and appear to be driven by stochastic trends. Regression with these variables leads to the danger of parameter estimates with non-standard distributions, which makes inference much more difficult. The spurious regression problem was introduced in an article by Granger and Newbold in 1973. Granger and Newbold generated two random walk series, which were independent of each other by construction. Let the two variables be $x_t$ and $y_t$, with first differences $\Delta y_t \sim NID(0, \sigma_y^2)$ and $\Delta x_t \sim NID(0, \sigma_x^2)$, and let $y_t$ and $x_t$ be independent by construction. Next, consider the linear regression of $y_t$ on $x_t$,

$$ y_t = \alpha + \beta x_t + \varepsilon_t. \tag{12.7} $$

Since $y_t$ and $x_t$ are independent there is no relation between them: $\beta$ must be zero, and we would expect the estimate $\hat{\beta}$ to go to zero as the sample size increases, with $t_{\hat{\beta}} \sim N(0, 1)$. If we repeat the regression with new independent random walks, we expect that in 5 per cent of the tests we would be unlucky and erroneously conclude that there is significance even though the true value of $\beta$ is zero.

However, this is not what happens. Granger and Newbold studied the empirical distribution of the regression above. They ran 1000 regressions and found that the distribution of the t-statistic of $\hat{\beta}$ was the opposite of what we expect. In 95 % of the regressions we find a significant relation, even though the true rate should be 5 %. Asymptotically the t-value of $\hat{\beta}$ approached 2.0. The problem got worse when more independent random walks were put into the equation. Granger and Newbold also found that the reported $R^2$ values became relatively high while the Durbin-Watson value became low.

Later in the 1980s, researchers such as Peter Phillips showed that, due to the integrated properties of the variables, their sample moments converge to functions of Wiener processes (Brownian motions).
The sample moments will not converge to constants, as in the case of stationary stochastic regressors. Instead, the sample moments converge to random variables which are functions of Wiener processes. In this situation, with two (or more) random walk variables regressed against each other, the t-statistics will approach 2.0 instead of 0.0. Thus, by using the t-distribution to test the null of no correlation between the variables, one will be fooled into rejecting the assumption of no correlation. This is the spurious regression problem. It is caused by parameter estimates which are not distributed according to the normal distribution, not even in the long run.

In practical work, that is when using limited samples, this will occur not only when regressing random walk variables, but also when regressing integrated variables or near-integrated variables. Near-integrated variables are a classification of variables which, in a limited sample, look and behave like integrated variables. An autoregressive process with an autoregressive parameter close to unity (say 0.9) can be called near integrated. In these situations, the distribution theory of integrated variables is a much better approximation than the standard normal.

12.0.2 Integrated Variables and Co-integration

Normally, a linear combination of integrated variables will also be integrated of the same order as the individual variables. The exception to this rule is called co-integration, when a linear combination of integrated variables results in a lower order of integration. So, in the linear regression above, since both $y_t$ and $x_t$ are integrated of order one, I(1), and independent, the residual term $\varepsilon_t$ will be integrated of order one, I(1), as well. In the case when the two I(1) variables share the same stochastic trend and form an I(0) residual, we say that they are co-integrating.
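The contrast between a spurious regression and a co-integrating regression can be illustrated with a small simulation. The sketch below assumes NumPy; the sample sizes, the number of replications, the helper names and the co-integrating coefficient 2.0 are all illustrative assumptions, and the exact numbers will vary with the random seed. It first counts how often a nominal 5 per cent t-test "finds" a relation between independent series, and then compares the first-order autocorrelation of the residuals from a regression of two independent random walks with that from a regression of two series sharing the same stochastic trend.

```python
import numpy as np

def slope_tstat(y, x):
    """OLS of y on a constant and x; returns the slope t-ratio and the residuals."""
    X = np.column_stack([np.ones(len(x)), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - 2)
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return b[1] / se, resid

def lag1_autocorr(e):
    """First-order sample autocorrelation of a series."""
    e = e - e.mean()
    return (e[1:] @ e[:-1]) / (e @ e)

rng = np.random.default_rng(42)
T, reps = 100, 500

def rejection_rate(integrated):
    """Share of replications with |t| > 1.96 for two independent series."""
    hits = 0
    for _ in range(reps):
        ex, ey = rng.standard_normal(T), rng.standard_normal(T)
        x = np.cumsum(ex) if integrated else ex   # random walk or white noise
        y = np.cumsum(ey) if integrated else ey
        t, _ = slope_tstat(y, x)
        hits += abs(t) > 1.96
    return hits / reps

rate_walks = rejection_rate(True)     # independent random walks
rate_noise = rejection_rate(False)    # independent white noise, for comparison

# Residual behaviour: independent walks versus a co-integrated pair
n = 500
x = np.cumsum(rng.standard_normal(n))
y_indep = np.cumsum(rng.standard_normal(n))     # own stochastic trend
y_coint = 2.0 * x + rng.standard_normal(n)      # shares the trend of x
_, resid_indep = slope_tstat(y_indep, x)
_, resid_coint = slope_tstat(y_coint, x)
rho_indep = lag1_autocorr(resid_indep)
rho_coint = lag1_autocorr(resid_coint)

print("rejection rate, random walks:", rate_walks)
print("rejection rate, white noise: ", rate_noise)
print("residual autocorrelation, independent walks:", round(float(rho_indep), 2))
print("residual autocorrelation, co-integrated pair:", round(float(rho_coint), 2))
```

With stationary white noise the rejection rate should be close to the nominal 5 per cent, while for independent random walks it is far higher. The residuals from the spurious regression inherit the stochastic trend and are highly autocorrelated, whereas the residuals from the co-integrated pair behave like a stationary series.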
The intuition here is that for two variables to form a meaningful long-run relationship, they must share the same trend. Otherwise they will drift away from each other as time elapses. Therefore, to build econometric models which make sense in the long run, we have to investigate the trend properties of the variables and determine the type of trend and whether the variables are co-trending and co-integrating or not. In econometric work, trend properties refer to the properties of the sample and how to do inference. It is not a theoretical concept about how economic variables grow in the long run. Once we have clarified the trend properties, it becomes possible to establish stationary relations and models, econometric modeling can proceed as usual, and standard techniques for inference can be used.

Definitions:

Definition 1 A series with no deterministic component which has a stationary and invertible autoregressive moving average (ARMA) representation after differencing $d$ times, but which is not stationary after differencing $d-1$ times, is said to be integrated of order $d$, denoted $x_t \sim I(d)$.

Definition 2 The components of the vector $x_t$ are said to be co-integrated of order $d, b$, denoted $x_t \sim CI(d, b)$, if (i) $x_t$ is $I(d)$ and (ii) there exists a non-zero vector $\beta$ such that $\beta' x_t \sim I(d-b)$, $d \geq b > 0$. The vector $\beta$ is called the co-integrating vector. (Adapted from Engle and Granger (1987).)

Remark 1 If $x_t$ has more than two elements there can be more than one co-integrating vector $\beta$.

Remark 2 The order of integration of the vector $x_t$ is determined by the element which has the highest order of integration. Thus, $x_t$ can in principle have variables integrated of different orders.

A related definition concerns the error correction representation following from co-integration.

Definition 3 A vector time series $x_t$ has an error-correction representation if it can be expressed as

$$A(L)(1-L)x_t = -\gamma z_{t-1} + \omega_t,$$

where $\omega_t$ is a stationary multivariate disturbance term, with $A(0) = I$, $A(1)$ having only finite elements, $z_t = \beta' x_t$, and $\gamma$ a non-zero vector. For the case where $d = b = 1$, and with co-integrating rank $r$, the Granger Representation Theorem holds. (Adapted from Banerjee et al. (1993).)

Remark 3 This definition and the Granger Representation Theorem (Engle and Granger, 1987) tell us that if there is co-integration then there is also an error correction representation, and there must be Granger causality in at least one direction.

12.0.3 Approaches to Testing for Co-integration

Under the general null hypothesis of independent and integrated variables, estimated variances and test statistics do not follow standard distributions. Therefore the way ahead is to test for co-integration, and then try to formulate a regression model (or system) in terms of stationary variables only. Traditionally there are two approaches to testing for co-integration: residual based approaches and other approaches. The first type starts with the formulation of a co-integrating regression, a regression model with integrated variables. Co-integration is then determined by investigating the residual(s) from that regression. The Engle and Granger two-step procedure and the Phillips-Ouliaris test are examples of this approach. The other approach is to start from some representation of a co-integrated system (VAR, VECMA, etc.) and test for some specific characterization of co-integrated systems. Johansen's VECM approach and tests for common trends are examples.

The Engle and Granger two-step procedure is the easiest and most used residual based test. It is used because of its simplicity, but it is not a good test. The two-step procedure starts with the estimation of the co-integrating regression.
If $y_t$ and $x_t$ are two variables integrated of order one, the first step is to estimate the following OLS regression,

$$y_t = \alpha + \beta x_t + z_t, \tag{12.8}$$

where the estimated residuals are $\hat z_t$. If the variables are co-integrating, $\hat z_t$ will be I(0). The second step is to perform an Augmented Dickey-Fuller unit root test on the estimated residual,

$$\Delta \hat z_t = \mu + \pi \hat z_{t-1} + \sum_{i=1}^{k} \gamma_i \Delta \hat z_{t-i} + \varepsilon_t. \tag{12.9}$$

If $y_t$ and $x_t$ share a common trend and co-integrate, the residual must be a stationary process. If they do not share a common trend, they do not co-integrate, the parameter $\pi$ must be zero, and the residual $z_t$ must be non-stationary and integrated of the same order as $y_t$. If the null, $H_0: \pi = 0$, is rejected in favour of the alternative $H_A: \pi < 0$, we conclude that the variables are co-integrated, and that the long-run co-integrating parameter is $\beta$. Furthermore, we can refer to the OLS regression as the co-integrating regression. We know that the residual is stationary, $\hat z_t$ is I(0), and therefore $\hat z_{t-1}$ can be used as an error correction term, identifying the long-run steady state relation between $y_t$ and $x_t$.

The relevant test statistics are not the ones tabulated by Fuller (1976). Instead you have to look at new simulated tables in Engle and Granger (1987), Engle and Yoo (1987), or Banerjee et al. (1993). The reason is that the unit root test is now performed, not on a univariate process, but on a variable constructed from several stochastic processes. The test statistic will change depending on how many explanatory variables there are in the model.

Remark 4 Remember that the t-statistics, and the estimated standard deviations, from the co-integrating regression must be treated with caution, even if we find co-integration. Unless $x_t$ is exogenous, the estimated parameters follow unknown non-normal distributions, even asymptotically.

Remark 5 For the outcome of the test, it will not matter which variable is chosen to be the dependent variable.
As an economist you might favour setting one variable as dependent and understand the parameters as long-run economic parameters (elasticities etc.).

There are a number of problems with the Engle and Granger two-step procedure. The first is that the tabulated (non-standard) test statistic assumes white noise residuals. The augmentation tries to deal with this, but it is in most cases only a crude approximation. Second, the test assumes a common factor in the dynamic processes of $y_t$ and $x_t$. In practice this restriction is quite restrictive, and the test will not behave well when it does not hold. The dynamics of the two processes and their possible co-integrating relation are usually more complex. Third, the test assumes that there is only one co-integrating vector. If we test for co-integration between two variables this is not a problem, because then there can be only one co-integrating vector. Suppose that we add another I(1) variable ($u_t$) to the co-integrating regression equation,

$$y_t = \alpha + \beta_1 x_t + \beta_2 u_t + \eta_t. \tag{12.10}$$

If $y_t$ and $x_t$ are co-integrating, they already form one linear combination ($z_t$) which is stationary. If $u_t \sim I(1)$ is not co-integrating with the other variables, OLS will set $\beta_2$ to zero, and the estimated residual $\hat\eta_t$ is I(0). This is why the test will only work if there is only one co-integrating vector among the variables. If $y_t$ and $x_t$ are not co-integrating, then adding $u_t \sim I(1)$ might lead to a co-integrating relation. Thus, in this respect the test is limited, and testing must be done by creating logical chains of bivariate co-integration hypotheses.

Other residual based tests try to solve at least the first problem by adjusting the test statistic in the second step, so that it always fulfills the criteria for testing the null correctly. Some approaches try to transform the co-integrating regression in such a way that the estimated parameters follow a standard normal distribution.
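The mechanics of the two-step procedure can be sketched in a few lines. This is a bare-bones illustration, not a complete test: the helpers `ols` and `engle_granger_t`, the simulated data and the lag length `k=1` are assumptions made here, and the resulting t-statistic must be compared with the special tables cited in the text, not with the standard t or Dickey-Fuller tables.

```python
import numpy as np

def ols(y, X):
    """OLS coefficients, residuals and coefficient standard errors."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return coef, resid, se

def engle_granger_t(y, x, k=1):
    """Two-step procedure: returns (alpha, beta) and the ADF t-statistic."""
    # Step 1: co-integrating regression y_t = alpha + beta * x_t + z_t.
    coef1, z_hat, _ = ols(y, np.column_stack([np.ones(len(y)), x]))
    # Step 2: ADF regression of the differenced residual on a constant,
    # the lagged residual level and k lagged residual differences.
    dz = np.diff(z_hat)
    lhs = dz[k:]
    cols = [np.ones(len(lhs)), z_hat[k:-1]]
    cols += [dz[k - i:len(dz) - i] for i in range(1, k + 1)]
    coef2, _, se2 = ols(lhs, np.column_stack(cols))
    return coef1, coef2[1] / se2[1]

rng = np.random.default_rng(5)                # assumed seed
T = 500
x = np.cumsum(rng.standard_normal(T))         # I(1) regressor
shocks = rng.standard_normal(T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.5 * u[t - 1] + shocks[t]         # stationary AR(1) deviation
y_coint = 1.0 + 2.0 * x + u                   # co-integrated with x
y_indep = np.cumsum(rng.standard_normal(T))   # unrelated random walk

beta_c, t_coint = engle_granger_t(y_coint, x)
beta_i, t_indep = engle_granger_t(y_indep, x)
print(f"co-integrated pair: beta = {beta_c[1]:.3f}, ADF t = {t_coint:.2f}")
print(f"independent pair:   beta = {beta_i[1]:.3f}, ADF t = {t_indep:.2f}")
```

A strongly negative statistic points towards a stationary residual; the independent pair illustrates the case where the null of no co-integration should not be rejected.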
A better alternative for testing for co-integration among more than two variables is offered by Johansen's test. This test finds long-run steady-state, or co-integrating, relations in the VAR representation of a system. Let the VAR,

$$A_k(L) x_t = \Phi D_t + \varepsilon_t, \tag{12.11}$$

represent the system. The VAR is a $p$-dimensional system, the variables are assumed to be integrated of order $d$, $\{x\}_t \sim I(d)$, $D_t$ is a vector of deterministic variables (constants, dummies, seasonals and possible trends), and $\Phi$ is the associated coefficient matrix. The residual process is normally distributed white noise, $\varepsilon_t \sim NID(0, \Sigma)$. It is important to find the optimal lag length in the VAR, and to have normally distributed error terms in addition to white noise, because the test uses a full information maximum likelihood (FIML) estimator. ML estimators are notoriously sensitive to small samples and misspecification, which is why care must be taken in the formulation of the VAR.

Once the VAR has been found, it can be rewritten in error correction form,

$$\Delta x_t = \Pi x_{t-1} + \sum_{i=1}^{k-1} \Gamma_i \Delta x_{t-i} + \Phi D_t + \varepsilon_t. \tag{12.12}$$

In practical use the problem is to formulate the VAR; the program will rewrite the VAR for the user automatically. Johansen's test builds on the knowledge that if $x_t$ is $I(d)$, co-integration implies that there exist vectors $\beta$ such that $\beta' x_t \sim I(d-b)$. In a practical situation we will assume that $x_t \sim I(1)$, and if there is co-integration, $\beta' x_t \sim I(0)$. If there is co-integration, the matrix $\Pi$ must have reduced rank. The rank of $\Pi$ indicates the number of independent rows in the matrix. Thus, if $x_t$ is a $p$-dimensional process, the rank ($r$) of the $\Pi$ matrix determines the number of co-integrating vectors ($\beta$), or the number of linear steady state relations among the variables in $\{x\}_t$. Zero rank ($r = 0$) implies no co-integrating vectors, full rank ($r = p$) means that all variables are stationary, while a reduced rank ($0 < r < p$) means co-integration and the existence of $r$ co-integrating vectors among the variables.
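The three rank cases can be illustrated numerically. The sketch below assumes the VAR is written as $x_t = A_1 x_{t-1} + \dots + A_k x_{t-k} + \varepsilon_t$, so that $\Pi = A_1 + \dots + A_k - I$; the helper `pi_matrix` and the coefficient values are hypothetical. In applied work the rank has to be tested statistically, since an estimated $\Pi$ is never exactly of reduced rank.

```python
import numpy as np

def pi_matrix(A_list):
    """Pi = A_1 + ... + A_k - I for x_t = A_1 x_{t-1} + ... + A_k x_{t-k} + e_t."""
    p = A_list[0].shape[0]
    return sum(A_list) - np.eye(p)

I2 = np.eye(2)
alpha = np.array([[-0.2], [0.1]])   # hypothetical adjustment coefficients
beta = np.array([[1.0], [-1.0]])    # hypothetical co-integrating vector

cases = {
    "two independent random walks": [I2],
    "co-integrated system": [I2 + alpha @ beta.T],
    "stationary system": [0.5 * I2],
}

ranks = {}
for name, A_list in cases.items():
    Pi = pi_matrix(A_list)
    ranks[name] = int(np.linalg.matrix_rank(Pi))
    print(f"{name}: rank(Pi) = {ranks[name]}")
```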
The procedure is to estimate the eigenvalues of $\Pi$ and determine their significance.1 However, under the null of no co-integration, these estimates have non-standard distributions which depend on whether there is a deterministic trend and/or a constant term in the model. The test statistic is only known asymptotically, and for a closed system without exogenous variables. In other situations the decision must be based on viewing the test statistics as approximations.

Once the rank of $\Pi$ is known, the matrix can be rewritten as $\Pi = \alpha\beta'$ such that $\beta' x_t$ forms stationary co-integrating relations. The $\beta$ are co-integrating parameters, and the $\alpha$ represent the adjustment parameters. The significance of the alphas can be determined by ordinary t-tests since they are associated with stationary relations, $\beta' x_t \sim I(0)$.

Footnote 1: The test is called the Trace test and its use is explained in Sjö, Guide to testing for ...

Finding the VECM

In practical use the problem is to formulate the VAR; the program will rewrite the VAR for the user and present the estimated $\alpha$ and $\beta$ vectors. Sometimes it is necessary to understand how the VECM is found. Consider the 2-dimensional VAR model, where the deterministic terms have been removed for simplicity,

$$y_t = a_{11} y_{t-1} + a_{12} y_{t-2} + a_{13} z_{t-1} + a_{14} z_{t-2} + e_{1t} \tag{12.13}$$

$$z_t = a_{21} z_{t-1} + a_{22} z_{t-2} + a_{23} y_{t-1} + a_{24} y_{t-2} + e_{2t}. \tag{12.14}$$

Start with the first equation and subtract $y_{t-1}$ from both sides of the equality sign. This gives

$$\Delta y_t = (a_{11} - 1) y_{t-1} + a_{12} y_{t-2} + a_{13} z_{t-1} + a_{14} z_{t-2} + e_{1t}.$$

Since the equation was correctly specified from the beginning, it can be transformed as long as we do not do anything that affects the properties of the error term. Our aim is to split all lag terms into first differences and lagged levels in such a way that the model consists of one lag at $t-1$ for all variables in levels, and first differences.
We can do this by using the difference operator, $\Delta = (1 - L)$, which can be used as $\Delta y_t = y_t - y_{t-1}$, or $y_{t-1} = y_t - \Delta y_t$. In terms of the operators we have $L = (1 - \Delta)$, or $L y_t = (y_t - \Delta y_t)$. If we apply this to all lags further back than $t-1$, we get for $t-2$ the following: $y_{t-2} = y_{t-1} - \Delta y_{t-1}$, and $z_{t-2} = z_{t-1} - \Delta z_{t-1}$. Substitute this into the equation to get

$$\Delta y_t = (a_{11} - 1) y_{t-1} + a_{12}(y_{t-1} - \Delta y_{t-1}) + a_{13} z_{t-1} + a_{14}(z_{t-1} - \Delta z_{t-1}) + e_{1t}.$$

Collecting terms gives

$$\Delta y_t = (-1 + a_{11} + a_{12}) y_{t-1} - a_{12} \Delta y_{t-1} + (a_{13} + a_{14}) z_{t-1} - a_{14} \Delta z_{t-1} + e_{1t}.$$

Performing the same operations on the second equation gives

$$\Delta z_t = (-1 + a_{21} + a_{22}) z_{t-1} - a_{22} \Delta z_{t-1} + (a_{23} + a_{24}) y_{t-1} - a_{24} \Delta y_{t-1} + e_{2t}.$$

Write the system in matrix form,

$$\Delta x_t = \Pi x_{t-1} + \sum_{i=1}^{1} \Gamma_i \Delta x_{t-i} + \varepsilon_t, \tag{12.15}$$

where

$$x_t = \begin{pmatrix} y_t \\ z_t \end{pmatrix}, \quad \Delta x_t = \begin{pmatrix} \Delta y_t \\ \Delta z_t \end{pmatrix}, \quad \Pi = \begin{pmatrix} \pi_{11} & \pi_{12} \\ \pi_{21} & \pi_{22} \end{pmatrix}, \quad \Gamma_1 = \begin{pmatrix} \gamma_{11} & \gamma_{12} \\ \gamma_{21} & \gamma_{22} \end{pmatrix},$$

with $\pi_{11} = a_{11} + a_{12} - 1$, $\pi_{12} = a_{13} + a_{14}$, $\pi_{21} = a_{23} + a_{24}$, $\pi_{22} = a_{21} + a_{22} - 1$, and $\gamma_{11} = -a_{12}$, $\gamma_{12} = -a_{14}$, $\gamma_{21} = -a_{24}$, $\gamma_{22} = -a_{22}$.

Since $x_t$ is integrated of order one, it follows that $\Delta x_t$ is integrated of order zero and therefore stationary. And, since $x_t$ is non-stationary, the variables in $x_t$ grow in two dimensions unless they share the same trend. In that case we would say that they are co-integrated and share one common trend. In the case of a $p$-dimensional system, the system can expand in $p$ dimensions, or in fewer than $p$ dimensions if variables share the same trend. Under these properties a single $y_{t-1}$ or $z_{t-1}$ cannot be correlated with $\Delta y_t$ or $\Delta z_t$. The only way for the rows of $\Pi$ to be different from zero is that $(\pi_{11} y_{t-1} + \pi_{12} z_{t-1})$ forms a stationary process, i.e. there exist non-zero parameters $\pi_{11}$ and $\pi_{12}$ (or $\pi_{21}$ and $\pi_{22}$) such that, when multiplied with the x's, a stationary relation is established. The test for this is to test for the rank of the $\Pi$ matrix, the number of independent non-zero rows in $\Pi$. A rank of zero means no co-integration. A rank of 2 in this case means that the x's are stationary, or stationary around deterministic trends if we allowed for constants in the equation. A reduced rank, which in this case is a rank equal to 1, implies co-integration.
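The algebra above can be checked numerically. The following sketch uses arbitrary, hypothetical values for the $a_{ij}$ coefficients, simulates the two-variable VAR(2) in levels, and verifies that the error correction form reproduces exactly the same first differences.

```python
import numpy as np

# Hypothetical VAR(2) coefficients, named as in the two equations above.
a11, a12, a13, a14 = 0.6, 0.3, 0.2, -0.1
a21, a22, a23, a24 = 0.5, 0.4, 0.1, 0.0

# Levels form for x_t = (y_t, z_t)': x_t = A1 x_{t-1} + A2 x_{t-2} + e_t.
A1 = np.array([[a11, a13], [a23, a21]])
A2 = np.array([[a12, a14], [a24, a22]])

# Error correction form: dx_t = Pi x_{t-1} + Gamma1 dx_{t-1} + e_t.
Pi = np.array([[a11 + a12 - 1, a13 + a14],
               [a23 + a24, a21 + a22 - 1]])
Gamma1 = np.array([[-a12, -a14], [-a24, -a22]])

rng = np.random.default_rng(2)   # assumed seed
T = 50
e = rng.standard_normal((T, 2))
x = np.zeros((T, 2))
for t in range(2, T):
    x[t] = A1 @ x[t - 1] + A2 @ x[t - 2] + e[t]

dx = np.diff(x, axis=0)          # dx[j] = x[j+1] - x[j]
lhs = dx[1:]                     # dx_t for t = 2, ..., T-1
rhs = x[1:-1] @ Pi.T + dx[:-1] @ Gamma1.T + e[2:]
max_gap = np.abs(lhs - rhs).max()
rank_pi = int(np.linalg.matrix_rank(Pi))
print(f"largest discrepancy: {max_gap:.2e}, rank(Pi) = {rank_pi}")
```

With these particular values the rows of $\Pi$ are proportional, so the example system has reduced rank and one co-integrating relation.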
Co-integration will imply that at least one $\alpha$ parameter will be significant, so there will be (long-run) Granger causality in at least one direction. At least one variable must follow the other for them to stay together in fixed formation in the long run.

Johansen's test is better than the two-step procedure in almost all respects. The practical problems originate from choosing a correct combination of lags and dummy variables to make the residuals come out as white noise. In a limited sample this can be difficult, and the results might change among different specifications of the system, just as they do in the two-step procedure. It is recommended to start with the two-step procedure, to "learn" about the data and get some preliminary results, instead of getting stuck with the Johansen test, having problems finding a specification that leads to economically interesting results.

13. INTEGRATED VARIABLES AND COMMON TRENDS

This chapter looks at the common trends approach and some of the economics behind co-integration, for instance the question of creating positive or negative shocks in stabilization policy. An important characteristic of integrated variables is that they become stationary after differencing. The definition of an integrated series is: a series, or a vector of series, $y_t$ with no deterministic component, which has a stationary, invertible ARMA representation after differencing $d$ times, is said to be integrated of order $d$, denoted $y_t \sim I(d)$. It is possible to have variables driven by both stochastic and deterministic trends. In the very long run a deterministic trend will always dominate over a stochastic trend.
In a limited sample, however, it becomes an empirical question whether the deterministic trend is sufficiently strong to have an effect on the distributions of the estimates of the model.1

We know, from the Wold representation theorem, that if $y_t$ is I(0), and has no deterministic process, it can be written as an infinite moving average process. (If the series has a deterministic process this can be removed before solving for the MA process.)

$$y_t = C(L)\varepsilon_t, \tag{13.1}$$

where $L$ is the lag operator, and $\varepsilon_t \sim iid(0, \sigma^2)$. Now, suppose that $y_t$ is I(1); then its first difference is stationary and has an infinite MA process,

$$\Delta y_t = C(L)\varepsilon_t. \tag{13.2}$$

Under the assumption that $\varepsilon_t \sim iid(0, \sigma^2)$, we have also that

$$y_t = \Delta^{-1} C(L)\varepsilon_t = [1/(1-L)]\,C(L)\varepsilon_t, \tag{13.3}$$

where $1/(1-L)$ represents the sum of an infinite series. For a limited sample, we get approximately

$$y_t = y_0 + (1 + L + L^2 + \dots + L^{t-1})\,C(L)\varepsilon_t, \tag{13.4}$$

where $y_0$ is the initial value of the process, seen as a deterministic component conditional on everything known at time zero. The long-run solution of this expression, setting $L = 1$, gives $tC(1)$, and

$$y_t = y_0 + t\,C(1)\varepsilon_t. \tag{13.5}$$

Unless $C(1) = 0$, this process will grow infinitely large as $t \to \infty$. Looking at the second difference of $y_t \sim I(1)$ leads to

$$\Delta^2 y_t = (1-L)\,C(L)\varepsilon_t, \tag{13.6}$$

where $\Delta = (1-L)$ is applied to both sides of the expression. This series has no long-run MA solution, irrespective of whether $C(1) \neq 0$, since setting $L = 1$ gives $(1-L)C(L) = (1-1)C(1) = 0$.

Let us see what happens with the process in the future. From above we get the MA representation for some future period $t+h$,

$$y_{t+h} = y_0 + \sum_{i=1}^{t+h} \left[ \sum_{j=0}^{t+h-i} C_j \right] \varepsilon_i \tag{13.7}$$

$$\phantom{y_{t+h}} = y_0 + \sum_{i=1}^{t} \left[ \sum_{j=0}^{t+h-i} C_j \right] \varepsilon_i + \sum_{i=t+1}^{t+h} \left[ \sum_{j=0}^{t+h-i} C_j \right] \varepsilon_i. \tag{13.8}$$

The forecasts are decomposed into what is known at time $t$, the first double sum, and what is going to happen between $t$ and $t+h$.

Footnote 1: See Nelson and Plosser (1982) for a discussion about the proper way to model the trend in economic time series.
The latter is unknown at time $t$; therefore we have to form the conditional forecast of $y_{t+h}$ at time $t$,

$$y_{t+h|t} = y_0 + \sum_{i=1}^{t} \left[ \sum_{j=0}^{t+h-i} C_j \right] \varepsilon_i.$$

The effect of a shock today (at time $t$) on future periods is found by taking the derivative of the above expression with respect to a change in $\varepsilon_t$,

$$\partial y_{t+h|t} / \partial \varepsilon_t = \sum_{j=0}^{h} C_j \to C(1) \quad \text{as } h \to \infty. \tag{13.9}$$

Thus, the long-run effect of a shock today can be expressed by the static long-run solution of the MA representation of $\Delta y_t$ (equal to the sum of the MA coefficients). The persistence of a shock depends on the value of $C(1)$. If $C(1)$ happens to be 0, there is no long-run effect of today's shock. Otherwise we have three cases: $C(1)$ lies between 0 and 1, $C(1) = 1$, or $C(1)$ is greater than unity. If $C(1)$ is greater than 0 but less than unity, the shock will die out in the future. If $C(1) = 1$, the integrated variables (unit roots) case, a shock will be as important today as it is for all future periods. Finally, if $C(1)$ is greater than one (explosive roots) the shock magnifies into the future, and we have an unstable system. If the series are truly I(1), spectral analysis can be applied to exactly measure the persistence of a shock.

The persistence of shocks has interesting implications for economic policy. If shocks are very persistent, or explosive in some cases, it might be a good policy to try to avoid negative shocks, but create positive shocks. In our stabilization policy example, this can be understood as saying that the authorities should be careful with deflationary policies, for instance, since they might result in high and persistent social costs; see Mankiw and Shapiro (198x) for a discussion of these issues.

In the following, the MA representation of systems of integrated processes is analysed. For this purpose let $y_t$ be a vector of I(1) variables.
Using the lag operator as $L = 1 - (1 - L) = [1 - \Delta]$,2 and Wold's decomposition theorem gives

$$\Delta y_t = C(L)\varepsilon_t = C(1)\varepsilon_t + C^*(L)\Delta\varepsilon_t. \tag{13.10}$$

If $y_t$ is a vector of I(1) variables, then we know from above that if the matrix $C(1) \neq 0$, any shock to the series has infinite effects on future levels of $y_t$. Let us consider a linear combination of these variables, $\beta' y_t = z_t$. Multiplication of the expression with $\beta'$ gives

$$\Delta z_t = \beta' C(1)\varepsilon_t + \beta' C^*(L)\Delta\varepsilon_t. \tag{13.11}$$

In general, it is the case that when $y_t$ is I(1), $z_t$ is I(1) as well. Thus a linear combination of integrated variables will also be integrated, which implies that $\beta' C(1)$ is different from zero. Suppose, however, that there exists a matrix $\beta'$ such that $\beta' C(1) = 0$, which implies that when $\beta'$ is multiplied with $y_t$ we get a stationary process, $\beta' y_t \sim I(0)$.

Footnote 2: This is the same as $y_t = \Delta y_t + y_{t-1} = [(1 - L) + L]y_t$.

As an example, consider private aggregated consumption and private aggregated (disposable) income. Both variables could be random walks, but what about the difference between these variables? Is it reasonable to assume that a linear combination of them could be driven by a stochastic trend, meaning that consumption would deviate permanently from income in the long run? The answer is no. In the long run it is not likely that a person consumes more than his or her income, nor is it likely that a person will save more and more. Thus, we have to think of situations when two variables co-integrate, that is, when a linear combination of I(1) series forms a new stationary series, integrated of order zero. (A more formal definition of co-integration is given in a following section.)

In terms of the $C(1)$ matrix, common trends or co-integration implies that there exists a matrix $\beta$ such that $\beta' C(1) = 0$; hence we get

$$z_t = \beta' C^*(L)\varepsilon_t, \tag{13.12}$$

where $z_t$ is integrated of order zero, when $y_t$ consists of variables integrated of order one.
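The condition $\beta' C(1) = 0$ is a statement about the left null space of $C(1)$, which can be computed directly. In the sketch below the impact matrix is built as a hypothetical rank-one product `A @ J`, the stationary part is simplified to $C^*(L) = I$, and the seed and sample size are arbitrary; none of these numbers come from the text.

```python
import numpy as np

A = np.array([[1.0], [0.8]])   # hypothetical loadings on one common trend
J = np.array([[0.5, 0.5]])     # hypothetical weights of the shocks in the trend
C1 = A @ J                     # long-run impact matrix C(1), rank 1 by construction

rank = int(np.linalg.matrix_rank(C1))
U, s, Vt = np.linalg.svd(C1)
beta = U[:, rank:]             # basis of the left null space: beta' C(1) = 0
beta = beta / beta[0, 0]       # normalise on the first variable

rng = np.random.default_rng(4)               # assumed seed
T = 2000
eps = rng.standard_normal((T, 2))
# Levels: accumulated shocks through C(1) plus a stationary part (here C*(L) = I).
y = np.cumsum(eps @ C1.T, axis=0) + eps
z = y @ beta[:, 0]             # co-integrating combination beta' y_t

print("beta' =", beta[:, 0])
print("beta' C(1) =", beta.T @ C1)
print(f"variance of beta' y_t: {z.var():.2f}")
```

The combination $\beta' y_t$ loses the accumulated shocks entirely, so its variance stays bounded however long the sample is.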
The mathematical condition for having a vector $\beta$ such that $\beta' C(1) = 0$ is that $C(1)$ has reduced rank. There must be at least one row representing the long run that can be solved from the other long-run relations in $C(1)$. If $C(1)$ has reduced rank, there can be several $\beta'$-vectors that lead to $\beta' C(1) = 0$. We can express this as follows: any vector lying in the null space of $C(1)$ is a co-integrating vector, and the co-integrating rank is the dimension of this null space.

Say that $y_t$ is a vector of $n$ variables. If all variables are non-stationary, and integrated of order one, the whole system could expand in $n$ different directions. If some or all variables share the same trend in the long run, the system would be expanding in fewer than $n$ dimensions.

How should we understand the reduced rank of $C(1)$ in economic terms? Think of consumption and income again. If both series are I(1), constantly growing in the long run, the difference between them should be stationary in the long run. In other words they must have a common (stochastic) trend, which they both follow in the long run. This common trend could be understood as given by technological growth, which leads to growth in income and thereby also to long-run growth in consumption. Another way of expressing the same thing is to say that the common trend represents the cumulation of past technology shocks.

Stock and Watson (1988) modeled the common trends representation of $y_t$ in the following way. Starting from $\Delta y_t = C(1)\varepsilon_t + C^*(L)\Delta\varepsilon_t$, the level of $y_t$ is determined by

$$y_t = y_0 + (1 + L + L^2 + \dots + L^{t-1})\,[C(1) + (1-L)C^*(L)]\,\varepsilon_t \tag{13.13}$$

$$\phantom{y_t} = y_0 + C(1)(1 + L + L^2 + \dots + L^{t-1})\,\varepsilon_t + C^*(L)\varepsilon_t. \tag{13.14}$$

If we have co-integration, and therefore common trends, $C(1)$ must be of reduced rank. The matrix $C(1)$ can be thought of as consisting of two sub-matrices, such that $C(1) = AJ$, where $J$ is defined through

$$\tau_t = \tau_{t-1} + J\varepsilon_t = (1 + L + L^2 + \dots + L^{t-1})\,J\varepsilon_t. \tag{13.15}$$

The variable $\tau_t$ represents the common trends, modelled as random walks.
Setting the initial condition $\tau_0 = 0$, the level of $y_t$ is solved as

$$y_t = y_0 + A\tau_t + C^*(L)\varepsilon_t, \tag{13.16}$$

which shows that $y_t$ is driven by the common trends representation $A\tau_t$. It can also be shown, since $C(1) = AJ$, that $\beta' A = 0$, which implies that the co-integrating linear combinations of $y_t$ have no common trends.

In terms of the $C(1)$ matrix, we can talk about two types of shocks. The first type of shocks decline over time, so that the variables in the system return to their equilibrium relation. These shocks are driven by the co-integrating vectors. The second type of shocks are those which move the whole system over time without affecting the long-run equilibrium. These shocks are the common trends of the system. Co-integration and common trends have interesting implications for econometric model building and inference on dynamic models. For an econometrician, however, the MA representation is not always the easiest way to approach the concept of co-integration and stationary long-run relations.

14. A DEEPER LOOK AT JOHANSEN'S TEST

Earlier we looked at the moving average representation of a vector of integrated variables. This took us to a definition of common trends in a system of variables with stochastic trends. For an economist it is usually more interesting to analyse a system in autoregressive format. By looking at the VAR representation we get a definition of co-integration, or long-run steady state relations, among the variables.

Let the process $\{y_t\}$ be represented by the following $k$-th order vector autoregressive (VAR) model, consisting of $p$ variables,

$$A(L) y_t = \Phi D_t + \varepsilon_t, \tag{14.1}$$

where $D_t$ is a vector of deterministic variables, including dummies and constants. The error term is $\varepsilon_t \sim NID(0, \Sigma)$, and $A(L) = \sum_{j=0}^{k} A_j L^j$ where $A_0 = I$. Thus, we are assuming that the process is conditionally multivariate normal, $y_t \sim N(\mu_t, \Sigma)$, with conditional mean $\mu_t = -A_1 y_{t-1} - \dots - A_k y_{t-k} + \Phi D_t$ and positive definite error covariance matrix $\Sigma$. The system can be rewritten in error correction form, using the definition of the difference operator, $\Delta y_t \equiv y_t - y_{t-1}$,

$$\Delta y_t = \sum_{i=1}^{k-1} \Gamma_i \Delta y_{t-i} + \Pi y_{t-k} + \Phi D_t + \varepsilon_t, \tag{14.2}$$

where

$$\Gamma_i = -\Big(I + \sum_{j=1}^{i} A_j\Big), \tag{14.3}$$

and

$$\Pi = -\Big(I + \sum_{j=1}^{k} A_j\Big) = -A(1). \tag{14.4}$$

Notice that in this example the system was rewritten such that the variables in levels ($y_{t-k}$) ended up at the $k$-th lag. As an alternative it is possible to rewrite the system such that the levels enter the expression at the first lag, followed by $k-1$ lags of $\Delta y_{t-i}$. The two ways of rewriting the system are equivalent. The preferred form depends on one's preferences.

Since $y_t$ is integrated of order one and $\Delta y_t$ is stationary, it follows that there can be at most $p-1$ steady state relationships between the non-stationary variables in $y_t$. Hence, $p-1$ is the largest possible number of linearly independent rows in the $\Pi$-matrix. The latter is determined by the number of significant eigenvalues $\hat\lambda$ in the estimated matrix $\hat\Pi = -\hat A(1)$. Let $r$ be the rank of $\Pi$; then $rank(\Pi) = 0$ implies that there are no combinations of variables that lead to stationarity. In other words, there is no co-integration. If we have $rank(\Pi) = p$, the $\Pi$ matrix is said to have full rank, and all variables in $y_t$ must be stationary. Finally, reduced rank, $0 < r < p$, means that there are $r$ co-integrating vectors in the system.

Once a reduced rank has been determined, the matrix can be written as $\Pi = \alpha\beta'$, where $\beta' y_t$ represents the vectors of co-integrating relations, and $\alpha$ is a matrix of adjustment coefficients measuring the strength by which each co-integrating vector affects an element of $\Delta y_t$. Whether the co-integrating vectors $\beta' y_t$ are referred to as error correction mechanisms, steady state relations, long-run equilibrium solutions or desired values is a question of how one views the underlying economic mechanisms.
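The factorisation $\Pi = \alpha\beta'$ can be illustrated for a known matrix of rank one. The sketch below uses hypothetical values for $\alpha$ and $\beta$ in a three-variable system and recovers a factorisation from the singular value decomposition; the normalisation step is written for the case $r = 1$ only. The factorisation is not unique, since rescaling $\beta$ and $\alpha$ in opposite directions leaves $\Pi$ unchanged.

```python
import numpy as np

# Hypothetical adjustment coefficients and co-integrating vector (p = 3, r = 1).
alpha_true = np.array([[-0.3], [0.1], [0.0]])
beta_true = np.array([[1.0], [-1.0], [0.5]])
Pi = alpha_true @ beta_true.T

r = int(np.linalg.matrix_rank(Pi))

# A rank-r factorisation from the SVD: Pi = (U_r * s_r) V_r'.
U, s, Vt = np.linalg.svd(Pi)
alpha = U[:, :r] * s[:r]
beta = Vt[:r].T

# Normalise beta on the first variable (valid for r = 1) and rescale alpha
# in the opposite direction so that the product alpha beta' is unchanged.
scale = beta[0, 0]
beta_n = beta / scale
alpha_n = alpha * scale

print("rank:", r)
print("beta' =", beta_n.ravel())
print("alpha' =", alpha_n.ravel())
```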
Given estimates of the eigenvalues of $\Pi$, and of $\alpha$ and $\beta$, it becomes possible to impose various restrictions on the parameter vectors to test homogeneity conditions in the $\beta$-vectors, how $\beta' y_t$ affects $\Delta y_t$, or a more general hypothesis regarding which combinations of variables form stationary vectors. The tests are performed by comparing changes in the estimated eigenvalues from the unrestricted reduced rank estimate of $\Pi$ with the outcome of a restricted estimation.

In Johansen (1988) it is shown how to estimate the $\alpha$ and the $\beta$ vectors in the $\Pi$ matrix, given that the latter has reduced rank. The solution starts from conditioning out the short-run dynamics, as well as the effects of the dummy variables, on $\Delta y_t$ and $y_{t-k}$ respectively,

$$\Delta y_t = \sum_{i=1}^{k-1} \Gamma_{1,i} \Delta y_{t-i} + \Phi_1 D_t + R_{1t}, \tag{14.5}$$

$$y_{t-k} = \sum_{i=1}^{k-1} \Gamma_{2,i} \Delta y_{t-i} + \Phi_2 D_t + R_{kt}. \tag{14.6}$$

The system in 14.1 can now be written in terms of the residuals above as

$$R_{1t} = \alpha\beta' R_{kt} + e_t. \tag{14.7}$$

The vectors $\alpha$ and $\beta$ can now be estimated by forming the product moment matrices $S_{11}$, $S_{kk}$ and $S_{1k}$ from the residuals $R_{1t}$ and $R_{kt}$,

$$S_{ij} = T^{-1} \sum_{t=1}^{T} R_{it} R_{jt}', \quad i, j = 1, k. \tag{14.8}$$

For fixed $\beta$ vectors, $\alpha$ is given by $\hat\alpha(\beta) = S_{1k}\beta(\beta' S_{kk}\beta)^{-1}$, and the sums of squares function by $\hat\Sigma(\beta) = S_{11} - \hat\alpha(\beta)(\beta' S_{kk}\beta)\hat\alpha(\beta)'$. Minimizing the determinant of this sums of squares function leads to maximum likelihood estimates of $\alpha$ and $\beta$. The estimates of $\beta$ are found after solving the eigenvalue problem,

$$|\lambda S_{kk} - S_{k1} S_{11}^{-1} S_{1k}| = 0, \tag{14.9}$$

where $\lambda$ is a vector of eigenvalues. The solution leads to estimates of the eigenvalues ($\hat\lambda_1, \hat\lambda_2, \dots, \hat\lambda_p$), and the corresponding eigenvectors $\hat V = (\hat v_1, \hat v_2, \dots, \hat v_p)$, normalized by the product moment matrix of the level residuals from equation 14.7 such that $\hat V' S_{kk} \hat V = I$. The size of the eigenvalues ($\lambda_i$) tells us how much each linear combination of eigenvectors and variables, $v_i' y_t$, is correlated with the conditional process $R_{1t}$ ($\Delta y_t \mid \Delta y_{t-i}, D$).
The number of non-zero eigenvalues ($r$) determines the rank of $\Pi$ and leads to the co-integrating vectors of the system, while the number of zero eigenvalues ($p - r$) defines the common trends in the system. These are the combinations $v_i' y_t$ that determine the directions in which the process is non-stationary. Given that 14.1 is a well-defined statistical model, it is possible to determine the distribution of the estimated eigenvalues under different assumptions about the number of co-integrating vectors in the model. The distributions of the eigenvalues depend not only on 14.1 being a well-defined statistical model, but also on the number of variables, the inclusion of constant terms in the co-integrating vectors, and deterministic trends in the equations. Distributions for different models are tabulated in Johansen (1995).

The maximized log likelihood, conditional on the short-run dynamics and the deterministic variables of the model, is

$$\ln L = \text{constant} - (T/2)\ln|S_{11}| - (T/2)\sum_{i=1}^{r} \ln(1 - \hat\lambda_i). \tag{14.10}$$

From this expression two likelihood ratio tests for determining the number of non-zero eigenvalues are formulated. The first test concerns the hypothesis that the number of non-zero eigenvalues is less than or equal to some given number ($q$), such that $H_0: r \leq q$, against an unrestricted model where $H_1: r \leq p$. The test is given by

$$-2\ln(Q; q \mid p) = -T \sum_{i=q+1}^{p} \ln(1 - \hat\lambda_i). \tag{14.11}$$

The second test is used for the hypothesis that the number of non-zero eigenvalues is at most $q$ against the alternative of exactly one more, $H_0: r \leq q$ against $H_1: r = q + 1$, and is given by

$$-2\ln(Q; q \mid q+1) = -T \ln(1 - \hat\lambda_{q+1}). \tag{14.12}$$

Both tests follow non-standard distributions which depend on the number of variables in the system ($p$), and on the presence of trends and constant terms. The estimates of $\beta$ associated with the $r$ non-zero eigenvalue estimates $\hat\lambda_i$ are given by the corresponding eigenvectors, such that $\hat\beta = (\hat v_1, \hat v_2, \dots, \hat v_r)$.
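The eigenvalue problem and the two likelihood ratio statistics can be sketched for the simplest possible case. The code below assumes a VAR(1) without deterministic terms, so there are no short-run dynamics to condition out and the residuals are simply $R_{1t} = \Delta y_t$ and $R_{kt} = y_{t-1}$. The function name `johansen_var1` and the simulated system are hypothetical, and the statistics must still be compared with the non-standard critical values referred to in the text.

```python
import numpy as np

def johansen_var1(y):
    """Eigenvalues, normalised eigenvectors, trace and max-eigenvalue
    statistics for a VAR(1) in levels without deterministic terms."""
    r1 = np.diff(y, axis=0)          # R_1t = first differences
    rk = y[:-1]                      # R_kt = lagged levels
    T = r1.shape[0]
    S11 = r1.T @ r1 / T
    S1k = r1.T @ rk / T
    Skk = rk.T @ rk / T
    # Symmetrise |lam*Skk - Sk1 S11^{-1} S1k| = 0 with Skk = L L'.
    L = np.linalg.cholesky(Skk)
    Linv = np.linalg.inv(L)
    M = Linv @ S1k.T @ np.linalg.solve(S11, S1k) @ Linv.T
    lam, U = np.linalg.eigh(M)
    order = np.argsort(lam)[::-1]    # largest eigenvalue first
    lam, U = lam[order], U[:, order]
    V = Linv.T @ U                   # eigenvectors with V' Skk V = I
    trace = np.array([-T * np.log(1 - lam[q:]).sum() for q in range(len(lam))])
    max_eig = -T * np.log(1 - lam)
    return lam, V, trace, max_eig, Skk

# Simulate a hypothetical bivariate system with one co-integrating vector.
rng = np.random.default_rng(3)       # assumed seed
T = 400
alpha = np.array([-0.4, 0.2])
beta = np.array([1.0, -1.0])
Pi = np.outer(alpha, beta)
y = np.zeros((T + 1, 2))
for t in range(1, T + 1):
    y[t] = y[t - 1] + Pi @ y[t - 1] + rng.standard_normal(2)

lam, V, trace, max_eig, Skk = johansen_var1(y)
beta_hat = V[:, 0] / V[0, 0]         # first eigenvector, normalised on y_1
print("eigenvalues:", lam)
print("trace statistics for q = 0, 1:", trace)
print("normalised beta estimate:", beta_hat)
```

The Cholesky step is only a numerical device to turn the generalised eigenvalue problem into an ordinary symmetric one; it also delivers eigenvectors that satisfy the normalisation $V' S_{kk} V = I$.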
Based on $\hat\beta$ the $\alpha$-vectors can be solved by

$$\hat\alpha = S_{1k}\hat\beta(\hat\beta' S_{kk} \hat\beta)^{-1}. \tag{14.13}$$

The estimated matrix $\Pi = \alpha\beta'$ is not identified, in the sense that we can pick any non-singular matrix $M$ ($r \times r$), so that $\alpha M M^{-1} \beta' = \alpha\beta' = \Pi$. There is no unique solution for the co-integrating vectors. The identification, explaining the economic meaning of the co-integrating vectors, is something that the econometrician must impose on the estimates: first by normalizing each $\beta$-vector around a variable, and then by testing different assumptions about the vector. By looking at the signs and relative sizes of the $\hat\alpha$-parameters, it is in general possible to find an appropriate normalization of the $\beta$-vectors such that the outcome can be understood in terms of error correction mechanisms or long-run equilibrium relationships between economic variables. Assumptions concerning the sizes and relative signs of the parameters can be tested by comparing an unrestricted maximization with one where the restrictions have been imposed.

Furthermore, to rule out the cases where $y_t$ is integrated of order 2, we must require that the matrix $\alpha_\perp' \Gamma \beta_\perp$ has full rank, where $\Gamma$ is the mean lag matrix of $A(L)$ evaluated at unity, and $\alpha_\perp$ and $\beta_\perp$ are the orthogonal matrices to $\alpha$ and $\beta$ such that $\alpha' \alpha_\perp = \beta' \beta_\perp = 0$.

The system in 14.1 also has a moving average form given by

$$\Delta y_t = C(L)(\varepsilon_t + \mu + \Phi D_t). \tag{14.14}$$

Expression 14.14 can be compared with

$$\Delta z_t = \beta' C(1)\varepsilon_t + \beta' C^*(L)\Delta\varepsilon_t \tag{14.15}$$

from the previous chapter. Since $C(L)$ can be expanded as $C(L) = C(1) + (1 - L)C^*(L)$, when $y_t$ is integrated of order one we get

$$y_t = y_0 + C(1)\sum_{i=1}^{t}\varepsilon_i + C(1)\mu t + C(1)\sum_{i=1}^{t}\Phi D_i + C^*(L)(\varepsilon_t + \Phi D_t), \tag{14.16}$$

where $C^*(L) = [C(L) - C(1)](1 - L)^{-1}$. The impact matrix $C(1)$ shows how the non-stationary part of $y_t$ is generated from the underlying stochastic and deterministic trends. The link between the MA and the autoregressive form is shown in Johansen (1991), and is given by

$$C(1) = \beta_\perp(\alpha_\perp' \Gamma \beta_\perp)^{-1}\alpha_\perp', \tag{14.17}$$

where $\alpha_\perp$ and $\beta_\perp$ are the orthogonal vectors of $\alpha$ and $\beta$ respectively.1 Equation 14.17 can be used to estimate $C(1)$ from given estimates of $\alpha$ and $\beta$. But, since the error terms ($\varepsilon_{i,t}$) in the reduced form are correlated, the estimate of $C(1)$ is not invariant to different ways of conditioning on current variables ($\Delta y_t$). Given this limitation and the assumption that the driving trends should not be affected by the equilibrium forces, the common trends in the system are represented by $\alpha_\perp' y_t$ or alternatively by $\alpha_\perp' \sum \varepsilon_{it}$; see Juselius (1992).

The test procedure can be extended to incorporate variables integrated of order 2 as well. With both I(1) and I(2) processes in the system, two new co-integrating relations are possible. There can be combinations of I(2) variables forming stationary I(0) vectors, or I(2) variables forming non-stationary I(1) vectors which in turn co-integrate with I(1) variables to form stationary vectors. The error correction system in 14.1 can be written as

$$\Delta^2 y_t = \Gamma \Delta y_{t-1} + \Pi y_{t-2} + \sum_{i=1}^{k-2} \Psi_i \Delta^2 y_{t-i} + \Phi D_t + \varepsilon_t, \tag{14.18}$$

where $\Gamma = \sum_{i=1}^{k-1}\Gamma_i - I + \Pi$, $\Psi_i = -\sum_{j=i+1}^{k-1}\Gamma_j$, and $i = 1, \dots, k-2$ (with the $\Gamma_i$ taken from the error correction form in which the levels enter at the first lag). If $y_t$ is I(2) and $\Delta y_t$ is I(1), a reduced rank condition for the $\Pi$ matrix must be combined with a reduced rank condition for the matrix of first differences as well. Johansen (1991) shows that the condition for an I(2) process is

$$\alpha_\perp' \Gamma \beta_\perp = \varphi\eta', \tag{14.19}$$

where $\varphi$ and $\eta$ are $(p - r) \times s$, with rank $s$. With I(2) variables, $\beta' y_t$ is I(1). To make these vectors stationary they have to be combined with the vectors of first differences ($\delta\beta_\perp^{2\prime} \Delta y_t$) to form stationary processes. In the latter expression $\beta_\perp^2$ denotes the squared orthogonal vectors, and $\delta = (\alpha'\alpha)^{-1}\alpha' \Gamma \beta_\perp^2(\beta_\perp^{2\prime}\beta_\perp^2)^{-1}$. The squared orthogonal vectors indicate which variables are I(2).

An I(2) model is estimated in a way similar to the I(1) model. Maximum likelihood estimation is feasible since the residual terms of an I(2) model can be assumed to be a Gaussian process.
The first step is to perform a reduced rank regression for the I(1) model of $\Delta y_t$ on $y_{t-1}$, corrected for the short run dynamics ($\Delta y_{t-1}, \ldots, \Delta y_{t-k+1}$) and the deterministic components ($D_t$). This leads to estimates of $\hat r$, $\hat\alpha$ and $\hat\beta$. In the second step, given the estimates of $\hat r$, $\hat\alpha$ and $\hat\beta$, a reduced rank test is performed of $\hat\alpha_\perp'\Delta^2 y_t$ on $\hat\beta_\perp'\Delta y_{t-1}$, corrected for $\Delta^2 y_{t-1}, \ldots, \Delta^2 y_{t-k+2}$, and the constant terms. This leads to the estimates $\hat s$, $\hat\varphi$, and $\hat\eta$.

An I(2) process is harder to analyze in economic terms since the parameters and the test hypotheses have different interpretations. The tests concerning the $\beta$-vectors are still valid, but are in general only valid for I(1) processes. It is, however, possible to form stationary relations by combining levels ($\hat\beta' y_t$) with first difference expressions ($\hat\delta\hat\beta_\perp^{2\prime}\Delta y_t$). The practical solution is to identify the I(2) terms and find ways of transforming them to I(1) relations. The transformation to an I(1) system can be done by taking first differences of I(2) variables or by taking ratios of variables, for example by modeling the real money stock rather than the money stock and the price level separately.

Footnote 1: An orthogonal vector is often indicated by the sign $\perp$ attached to the original vector. The vector $\beta_\perp$ is the orthogonal vector to the vector $\beta$ if $\beta_\perp'\beta = 0$.

15. THE ESTIMATION OF DYNAMIC MODELS

(To be completed...)

The modelling of stochastic difference models introduces some problems which clearly violate the assumptions behind the classical linear regression model. With some care most of these problems can be solved. The most important factors are whether the data series are stationary, and whether the residuals are white noise. As long as the variables are stationary and the residual is a white noise process, OLS estimation is generally feasible. Autocorrelated residuals, however, mean that the OLS estimator is no longer consistent when the model contains lagged dependent variables (see Section 15.5).
In this situation the model must either be re-specified, or the whole model, including the autoregressive process in the residuals, must be estimated by maximum likelihood.

To understand the differences between the estimation of stochastic difference equations and the classical linear regression model, we will introduce these differences step by step. In all there are 6 models of interest here:

1 - The classical linear regression model.
2 - Regression with deterministic trends.
3 - Models with stochastic explanatory variables.
4 - Autoregressive models, lagged dependent variable.
5 - Autoregressive models with integrated variables (Testing for unit roots).
6 - Regression models with integrated variables (Spurious regression and cointegration).

The following sections do not present any rigorous proofs concerning the properties of the OLS estimator. The aim is only to review known problems and introduce some new ones.

15.1 Deterministic Explanatory Variables (The Classical Linear Regression model)

Starting with $y_t = \beta x_t + \varepsilon_t$, the matrix form of this model is

$$y = X\beta + \varepsilon, \tag{15.1}$$

where $y$ is a vector of $T$ observations of $y_t$, $X$ a matrix of explanatory variables, $\beta$ the parameters and $\varepsilon$ a vector of residuals of the same dimension as $y$. (To keep the example simple, $\beta$ is one parameter, but the example could be extended to a multivariate case.) The classical case builds upon four assumptions. First, the model is linear, or log linear, in variables. Second, the residuals are independent, have a mean of zero and a finite variance,

$$E(\varepsilon) = 0; \quad Var(\varepsilon) = \sigma^2 I. \tag{15.2}$$

This is basically a statement of correct specification. The model should be set up in such a way that the expected value of the residuals is zero.
Third, the explanatory variables are non-stochastic and therefore independent of the errors,

$$E(X'\varepsilon) = 0. \tag{15.3}$$

Finally, the explanatory variables are linearly independent such that

$$rank(X'X) = rank(X) = k, \tag{15.4}$$

which ensures that the inverse of $(X'X)$ exists. Minimizing the sum of squared residuals leads to the following OLS estimator of $\beta$:

$$\hat\beta = (X'X)^{-1}(X'y) = \beta + (X'X)^{-1}(X'\varepsilon). \tag{15.5}$$

If we simplify the model to one parameter ($\beta$) and one explanatory variable ($x_t$), we have for a sample of $T$ observations,

$$\hat\beta = \beta + \left[\frac{1}{T}\sum_{t=1}^{T} x_t^2\right]^{-1}\left[\frac{1}{T}\sum_{t=1}^{T} x_t\varepsilon_t\right]. \tag{15.6}$$

The estimated parameter $\hat\beta$ is equal to its true value plus an additional term. For the estimate to be unbiased the expected value of the last term must be zero. If we assume that the $x$'s are deterministic the problem is relatively easy. A correct specification of the model, $E(\varepsilon) = 0$, leads to the result that $\hat\beta$ is unbiased. The parameter has the variance

$$Var(\hat\beta) = E[(\hat\beta - \beta)(\hat\beta - \beta)'] = (x'x)^{-1}x'(\sigma^2 I)x(x'x)^{-1} = \sigma^2(x'x)^{-1}. \tag{15.7}$$

Taking expectations of $\hat\beta$, under the assumptions above, verifies that $\hat\beta$ is an unbiased estimate of $\beta$ (see footnote 1),

$$E(\hat\beta) = E(\beta) + E[(X'X)^{-1}(X'\varepsilon)] = \beta + E(X'X)^{-1}E(X'\varepsilon), \tag{15.8}$$

where $(X'X)$ is a constant when $x_t$ is deterministic, and $E(X'\varepsilon) = X'E(\varepsilon) = 0$ if the residuals have a zero mean. Thus, under these assumptions OLS is unbiased and also consistent (not proven here). Consistency implies that $var(\hat\beta)$ tends to zero as $T \to \infty$.

The problem with assuming that the $x$'s are non-stochastic is of course that it is an unrealistic assumption in a time series setting. Typically the explanatory variables are as stochastic as the dependent variable.

So far we have not made any statements about the distribution of the estimates. OLS has the advantage that it leads to unbiased and efficient estimates under quite general assumptions. However, to make any inference on $\hat\beta$, we need to make assumptions about its distribution.
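The decomposition in 15.6, estimate equals true value plus a term driven by the residuals, can be checked on a small numerical example. This is only an illustrative sketch: the data, the value of the true parameter and the function name are made up.

```python
def ols_slope(x, y):
    """OLS slope for y_t = beta * x_t + e_t without a constant: sum(x*y) / sum(x^2)."""
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    return sxy / sxx


beta = 2.0                               # assumed true parameter
x = [1.0, 2.0, 3.0, 4.0, 5.0]            # deterministic explanatory variable
eps = [0.5, -0.5, 0.25, -0.25, 0.0]      # made-up residuals
y = [beta * xi + ei for xi, ei in zip(x, eps)]

beta_hat = ols_slope(x, y)
extra_term = ols_slope(x, eps)           # [sum x^2]^(-1) [sum x*eps], the last term in 15.6

print(beta_hat, beta + extra_term)
```

Because $x_t$ is fixed, the only random part of the extra term is the residual vector, which is why $E(\varepsilon) = 0$ is enough for unbiasedness in this case.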
In most cases the assumption of a normal distribution is reasonable, at least asymptotically, or a reasonable approximation in a limited sample, leading to

$$(\hat\beta - \beta) \sim NID(0, \sigma^2 I). \tag{15.9}$$

Thus, the limiting distribution of $\hat\beta$ is normal, and since $(\hat\beta - \beta)$ is a white noise process we know that it converges to the true sample moment with the speed given by the standard error of a white noise process, $1/T^{1/2}$.

Footnote 1: The expectation of an expectation is equal to the expectation, $E[E(\hat\beta)] = E(\hat\beta)$, and the expectation of a constant is equal to the constant. Since the true parameter can be treated as a constant we have $E(\beta) = \beta$.

15.2 The Deterministic Trend Model

A situation where the assumption of deterministic explanatory variables can make sense in a time series setting is when the dependent variable is driven by a deterministic trend (see footnote 2). Suppose that the explanatory variable is a deterministic time trend,

$$y_t = \alpha + \beta t + \varepsilon_t, \tag{15.10}$$

where $t$ is a time trend, $t = 1, 2, 3, \ldots, T$, without stochastic variation. If the time trend is adjusted for its mean, $\tilde t = (t - \bar t)$, the constant term ($\alpha$) will measure the unconditional mean of $y_t$. Under the assumption that $y_t$ has a sufficiently large deterministic trend component, with respect to the sample size, the error terms from this regression can be understood as the detrended $y_t$ series. Assume that both $y_t$ and $t$ have been corrected for their means. OLS then leads to

$$\hat\beta = \beta + \left[\frac{1}{T}\sum_{t=1}^{T}\tilde t^{\,2}\right]^{-1}\left[\frac{1}{T}\sum_{t=1}^{T}\tilde t\,\varepsilon_t\right]. \tag{15.11}$$

Taking expectations leads to the result that $\hat\beta$ is unbiased. The most important reason why this regression works well is that there is an additional $\tilde t$ variable in the denominator. As $\tilde t$ goes to infinity the denominator gets larger and larger compared to the numerator, so the ratio goes to zero much faster than otherwise.

15.3 Stochastic Explanatory Variables

Applying OLS to time series data introduces the problem of stochastic explanatory variables.
The explanatory variables can be stochastic on their own, and lags of the dependent variable imply stochastic regressors. Let the model be

$$y_t = \beta x_t + \varepsilon_t, \tag{15.12}$$

where $x_t$ is generated by the covariance stationary stochastic process $\{x_t\}_1^T$. The OLS estimator leads to

$$\hat\beta = \beta + \left[\sum_{t=1}^{T} x_t^2\right]^{-1}\left[\sum_{t=1}^{T} x_t\varepsilon_t\right]. \tag{15.13}$$

Taking expectations of the expression leads to

$$E\left[\sum_{t=1}^{T} x_t^2\right]^{-1} \text{ for the first factor and} \tag{15.14}$$

$$E\left[\sum_{t=1}^{T} x_t\varepsilon_t\right] \text{ for the second factor.} \tag{15.15}$$

In the classical linear regression case $x_t$ is assumed to be deterministic, implying that $[X'X]$ is a constant and that $E(x_t\varepsilon_t) = x_t E(\varepsilon_t) = 0$. Here $x_t$ is a random variable, so additional assumptions must be made for the OLS estimator.

Footnote 2: Other realistic examples in economics are deterministic dummy variables and deterministic seasonal components.

The necessary conditions are that $\{x_t\}_1^T$ is a stationary process and that $\{x_t\}_1^T$ and $\{\varepsilon_t\}_1^T$ are independent. The first condition means that we can view the covariance matrix $(X'X)$ as fixed in repeated samples. In a time series perspective we cannot generally talk about repeated samples; instead we have to look at the sample moments as $T \to \infty$. If $x_t$ is a stationary variable then we can state that as $T \to \infty$ the covariance matrix will become a constant. This can be written as

$$\frac{1}{T}\sum_{t=1}^{T} x_t x_t \to_p Q, \tag{15.16}$$

meaning that the expression will converge in probability to a constant $Q$ (see footnote 3). An alternative way to show the properties of OLS in the case of stochastic explanatory variables is to use the probability limit operator (plim), $\text{plim}\,[(1/T)X'X] = Q$. A convenient property of plim operators is that $\text{plim}(x^{-1}) = [\text{plim}(x)]^{-1}$. Here, it remains to look at the numerator in the OLS expression.
If $\{x_t\}_1^T$ and $\{\varepsilon_t\}_1^T$ are generated by two independent stochastic processes we have, for each pair of observations, that $E(x_t\varepsilon_t) = E(x_t)E(\varepsilon_t)$. It can then be shown that

$$\frac{1}{T}\sum_{t=1}^{T} x_t\varepsilon_t \to_p 0, \tag{15.17}$$

or, alternatively,

$$\text{plim}\,[(1/T)X'\varepsilon] = 0. \tag{15.18}$$

The intuition behind this result is that, because $\varepsilon_t$ is zero on average, we are multiplying $x_t$ with zero. It follows then that the average of $(x_t\varepsilon_t)$ will be zero. The practical implication is that given a sufficiently large sample the OLS estimator will be unbiased, efficient and consistent even when the explanatory variables are stochastic variables. If $\varepsilon_t \sim NID(0, \sigma^2)$, we also have, conditional on the stochastic process $\{x_t\}_1^T$, that the estimated $\beta$ is distributed as

$$\hat\beta \mid x_t \sim N[\beta, \sigma^2(X'X)^{-1}] \tag{15.19}$$

and

$$(\hat\beta - \beta) \mid x_t \sim N(0, \sigma^2(X'X)^{-1}), \tag{15.20}$$

making $\hat\beta$ an unbiased and consistent estimate, with a normal distribution such that standard distributions can be used for inference.

The example can be extended by two assumptions. First, let the residuals be $\varepsilon_t \sim iid(0, \sigma^2)$: they are independent and identically distributed as before, but not necessarily normal. Second, let the process $\{x_t\}_1^T$ be only covariance stationary in the long run, allowing the sample covariance to vary with time in a limited sample, $E(X'X) = (1/T)\sum^{t} x_t x_t = Q_t$. The processes $\{x_t\}_1^T$ and $\{\varepsilon_t\}_1^T$ are independent as above. Under these conditions the estimated $\beta$ is

$$\hat\beta_t = \beta + Q_t^{-1}\left[(1/T)\sum_{t=1}^{T} x_t\varepsilon_t\right]. \tag{15.21}$$

The estimated $\hat\beta_t$ can vary with $t$ since $Q_t$ varies with time. To establish that OLS is a consistent estimator we need to establish that

$$\left[(1/T)\sum_{t=1}^{T} x_t x_t\right]^{-1} = Q_t^{-1} \to_p Q^{-1}. \tag{15.22}$$

Footnote 3: In a multivariate model we would say that $Q$ converges to a matrix of constants.

The condition holds if $\{x_t\}_1^T$ is covariance stationary; as $T$ goes to infinity the estimate will converge in probability ($\to_p$) to a constant.
The second condition is that the average $(1/T)\sum_{t=1}^{T} x_t\varepsilon_t$ converges in probability to zero, which takes place whenever $x_t$ and $\varepsilon_t$ are independent. The error process is iid, but not necessarily normal. Under the conditions given here, the central limit theorem is sufficient to establish that the suitably scaled sequence $\{\sum_{t=1}^{T} x_t\varepsilon_t\}$ converges (weakly in distribution) to a normal distribution,

$$(1/T^{1/2})\sum_{t=1}^{T} x_t\varepsilon_t \to_d N(0, \sigma^2 Q), \tag{15.23}$$

so that $(\hat\beta_t - \beta)$ is asymptotically distributed as

$$T^{1/2}(\hat\beta_t - \beta) \sim N(0, \sigma^2 Q^{-1}). \tag{15.24}$$

In a limited sample the normal distribution will be an approximation. The result is necessary for using $t$, $\chi^2$ and $F$-distributions for inference on $\hat\beta$ and $\hat\sigma^2$.

To see how the last result works, recall the central limit theorem (CLT). The CLT states that for the sample mean $\bar z_T$ of an iid process with finite variance $\sigma^2$, the scaled deviation from the population mean converges weakly to a normally distributed variable as the sample size $T$ increases, so for the sequence

$$T^{1/2}(\bar z_T - \mu) \Rightarrow N(0, \sigma^2), \tag{15.25}$$

where $\mu$ is the population mean of $z_t$. From the OLS estimator we have

$$(\hat\beta_t - \beta) = \left[(1/T)\sum_{t=1}^{T} x_t x_t\right]^{-1}\left[(1/T)\sum_{t=1}^{T} x_t\varepsilon_t\right]. \tag{15.26}$$

Since $(1/T) = (1/T^{1/2})(1/T^{1/2})$ the CLT can be invoked by rewriting the expression as

$$T^{1/2}(\hat\beta_t - \beta) = \left[(1/T)\sum_{t=1}^{T} x_t x_t\right]^{-1}\left[(1/T^{1/2})\sum_{t=1}^{T} x_t\varepsilon_t\right], \tag{15.27}$$

where the LHS and the numerator on the RHS correspond to the CLT. From the numerator on the RHS we get, as $T$ goes to infinity,

$$\left[(1/T^{1/2})\sum_{t=1}^{T} x_t\varepsilon_t\right] \Rightarrow N(0, \sigma^2 Q). \tag{15.28}$$

Moreover, we can also conclude that the rate of convergence is given by $(1/T^{1/2})$. Scaling the OLS estimator by $T^{1/2}$ is exactly what is needed to obtain a non-degenerate limiting distribution, so $(1/T^{1/2})$ represents the speed by which the estimate $\hat\beta_t$ converges to its true value $\beta$.

15.4 Lagged Dependent Variables

Let us now turn to the AR(1) model,

$$y_t = \beta y_{t-1} + \varepsilon_t, \tag{15.29}$$

where $\varepsilon_t \sim iid(0, \sigma^2)$. (The estimation of AR($p$) models follows from this example in a straightforward way.)
The estimated $\beta$ is

$$\hat\beta = \left[T^{-1}\sum_{t=1}^{T} y_{t-1}^2\right]^{-1}\left[T^{-1}\sum_{t=1}^{T} y_{t-1}y_t\right], \tag{15.30}$$

leading to

$$\hat\beta - \beta = \left[T^{-1}\sum_{t=1}^{T} y_{t-1}^2\right]^{-1}\left[T^{-1}\sum_{t=1}^{T} y_{t-1}\varepsilon_t\right]. \tag{15.31}$$

This is similar to the stochastic regressor case, but here $\{y_{t-1}\}$ and $\{\varepsilon_t\}$ cannot be assumed to be independent, so the expectation of the last term cannot be factored into a product of expectations, and $\hat\beta$ can be biased in a limited sample. The dependence can be explained as follows: $y_t$ depends on $\varepsilon_t$, and $y_t$ is, through the AR(1) process, correlated with $y_{t+1}$, so $\varepsilon_t$ is correlated with $y_{t+1}$ and with all later values of the regressor. The long-run covariance (lrcov) between $y_{t-1}$ and $\varepsilon_t$ is defined as

$$lrcov(y_{t-1}, \varepsilon_t) = \left[T^{-1}\sum_{t=1}^{T} y_{t-1}\varepsilon_t\right] + \sum_{k=1}^{\infty} E(y_{t-1}\varepsilon_{t+k}) + \sum_{k=1}^{\infty} E(y_{t-1+k}\varepsilon_t), \tag{15.32}$$

where the first term on the RHS is the sample estimate of the covariance, and the last two terms capture leads and lags in the cross correlation between $y_{t-1}$ and $\varepsilon_t$. As long as $y_t$ is covariance stationary and $\varepsilon_t$ is iid, the sample estimate of the covariance will converge to its true long-run value.

This dependence from $\varepsilon_t$ to $y_{t+1}$ is not of major importance for estimation. Since $(y_{t-1}\varepsilon_t)$ is still a martingale difference sequence with respect to the history of $y_t$ and $\varepsilon_t$, we have that $E\{y_{t-1}\varepsilon_t \mid y_{t-2}, y_{t-3}, \ldots; \varepsilon_{t-1}, \varepsilon_{t-2}, \ldots\} = 0$, so it can be established in line with the CLT (see footnote 4) that

$$(1/T^{1/2})\sum_{t=1}^{T} y_{t-1}\varepsilon_t \Rightarrow N(0, \sigma^2 Q). \tag{15.33}$$

Using the same assumptions and notation as above the variance is given by $E(y_{t-1}\varepsilon_t\varepsilon_t y_{t-1}) = E(\varepsilon_t^2)E(y_{t-1}y_{t-1}) = \sigma^2 Q_t$. These results are sufficient to establish that OLS is a consistent estimator, though not necessarily unbiased in a limited sample. It follows that the distribution of the estimated $\beta$, and its rate of convergence, is as above. The results are the same for higher order stochastic difference models.

Footnote 4: This result is established by the so-called Mann-Wald theorem.

15.5 Lagged Dependent Variables and Autocorrelation

In this section we look at the AR(1) model with an autoregressive residual process. Let the error process be

$$\varepsilon_t = \rho\varepsilon_{t-1} + \nu_t, \tag{15.34}$$

where $\nu_t \sim iid(0, \sigma^2)$. In this case we get the following expression,

$$\begin{aligned}
\sum_{t=1}^{T} y_{t-1}\varepsilon_t &= \sum_{t=1}^{T}[y_{t-1}(\rho\varepsilon_{t-1} + \nu_t)] \\
&= \rho\sum_{t=1}^{T} y_{t-1}\varepsilon_{t-1} + \sum_{t=1}^{T} y_{t-1}\nu_t \\
&= \rho\sum_{t=1}^{T}[y_{t-1}(y_{t-1} - \beta y_{t-2})] + \sum_{t=1}^{T} y_{t-1}\nu_t \\
&= \rho\sum_{t=1}^{T} y_{t-1}^2 - \rho\beta\sum_{t=1}^{T} y_{t-1}y_{t-2} + \sum_{t=1}^{T} y_{t-1}\nu_t.
\end{aligned} \tag{15.35}$$

Multiplying the expression by $(1/T)$ and taking expectations gives

$$E\left[(1/T)\sum_{t=1}^{T} y_{t-1}\varepsilon_t\right] = \rho\, var(y_t) - \rho\beta\, cov(y_{t-1}, y_{t-2}) + cov(y_{t-1}, \nu_t), \tag{15.36}$$

which establishes that the OLS estimator is biased and inconsistent. Only the last covariance term can be assumed to go to zero as $T$ goes to infinity. In this situation OLS is always inconsistent (see footnote 5). Thus, the conclusion is that with a lagged dependent variable OLS is only feasible if there is no serial correlation in the residual.

There are two solutions in this situation: to respecify the equation so that the serial correlation is removed from the residual process, or to turn to an iterative ML estimation of the model ($y_t - \beta y_{t-1} - \rho\varepsilon_{t-1} = \nu_t$). The latter specification implies common factor restrictions, which if not tested is an ad hoc assumption. The approach was extremely popular in the late 70s and early 80s, when people used to rely on a priori assumptions in the form of adaptive expectations or costly adjustment, as examples, to derive their estimated models. Often economists started from a static formulation of the economic model and then added assumptions about expectations or adjustment costs. These assumptions could then lead to an infinite lag structure with white noise residuals. To estimate the model these authors called upon the so-called Koyck transformation to reduce the model to a first order autoregressive stochastic difference model, with an assumed first order serially correlated residual term.

15.6 The Problems of Dependence and the Initial Observation

An additional problem is that of dependent observations.
When we derived the estimators, in particular the MLE, we assumed that the observations are drawn from independent distributions. A basic assumption is therefore violated, because the observations in a typical time series model are dependent. The AR(1) can serve as an example,

$$x_t = a_1 x_{t-1} + \varepsilon_t, \quad \varepsilon_t \sim NID(0, \sigma^2). \tag{15.37}$$

In this model each observation of $\tilde X_t$ is dependent on the observation of $x_t$ in the previous period. How does this affect the ML estimator? Suppose the sample only consists of two observations $x_1$ and $x_2$. The joint density function for these two observations can be factorised as

$$D(x_1, x_2) = D_1(x_2 \mid x_1)D_2(x_1). \tag{15.38}$$

Footnote 5: Asymptotically, though, the estimates have normal distributions, because the long-run bias converges to a constant while the error process $\nu_t$ converges to $NID(0, \sigma^2)$. This is a result of the CLT.

Extend the sample to 3 observations and we get

$$D(x_1, x_2, x_3) = D_1(x_3 \mid x_2, x_1)D_2(x_2, x_1) \tag{15.39}$$

$$= D_1(x_3 \mid x_2, x_1)D_2(x_2 \mid x_1)D_3(x_1). \tag{15.40}$$

With three observations, we have that the joint probability density function is equal to the density function of $\tilde X_3$, conditional on $\tilde X_2$ and $\tilde X_1$, multiplied by the conditional density for $\tilde X_2$, multiplied by the marginal density for $\tilde X_1$. It follows that for a sample of $T$ observations, the likelihood function can be written as

$$L(\theta; x) = \prod_{t=2}^{T} D(x_t \mid X_{t-1}; \theta)\, f(x_1), \tag{15.41}$$

where $X_{t-1}$ represents the observations up to and including $x_{t-1}$. Now, the AR(1) model implies that the conditional density function of $x$, $D(x_t \mid x_{t-1}, \ldots, x_1)$, is normally distributed with mean $a_1 x_{t-1}$ and variance $\sigma^2$. The log likelihood function is

$$\log L(a_1, \sigma^2; x) = -[(T-1)/2]\log 2\pi - [(T-1)/2]\log\sigma^2 - (1/2\sigma^2)\sum_{t=2}^{T}(x_t - a_1 x_{t-1})^2 + \log D(x_1).$$

This looks like the expression for the MLE derived earlier, with the exception of the last term, the log likelihood for the very first observation.
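The conditional part of this log likelihood, that is, everything except $\log D(x_1)$, can be written out in a few lines. The sketch below is illustrative only: the data are made up, the function names are hypothetical, and $x_1$ is treated as fixed, which is one possible way of handling the first observation.

```python
import math


def ar1_conditional_loglik(x, a1, sigma2):
    """Log likelihood of x_2, ..., x_T given x_1 for x_t = a1 * x_{t-1} + e_t,
    with e_t ~ N(0, sigma2). The term log D(x_1) is left out."""
    n = len(x) - 1
    ssr = sum((x[t] - a1 * x[t - 1]) ** 2 for t in range(1, len(x)))
    return (-(n / 2) * math.log(2 * math.pi)
            - (n / 2) * math.log(sigma2)
            - ssr / (2 * sigma2))


def ar1_conditional_mle(x):
    """Maximizer of the conditional likelihood: OLS of x_t on x_{t-1}."""
    num = sum(x[t] * x[t - 1] for t in range(1, len(x)))
    den = sum(x[t - 1] ** 2 for t in range(1, len(x)))
    a1 = num / den
    sigma2 = sum((x[t] - a1 * x[t - 1]) ** 2 for t in range(1, len(x))) / (len(x) - 1)
    return a1, sigma2


# Made-up sample of T = 10 observations.
x = [1.0, 0.8, 0.3, 0.5, -0.2, 0.1, 0.4, 0.0, -0.3, -0.1]
a1_hat, s2_hat = ar1_conditional_mle(x)
print(a1_hat, s2_hat, ar1_conditional_loglik(x, a1_hat, s2_hat))
```

Dropping $\log D(x_1)$ is what makes the maximizer coincide with the OLS regression of $x_t$ on $x_{t-1}$.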
By definition, the first observation here contains the initial conditions for the model, meaning everything that happened up to and including the first period of the sample. The question is, how do we get rid of this term?

A practical solution is to assume that $x_1$ can be treated as a fixed value in repeated realizations. (Compare with the stochastic regressor case in OLS.) In this case $\log f(x_1)$ can be seen as a constant which can be left out of the MLE because it will not affect the estimates of the parameters.

An alternative way is to assume that $\tilde X_t$ is stationary and normally distributed. The absolute value of $a_1$ will then be less than one. The unconditional normal distribution of $\tilde X_1$ is therefore known to have mean zero and variance $\sigma^2/(1 - a_1^2)$. The likelihood becomes

$$\log L(a_1, \sigma^2; x) = -(T/2)\log 2\pi - (T/2)\log\sigma^2 + (1/2)\log(1 - a_1^2) - (1/2\sigma^2)(1 - a_1^2)x_1^2 - (1/2\sigma^2)\sum_{t=2}^{T}(x_t - a_1 x_{t-1})^2,$$

where the third and fourth terms, together with one observation's share of the first two terms, make up $\log D(x_1)$. Unfortunately the first-order conditions of this log likelihood are no longer linear in $a_1$. The most convenient solution in this case is to drop the third and the fourth terms from the likelihood, with the argument that we are only dealing with one observation, so the asymptotic properties of the estimator should be unchanged. The conclusion would be the same if we assume that $\tilde X_1$ is fixed in repeated samples.

Finally, the most difficult way of dealing with the situation is to use the sample information to estimate the initial conditions of $\tilde X_t$. This would be recommended if we are modeling non-stationary variables where the distribution of the initial value might differ to a large extent from the following observations. (An example of this can be found in Bergstrom (1989).)

15.7 Estimation with Integrated Variables

(To be completed and extended)

In this section we investigate the problems of estimating integrated series.
An integrated variable can be defined as, "A series ($x_t$) with no deterministic component and which has a stationary and invertible autoregressive moving average (ARMA) representation after differencing $d$ times, but which is not stationary after differencing only $d - 1$ times, is said to be integrated of order $d$, denoted $x_t \sim I(d)$." (Banerjee et al. (1993))

In many areas where time series techniques are applied, integrated variables are rare exceptions, which are seldom interesting to analyse. In economics this is not the case: most macroeconomic time series appear to be integrated or nearly integrated series, see Nelson and Plosser (1982). Thus, the estimation and distribution of sample estimates are of great importance in economics, especially since regression with integrated variables often results in spurious correlations when standard distributions are used for inference.

The simplest example of an I(1) series is the random walk model $y_t = y_{t-1} + \varepsilon_t$, where $\varepsilon_t \sim NID(0, \sigma^2)$. Taking the first difference of this variable results in a stationary I(0) series according to the definition given above. If $y_t$ is generated as an integrated series, the main problem with estimating a random walk model,

$$y_t = \rho y_{t-1} + \varepsilon_t, \tag{15.42}$$

is that the estimated $\rho$ does not follow a normal distribution, not even asymptotically. The problem here is not inconsistency, but the nonstandard distribution of the estimated parameters. This is clearly established in Fuller (1976), where the results from simulating the empirical distribution of the random walk model are presented. Fuller generated data series from a driftless random walk model, and estimated the following models,

a) $\Delta y_t = \pi y_{t-1} + \varepsilon_t$;
b) $\Delta y_t = \alpha + \pi y_{t-1} + \varepsilon_t$;
c) $\Delta y_t = \alpha + \beta(t - \bar t) + \pi y_{t-1} + \varepsilon_t$;

where $\alpha$ is a constant and $(t - \bar t)$ a mean adjusted deterministic trend. These equations follow from the random walk model. The reason for setting up these three models is that the modeler will not know in practice that the data is generated by a driftless random walk.
S/he will therefore add a constant (representing the deterministic growth trend in $y_t$) or a constant and a trend. The models are easy to understand: simply subtract $y_{t-1}$ from both sides of the random walk model,

$$y_t - y_{t-1} = \rho y_{t-1} - y_{t-1} + \varepsilon_t, \tag{15.43}$$

which leads to

$$\Delta y_t = (\rho - 1)y_{t-1} + \varepsilon_t = \pi y_{t-1} + \varepsilon_t. \tag{15.44}$$

Thus, if $y_{t-1}$ is integrated of order one, $\pi = 0$. The problem here is that since $\hat\pi$ does not follow a standard distribution the conventional t-statistic cannot be used. This would not be a problem if $\rho$ equals, say, 0.99. Then $|\rho| < 1$ and the series would be stationary, and its asymptotic behaviour would be like that of the AR(1) model above.

Fuller's simulations of the empirical t distribution of the estimated $\pi$ in the three models showed that they did not converge to the normal distribution. With these results he established what are now known as the Dickey-Fuller distributions. Furthermore, the divergences compared to the normal distributions are huge. So, here is a case where the central limit theorem does not work.

The standard t-statistic for an infinitely large sample is, for a two sided test of $\hat\pi \neq 0$, equal to 1.96 at the 5 % level. However, according to the simulations of Dickey and Fuller the appropriate value of the t-statistic in model (a) is 2.23, for an infinitely large sample. In an autoregressive model we know that the estimate of $\rho$ is biased downward. Thus, the alternative hypothesis in models (a) to (c) is that $\pi$ is less than zero. The associated asymptotic t-value for an estimate from a normal distribution is therefore -1.65. Dickey and Fuller established that the asymptotic critical values for one sided t-tests at the 5 % level in the models (a) to (c) are -1.95, -2.86 and -3.41 respectively. (See Fuller (1976), Table 3.2.1, page 373.) Notice that the critical values change depending on the parameters included in the empirical model.
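The effect of using the normal critical value in model (a) can be illustrated with a small Monte Carlo experiment. This is a minimal sketch and not a reproduction of Fuller's study: the sample size (T = 100), the number of replications (2000), the seed and the function names are all arbitrary assumptions.

```python
import math
import random


def df_t_statistic(y):
    """t-ratio of pi in  dy_t = pi * y_{t-1} + e_t  (model (a), no constant)."""
    lagged = y[:-1]
    dy = [y[t] - y[t - 1] for t in range(1, len(y))]
    sxx = sum(v * v for v in lagged)
    pi_hat = sum(v * d for v, d in zip(lagged, dy)) / sxx
    resid = [d - pi_hat * v for v, d in zip(lagged, dy)]
    s2 = sum(r * r for r in resid) / (len(dy) - 1)   # residual variance
    return pi_hat / math.sqrt(s2 / sxx)


def simulate_df(n_rep=2000, T=100, seed=12345):
    """t-ratios from n_rep driftless random walks of length T."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_rep):
        y = [0.0]
        for _ in range(T):
            y.append(y[-1] + rng.gauss(0.0, 1.0))
        stats.append(df_t_statistic(y))
    return stats


stats = simulate_df()
share_normal_cv = sum(s < -1.65 for s in stats) / len(stats)  # normal 5 % value
share_df_cv = sum(s < -1.95 for s in stats) / len(stats)      # Dickey-Fuller 5 % value
print("rejections with -1.65:", share_normal_cv)
print("rejections with -1.95:", share_df_cv)
```

Under the null of a unit root, the share of rejections with the normal critical value should come out clearly above the nominal 5 %, while the Dickey-Fuller value should give a share close to it.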
Also, the empirical distributions assume white noise residuals; if this is not the case, either the model or the test statistic must be adjusted. Moreover, as long as $\rho = 1$ or $\pi = 0$ cannot be rejected, the estimated constant term in model b, as well as the constant and the quadratic trend in model c, also follow non-normal distributions. These cases are tabulated in Dickey and Fuller (1981).

The consequence of ignoring the results of Dickey and Fuller is obvious. If using the standard tables, one will reject the null hypothesis of $\pi = 0$ ($\rho = 1.0$) too many times. It follows that if you use standard t-tests you will end up treating non-stationary series as stationary, which in turn takes you to the spurious regression problem. The alternative hypotheses for unit root tests are discussed in the following chapter.

The explanation of why the t-statistic ends up being non-normally distributed can be introduced as follows. As $T$ goes to infinity, the relative distance between $y_t$ and $y_{t-1}$ becomes smaller and smaller. Increasing the sample size implies that the random walk model goes towards a continuous time random walk model. The asymptotic distribution of such a model is that of a Wiener process (or Brownian motion). The OLS estimate is

$$(\hat\rho - 1.0) = \left[T^{-1}\sum_{t=1}^{T} y_{t-1}^2\right]^{-1}\left[T^{-1}\sum_{t=1}^{T} y_{t-1}\varepsilon_t\right], \tag{15.45}$$

where, since $y_t$ is driven by a stochastic trend, $y_t = \sum_{i=1}^{t}\varepsilon_i$, the sample moments of the two factors on the RHS will not converge to constants, but to random variables instead. These random variables will have a non-standard distribution, often called a Dickey-Fuller distribution. We can express this as

$$(\hat\rho - 1.0) \Rightarrow [W_{yy}(t)]^{-1}[W_{y\varepsilon}(t)], \tag{15.46}$$

where $W(t)$ indicates that the sample moment converges to a random variable which is a function of a Wiener process and therefore distributed according to a non-standard distribution. If the residuals are white noise then we get the so-called Dickey-Fuller distributions.
The intuition behind this result is that an integrated variable has an infinite memory, so the correlation between $y_{t-1}$ and $\varepsilon_t$ does not disappear as $T$ grows.

The nonstandard distribution remains, and gets worse if we choose to regress two independent integrated variables against each other. Assume that $x_t$ and $y_t$ are two random walk variables, such that

$$y_t = y_{t-1} + \varepsilon_t, \quad \text{and} \quad x_t = x_{t-1} + \eta_t, \tag{15.47}$$

where both $\varepsilon_t$ and $\eta_t$ are $NID(0, \sigma^2)$. In this case, $\beta$ would equal zero in the model $y_t = \beta x_t + u_t$. Since $y_t$ and $x_t$ are independent, one would expect the estimated t-value from this model to converge to zero. This is not what happens when $y_t$ and $x_t$ are integrated variables. In this case the empirical t-value does not settle down around zero but tends to exceed 2.0 in absolute value, leading to spurious correlation if a standard t-table at 5% is used to test for dependence between the variables.

The problem can be described as follows. If $\beta$ is zero, the residual term will be I(1), having the same sample moments as $y_t$. Since $y_t$ is a random walk we know that the variance of $u_t$ will be time dependent and non-stationary as $T$ goes to infinity. The sample estimate of $\sigma_t^2$ is therefore not representative of the true long run variance of the $y_t$ series. The OLS estimator gives

$$\hat\beta = \beta + \left[\frac{1}{T}\sum_{t=1}^{T} x_t^2\right]^{-1}\left[\frac{1}{T}\sum_{t=1}^{T} x_t u_t\right], \tag{15.48}$$

which, if the variables are integrated, converges to

$$(\hat\beta - \beta) \Rightarrow [B_1(t)]^{-1}[B_2(t)], \tag{15.49}$$

where $B_1$ and $B_2$ represent sample moments that are functions of random variables, which follow a Brownian motion (Wiener process). The intuition here is that in the long run a random walk variable collapses to its continuous time counterpart, which is the Brownian motion (the Wiener process). The important difference is that instead of having sample moments which are constant in the long run, we have a ratio between two random variables which are functions of Brownian motions. In this situation the estimated parameters end up following non-standard distributions.
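The spurious regression problem described above is easy to reproduce by simulation. The following is only a sketch under stated assumptions: the regression has no constant, T = 100, 1000 replications, and the seed and function names are arbitrary. It compares how often a standard two sided 5 % t-test rejects $\beta = 0$ when the two independent series are random walks and when they are white noise.

```python
import math
import random


def slope_t_value(x, y):
    """t-ratio of beta in  y_t = beta * x_t + u_t  (no constant), standard OLS formula."""
    sxx = sum(v * v for v in x)
    beta_hat = sum(a * b for a, b in zip(x, y)) / sxx
    ssr = sum((b - beta_hat * a) ** 2 for a, b in zip(x, y))
    s2 = ssr / (len(x) - 1)
    return beta_hat / math.sqrt(s2 / sxx)


def draw_series(rng, T, integrated):
    """White noise if integrated is False, otherwise its cumulative sum (a random walk)."""
    level, out = 0.0, []
    for _ in range(T):
        shock = rng.gauss(0.0, 1.0)
        level = level + shock if integrated else shock
        out.append(level)
    return out


def rejection_rate(integrated, n_rep=1000, T=100, seed=2011):
    """Share of replications where |t| > 1.96 for two independent series."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_rep):
        x = draw_series(rng, T, integrated)
        y = draw_series(rng, T, integrated)
        if abs(slope_t_value(x, y)) > 1.96:
            rejections += 1
    return rejections / n_rep


rate_random_walks = rejection_rate(integrated=True)
rate_white_noise = rejection_rate(integrated=False)
print("random walks:", rate_random_walks)
print("white noise: ", rate_white_noise)
```

With stationary white noise the rejection rate should stay near the nominal 5 %, whereas with independent random walks the standard table should signal a relationship in a large share of the samples.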
It is easy to understand why this is a bit problematic: just recall that a random walk can be written as the sum of all shocks to the series, plus the initial value. In other words, the sample moments in this case are sums of partial sums, since each observation of $x_t$ can be written as a sum of shocks.

The estimated parameter will still converge to its true sample moment. The variance, however, will be different. It can be shown that in this case the sample moment of $\hat\beta$ is $(\hat\beta - \beta) \sim NID[0, \sigma^2(t)]$, where the variance is a function of time. The estimate of $\beta$ is still asymptotically correct and normally distributed, but its variance is the variance of a Brownian motion. It can also be shown that the convergence of $\hat\beta$ to its true value is much faster than in the standard OLS case. Stock (1985) showed that the rate of convergence is $1/T$, instead of the standard OLS rate $1/T^{1/2}$. This is known as super convergence. Unfortunately, this is only an asymptotic result. In most applications the short run dynamics between the variables will seriously bias the OLS estimate in this situation.

The consequence is that if one tries to use standard tables, like t or F, to test the significance of $\beta$, one might not be able to reject spurious results. The true "t-values" of this model will be much higher. If one regresses one or more independent random walks against each other, standard t and F tables become useless and will lead the researcher to accept hypotheses of correlation when there is no correlation whatsoever.

These results might look like a special case, but they are not. In fact they carry over to small sample estimates involving all types of integrated and near integrated variables. The distributions of the parameters based on strongly autocorrelated data are closer to those of a random walk than to those of standard stationary normal variables.
These results stress the importance of testing for the type of non-stationarity, the order of integration, and the presence of cointegration when working with time series. Otherwise one can easily fall into the spurious regression trap. The problems are likely to carry over even to the situation where $\beta$ is different from zero. In this case $u_t$ will be stationary, but the distribution of $\hat{\beta}$ is nonstandard as long as the two residual terms $\varepsilon_t$ and $\eta_t$ are dependent. In general, without a priori knowledge, the estimated standard errors from integrated variables must be assumed to follow nonstandard distributions. The estimated equation must either be modified, or cointegration tests must be carried out.

16. ENCOMPASSING

Often you will find that there are several alternative variables that you can put into a model; there might be several measures of income, capital or interest rates to choose from. Going from a general to a specific model, several models of the same dependent variable might display white noise innovation terms and stable parameters that all have signs and sizes in line with economic theory. A typical example is given by Mankiw and Shapiro (1986), who argue that in a money demand equation, private consumption is a better variable than income. Thus, we are faced with two empirical models of money demand.[1] The first model is

$$m_t = \beta_0 + \beta_1 y_t + \beta_2 y_{t-1} + \beta_3 r_t + \varepsilon_t, \tag{16.1}$$

and the second is

$$m_t = \gamma_0 + \gamma_1 c_t + \gamma_2 c_{t-1} + \gamma_3 r_t + \eta_t. \tag{16.2}$$

Which of these models is the best one, given that both can be claimed to be good estimates of the data generating process? The better model is the one that explains more of the systematic variation of $m_t$ and explains where other models go wrong. Thus, the better model will encompass the not so good models. The crucial factor is that $y_t$ and $c_t$ are two different variables, which leads to a non-nested test.
To understand the difference between nested and non-nested tests, set $\beta_2 = 0$. This is a nested test because it involves a restriction on the first model only. Now, set $\beta_1 = \beta_2 = 0$; this is also a nested test, because it only reduces the information of model one. Similarly, $\gamma_1 = \gamma_2 = 0$ is a nested test of the second model. Thus, setting $\beta_1 = \beta_2 = 0$, or $\gamma_1 = \gamma_2 = 0$, gives only special cases of each model. The problem that we would like to address here is whether to choose $y_t$ or $c_t$ as "the scale variable" in the money demand equation. This is a non-nested test because the test cannot be written as a restriction in terms of one model only.

The first thing to consider is that a stable model is better than an unstable one, so if only one of the models is stable, that is the one to choose. The next measure is to compare the residual variances and choose the model with the significantly smaller error variance. However, variance dominance is not really sufficient. PcGive therefore offers more tests that allow the comparison of Model one versus Model two, and vice versa. Thus, there are three possible outcomes: Model one is better, Model two is better, or there is no significant difference between the two models.

[1] For simplicity we assume that there is only one lag on income and consumption. This should not be seen as a restriction; the lag length can vary between the first and the second model.

17. ARCH MODELS

Autoregressive Conditional Heteroscedasticity (ARCH) means that the variance of a process changes in a systematic way over time. Why should one bother about heteroscedasticity in time series models? Heteroscedasticity is often viewed as unimportant in time series modeling, except for the fact that it leads to inefficient estimates. Recall the linear regression model,

$$y_t = \beta x_t + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma^2), \tag{17.1}$$

where the residual variance ($\sigma^2$), the variance of $y_t$ around its conditional mean, is usually assumed to be constant over time.
In principle, however, nothing prevents the variance from varying over time ($\sigma_t^2$). There are four reasons why this type of heteroscedasticity is important in time series models.

The first is that any departure from white noise residuals is a sign of misspecification. Heteroscedasticity tests represent a way of detecting misspecification originating from leaving out an important explanatory variable which is totally orthogonal to the other explanatory variables in the model.

Second, if the variance of the model is changing over time, so will the forecast intervals of the model. Hence, for the purpose of making better predictions ARCH is of interest because it leads to better forecast confidence intervals. One example is so-called Value at Risk (VaR) models, which are used to forecast the level of reserves needed to meet cash flow fluctuations.

Third, the modeling of ARCH disturbances is sometimes implied by theory, and in general it makes sense from economic theory in many situations. ARCH represents a time series approach to the variance component, which picks up effects not otherwise included in the model. Various types of time-varying risk premiums are examples of this. Variables such as time-varying risk premiums are difficult to observe and measure, but we can trace their effects on the variance in a model like the one above. Examples of applications are intertemporal asset market models, CAPM, exchange rate markets, etc.

Fourth, option prices depend critically on the expected future variance of the price of the underlying asset. ARCH models offer a way of forecasting variances such that pricing can be more exact, and more profitable for those who are able to make better forecasts.

An example of an ARCH(1) model is provided by

$$y_t = \beta x_t + \varepsilon_t, \qquad \varepsilon_t \sim N(0, h_t), \tag{17.2}$$

$$h_t = \omega + \alpha_1 \varepsilon_{t-1}^2, \tag{17.3}$$

where the error variance $h_t$ depends on the lagged squared error. The first equation is referred to as the mean equation and the second equation is referred to as the variance equation.
Together they form an ARCH model; both equations must be estimated simultaneously. In the mean equation here, $\beta x_t$ is simply an expression for the conditional mean of $y_t$. In a real situation this can be explanatory variables, or an AR or ARIMA process. It is understood that $y_t$ is stationary and I(0), otherwise the variance will not exist. This example is an ARCH model of order one, ARCH(1). ARCH models can be said to represent an ARMA process in the variance. The implication is that a high variance in period $t-1$ will be followed by higher variances in periods $t$, $t+1$, $t+2$, etc. How long the shock persists depends, as in the ARMA model, on the size of the parameters in combination with the lag lengths. A low variance period is likely to be followed by another low variance period, but a shock to the process and/or its variance will cause the variance to become higher before it settles down in the future.

A consequence of an ARCH process is that the variance can be predicted. In other words, it is possible to predict whether future variances and standard errors will be large or small. This will improve forecasting in general and is a useful tool for the pricing of derivative instruments.

An ARCH(q) process is

$$y_t = \beta x_t + \varepsilon_t, \qquad \varepsilon_t \sim D(0, h_t), \tag{17.4}$$

$$h_t = \omega + \alpha_1 \varepsilon_{t-1}^2 + \alpha_2 \varepsilon_{t-2}^2 + \dots + \alpha_q \varepsilon_{t-q}^2 = \alpha_0 + \sum_{i=1}^{q} \alpha_i \varepsilon_{t-i}^2, \tag{17.5}$$

where $\alpha_0$ and $\omega$ denote the same constant. The expression for the variance shows an autoregressive process in the variance of $\varepsilon_t$. The distribution of the residual term, $D$, is deliberately left undetermined. In ARCH models normality is one option, but often the residual process will be non-normal, display thicker tails, and be leptokurtic. Thus, other distributions such as the Student t-distribution can be a better alternative. The t-distribution has three parameters: the mean, the variance and the "degrees of freedom of the Student t-distribution".
In this case the residual process is $\varepsilon_t \sim St(0, h_t, \nu)$, where $\nu$ is a positive parameter that measures the relative importance of the peak in relation to the thickness of the tails. The Student t-distribution is a symmetrical distribution that contains the normal distribution as a special case, as $\nu \to \infty$.

The ARCH process can be detected by testing for ARCH and by inspecting the PACF and ACF of the estimated squared residuals $\hat{\varepsilon}_t^2$. As is the case for AR models, ARCH has a more general form, the Generalised ARCH (GARCH), which implies lagging the dependent variable $h_t$. A long lag structure in the ARCH process can be replaced by lagged dependent variables to create a shorter process, just as for ARMA processes. A GARCH(1,1) model is written as

$$y_t = \beta x_t + \varepsilon_t, \qquad \varepsilon_t \sim D(0, h_t), \tag{17.6}$$

$$h_t = \omega + \alpha_1 \varepsilon_{t-1}^2 + \beta_1 h_{t-1}. \tag{17.7}$$

The GARCH(1,1) process is a very typical process found in a number of empirical applications of ARCH processes. The convention is to indicate the length of the ARCH part with $q$, and to use the letter $p$ to indicate the number of lags of the variance $h_t$. The same convention assigns $\alpha$ to the ARCH process, and $\beta$ to the GARCH process. Usually $\omega$ is used for the constant, time-independent part of the variance; it is the same constant that is written $\alpha_0$ elsewhere in this chapter. For an asset market this type of process would imply that there are persistent periods when asset prices fluctuate relatively little compared with other periods where prices fluctuate more and for longer times. A general GARCH(q,p) process is

$$y_t = \beta x_t + \varepsilon_t, \qquad \varepsilon_t \sim D(0, h_t), \tag{17.8}$$

$$h_t = \omega + \sum_{i=1}^{q} \alpha_i \varepsilon_{t-i}^2 + \sum_{i=1}^{p} \beta_i h_{t-i}. \tag{17.9}$$

ARCH and GARCH models cannot be estimated by OLS or by standard regression programs. It is necessary to use an iterative system estimation method because the model now consists of two equations, the mean equation and the variance equation, where the variance equation depends on estimates from the mean equation. In the example above, the additional parameters are $\omega$, $\alpha_i$ and $\beta_i$, which must be estimated in the same model.
Therefore, some iterative ML estimator is necessary (special algorithms are also necessary). Gauss is a good program for estimating ARCH models, but takes some investment to learn; SAS (from ver. 6.08) is quite good; EViews is also good, with excellent help facilities; RATS is an alternative. Finally, PcGive 10 can also estimate ARCH and GARCH models. A practical problem in estimation is that in a finite sample there is no guarantee that the estimated variance ($h_t$) will be a positive number. For that reason, software will offer you the opportunity to restrict the value of $\omega$, as well as the sums of the $\alpha$'s and $\beta$'s, to positive numbers.

17.0.1 Practical Modelling Tips

In practical modelling it is necessary to start with the mean equation. A correct specification of the mean equation is needed in order to get the variance process right. A stationary autoregressive process, relevant explanatory variables, and possibly seasonal and other dummies must be included in the mean equation to get rid of autocorrelation and general misspecification. This is a relatively easy procedure for financial return series, which are often martingale processes.

Notice that ARCH and GARCH effects disappear with aggregation over time and with low frequencies in recording data. Thus, ARCH/GARCH is typically never found for observation intervals longer than a month. Monthly data, or shorter intervals, are necessary for the modelling of ARCH/GARCH processes. Even if models estimated with quarterly data and lower frequencies can display ARCH when the residuals are tested, it is usually never possible to build an ARCH/GARCH model with that type of data.

An ARCH process can be identified by testing for ARCH(q) structure in combination with using the ACFs and PACFs of the squared residuals from the mean equation. Estimate the mean equation, save the estimated residuals, square them, and inspect the ACFs and PACFs of these squared residuals to identify a preliminary lag order for the GARCH.
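The identification steps just described (square the residuals, inspect their autocorrelations, test for ARCH) can be sketched as follows. This is only an illustration under stated assumptions: the "residuals" are simulated directly rather than taken from an estimated mean equation, the parameter values and function names are hypothetical, and the test statistic is the usual $TR^2$ from regressing the squared residual on a constant and one lag ($q = 1$), compared with the 5% critical value 3.84 of the $\chi^2(1)$ distribution.

```python
import math
import random


def simulate_arch1(n, omega, alpha, rng, burn=200):
    """Simulate e_t = sqrt(h_t) * z_t with h_t = omega + alpha * e_{t-1}^2."""
    e_prev, out = 0.0, []
    for t in range(n + burn):
        h = omega + alpha * e_prev ** 2
        e_prev = math.sqrt(h) * rng.gauss(0.0, 1.0)
        if t >= burn:  # drop the burn-in observations
            out.append(e_prev)
    return out


def autocorr(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((s - mean) ** 2 for s in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom


def arch_lm_q1(resid):
    """T * R^2 from regressing e_t^2 on a constant and e_{t-1}^2."""
    sq = [e * e for e in resid]
    y, x = sq[1:], sq[:-1]
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return n * sxy ** 2 / (sxx * syy)  # R^2 equals the squared correlation


rng = random.Random(7)
arch_resid = simulate_arch1(2000, omega=1.0, alpha=0.5, rng=rng)
iid_resid = [rng.gauss(0.0, 1.0) for _ in range(2000)]

acf_arch = [autocorr([e * e for e in arch_resid], k) for k in (1, 2, 3)]
lm_arch = arch_lm_q1(arch_resid)
lm_iid = arch_lm_q1(iid_resid)
print("ACF of squared ARCH residuals, lags 1-3:", [round(r, 2) for r in acf_arch])
print(f"LM statistic, ARCH errors: {lm_arch:.1f}; iid errors: {lm_iid:.1f}")
```

In applied work the residuals would come from the estimated mean equation, and more lags would normally be included in the test regression.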
However, inspecting the ACFs and PACFs of squared residuals gives only a rough indication of the orders $q$ and $p$.

17.1 Some ARCH Theory

To explore ARCH models, let us start with the following AR(1) model, which could represent an asset price,

$$y_t = \phi y_{t-1} + \varepsilon_t, \tag{17.10}$$

where $E(\varepsilon_t) = 0$, $Var(\varepsilon_t) = \sigma^2$ and $|\phi| < 1$. (Thus, the model is stable and $y_t$ is stationary.) Furthermore, let us assume that the unconditional mean of $y_t$, estimated by $(1/T)\sum y_t$, is not dependent on time. The expected value of $y_{t+1}$, conditional on the past history of $y_t$, is

$$E_t(y_{t+1} \mid y_t) = \phi y_t, \tag{17.11}$$

which varies over time since $y_t$ is a random variable. Now turn to the variance of $y_{t+1}$,

$$Var(y_{t+1}) = Var(\phi y_t) + Var(\varepsilon_{t+1}). \tag{17.12}$$

We can distinguish two variances. First we have the unconditional variance of $y_{t+1}$, which for an AR(1) is given by

$$Var(y_{t+1}) = \frac{\sigma^2}{1 - \phi^2}. \tag{17.13}$$

Second we have the conditional variance of $y_{t+1}$,

$$Var_t(y_{t+1} \mid y_t) = E[y_{t+1} - E(y_{t+1} \mid y_t)]^2 = \sigma^2. \tag{17.14}$$

We can see that while the conditional expectation of $y_{t+1}$ depends on the information set $I_t = \{y_t\}$, neither the conditional ($Var_t$) nor the unconditional ($Var$) variance depends on $I_t$. If we extend the forecasts $k$ periods ahead we get, by repeated substitution,

$$y_{t+k} = \phi^k y_t + \sum_{i=1}^{k} \phi^{k-i} \varepsilon_{t+i}. \tag{17.15}$$

The first term is the conditional expectation of $y_t$ $k$ periods ahead. The second term is the forecast error. Hence, the conditional variance of $y_t$ $k$ periods ahead is equal to

$$Var_t(y_{t+k}) = \sigma^2 \sum_{i=1}^{k} \phi^{2(k-i)}. \tag{17.16}$$

It can be seen that the forecast of $y_{t+k}$ depends on the information at time $t$. The conditional variance, on the other hand, depends on the length of the forecast horizon ($k$ periods into the future), but not on the information set. Nothing says that this conditional variance should be stable. Like the forecast of $y_t$ it could very well depend on available information as well, and therefore change over time. So let us turn to the simplest case, where the errors follow an ARCH(1) model.
We have the following model, $y_t = \phi y_{t-1} + \varepsilon_t$, where $\varepsilon_t \sim D(0, h_t)$, $E(\varepsilon_t) = 0$, $E(\varepsilon_t \varepsilon_{t-i}) = 0$ for $i \neq 0$, and $h_t = \omega + \alpha \varepsilon_{t-1}^2$. The process is assumed to be stable, $|\phi| < 1$, and since $\varepsilon_t^2$ is positive we must have $\omega > 0$ and $\alpha \geq 0$. Notice that the errors are not autocorrelated, but at the same time they are not independent since they are correlated in higher moments through the ARCH effect. Thus, we cannot assume that the errors really are normally distributed. If we choose to use the normal distribution as a basis for ML estimation, this is only an approximation. (As an alternative we could think of using the t-distribution, since the distribution of the errors tends to have fatter tails than the normal.)

Looking at the conditional expectations of the mean and the variance of this process, we have $E_t(y_{t+1} \mid y_t) = \phi y_t$ and $Var_t(y_{t+1} \mid y_t) = h_{t+1} = \omega + \alpha (y_t - \phi y_{t-1})^2$. We can see that both depend on the available information at time $t$. In particular, it should be noticed that the conditional variance of $y_{t+1}$ is increased by both positive and negative shocks in $y_t$. Extending the conditional variance expression $k$ periods ahead, as above, we get

$$Var_t(y_{t+k} \mid y_t) = \sum_{i=1}^{k} \phi^{2(k-i)} E_t(h_{t+i}), \tag{17.17}$$

where $E_t(h_{t+i})$ is the conditional expectation of the error variance $i$ periods ahead. To solve for the latter, and express the forecast in the same way as the one for the conditional mean, let us turn to the unconditional variance, $E(\varepsilon_t \varepsilon_t) = \sigma^2$. Taking expectations of $h_t$,

$$\sigma_t^2 = \omega + \alpha \sigma_{t-1}^2, \tag{17.18}$$

which is

$$(1 - \alpha L)\sigma_t^2 = \omega, \tag{17.19}$$

which, since $\sigma_t^2 = \sigma_{t-1}^2 = \sigma^2$, implies that

$$(1 - \alpha)\sigma^2 = \omega. \tag{17.20}$$

Substitute for $\omega$ in $h_t$,

$$h_t = (1 - \alpha)\sigma^2 + \alpha \varepsilon_{t-1}^2, \tag{17.21}$$

to get the relationship between the conditional and the unconditional variances of $y_t$.
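The relation (17.20) between $\omega$, $\alpha$ and the unconditional variance can be checked numerically. The sketch below is an illustration only: the parameter values, the seed and the variable names are assumptions, and the shocks are drawn from a conditional normal distribution, which is just one possible choice for $D$.

```python
import math
import random

omega, alpha = 0.7, 0.3            # illustrative values, omega > 0 and 0 <= alpha < 1
sigma2 = omega / (1.0 - alpha)     # unconditional variance implied by (17.20)

# Iterate (17.18), sigma2_t = omega + alpha * sigma2_{t-1}, from an arbitrary start.
s2 = 5.0
for _ in range(200):
    s2 = omega + alpha * s2

# Simulate ARCH(1) errors and compare their sample variance with sigma2.
rng = random.Random(42)
n = 100_000
eps_prev, total = 0.0, 0.0
for _ in range(n):
    # (17.21): h_t = (1 - alpha) * sigma2 + alpha * eps_{t-1}^2
    h = (1.0 - alpha) * sigma2 + alpha * eps_prev ** 2
    eps_prev = math.sqrt(h) * rng.gauss(0.0, 1.0)
    total += eps_prev ** 2
sample_var = total / n             # the errors have mean zero, so no centering

print(f"implied: {sigma2:.3f}, iterated: {s2:.3f}, simulated: {sample_var:.3f}")
```

The iteration converges to $\omega/(1-\alpha)$ from any starting value when $\alpha < 1$, and the sample variance of the simulated errors should lie close to the same number even though the conditional variance $h_t$ changes every period.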
The expected value of $h_t$ in any period $i$ is

$$E(h_{t+i}) = \sigma^2 + \alpha E[h_{t+i-1} - \sigma^2]. \tag{17.22}$$

Repeated substitution leads to the conditional variance $k$ periods ahead,

$$Var_t(y_{t+k} \mid y_t) = \sigma^2 \sum_{i=0}^{k-1} \phi^{2i} + (h_{t+1} - \sigma^2) \sum_{i=0}^{k-1} \phi^{2i} \alpha^{k-1-i}. \tag{17.23}$$

The first term on the RHS is the long-run unconditional forecast variance of $y_t$. The second term represents the memory in the process, given by the presence of $h_{t+1}$. If $\alpha < 1$ the influence of $(h_{t+1} - \sigma^2)$ will die out in the long run and the second term vanishes. Thus, for long-run forecasts it is only the unconditional forecast variance which is of importance. Under the assumption of $\alpha < 1$ the memory in the ARCH effect dies out. (Below we will relax this assumption, and allow for unit roots in the ARCH process.)

17.2 Some Different Types of ARCH and GARCH Models

ARCH models represent a class of models where the variance is changing over time in a systematic way. Let us now define different types of ARCH models. In all these models there is always a mean equation, which must be correctly specified for the ARCH process to be modeled correctly.

1) ARCH(q), the ARCH model of order $q$,

$$h_t = \alpha_0 + \sum_{i=1}^{q} \alpha_i \varepsilon_{t-i}^2 = \alpha_0 + A(L)\varepsilon_t^2. \tag{17.24}$$

This is the basic ARCH model, from which we now introduce different effects.

2) GARCH(q, p). Generalized ARCH models. If $q$ is large then it is possible to get a more parsimonious representation by adding lagged $h_t$ to the model. This is like using ARMA instead of AR models. A GARCH(q, p) model is

$$h_t = \alpha_0 + \sum_{i=1}^{q} \alpha_i \varepsilon_{t-i}^2 + \sum_{i=1}^{p} \beta_i h_{t-i} = \alpha_0 + A(L)\varepsilon_t^2 + B(L)h_t, \tag{17.25}$$

where $p \geq 0$, $q > 0$, $\alpha_0 > 0$, $\alpha_i \geq 0$, and $\beta_i \geq 0$. The sum of the estimated parameters, $A(1) + B(1) = \sum \alpha_i + \sum \beta_i$, shows the memory of the process. A sum equal to unity indicates that shocks to the variance have permanent effects, as in a random walk model. A high value of the sum, but less than unity, indicates a long memory process: it takes a long time before shocks to the variance disappear.
If the roots of $[1 - B(L)] = 0$ are outside the unit circle, the process is invertible and

$$h_t = \alpha_0 [1 - B(L)]^{-1} + A(L)[1 - B(L)]^{-1} \varepsilon_t^2 \tag{17.26}$$

$$= \alpha_0 \left[1 - \sum_{i=1}^{p} \beta_i\right]^{-1} + \sum_{i=1}^{\infty} \delta_i \varepsilon_{t-i}^2 \tag{17.27}$$

$$= a + D(L)\varepsilon_t^2, \quad \text{an ARCH}(\infty). \tag{17.28}$$

If $D(L)$ is of finite order ($D(L) < \infty$), then GARCH reduces to ARCH. Moreover, if the long-run solution of the model, $B(1)$, is $< 1$, the $\delta_i$ will decrease for all $i > \max(p, q)$. GARCH models are standard tools, in particular for modeling foreign exchange rate markets and financial market data. Often the GARCH(1,1) is the preferred choice. GARCH models capture some empirical observations quite well. The distributions of many financial series display fatter tails than the standard normal distribution. GARCH models in combination with the assumption of a normal distribution of the residual can generate such distributions. However, many series, like foreign exchange rates, both display fatter tails and are leptokurtic (the peak of the distribution is 'higher' than that of the normal). A GARCH process combined with the assumption that the errors follow the t-distribution can generate this type of observed data.

Before continuing with different ARCH models, we can now look at an alternative formulation of ARCH models which shows their similarities with ordinary time series models. Define the innovations in the conditional variance as

$$v_t = \varepsilon_t^2 - h_t. \tag{17.29}$$

The variable $v_t$ can be thought of as surprises in volatility, arising from new, unexpected information on the markets. The GARCH model is then

$$[1 - B(L)](\varepsilon_t^2 - v_t) = \alpha_0 + A(L)\varepsilon_t^2, \tag{17.30}$$

which can be written as

$$[1 - B(L)]\varepsilon_t^2 = \alpha_0 + A(L)\varepsilon_t^2 + [1 - B(L)]v_t, \tag{17.31}$$

and

$$[1 - B(L) - A(L)]\varepsilon_t^2 = \alpha_0 + v_t - B(L)v_t, \tag{17.32}$$

which is an ARMA process. This shows us that we can identify a GARCH process using the same tools as for an ARMA model, that is, by looking at the autocorrelations and partial autocorrelations of $\hat{\varepsilon}_t^2$, the squared residuals from the mean equation estimated by OLS.
Solving for the GARCH(1,1) model,

$$\varepsilon_t^2 = \alpha_0 + (\alpha_1 + \beta_1)\varepsilon_{t-1}^2 - \beta_1 v_{t-1} + v_t. \tag{17.33}$$

If $\alpha_1 + \beta_1 = 1$, or $\sum(\alpha_i + \beta_i) = 1$ in a GARCH(q, p) model, we get what is called an integrated GARCH model.

3) ARCH(q) model with explanatory variables,

$$h_t = \alpha_0 + A(L)\varepsilon_t^2 + \gamma x_t, \tag{17.34}$$

where $x_t$ is a vector of explanatory variables, and $\gamma$ a vector of parameters. In this model we have added explanatory variables into the ARCH process, just like we can add exogenous explanatory variables into an ARMA model.

4) M-ARCH. Multivariate ARCH. The multivariate ARCH is basically an extension of the univariate model to a system of equations with time-varying variances and covariances, like

$$H_t = \begin{bmatrix} h_{11,t} & h_{12,t} & \dots & h_{1n,t} \\ h_{21,t} & h_{22,t} & \dots & h_{2n,t} \\ \dots & \dots & \dots & \dots \\ h_{n1,t} & h_{n2,t} & \dots & h_{nn,t} \end{bmatrix}.$$

The M-ARCH is like a VAR model for a system of variables, only now the system is extended to allow for interaction among the variances as well. Typical applications of multivariate ARCH are CAPM models of asset portfolios.

5) ARCH in mean. It is possible to "put back" the ARCH process into the conditional mean of the process, and let it represent some variable, for example a time-varying risk premium. In this case we get the following system,

$$y_t = \beta x_t + \lambda h_t^{1/2} + \varepsilon_t, \qquad h_t = \alpha_0 + A(L)\varepsilon_t^2. \tag{17.35}$$

There exist various ways of 'putting' the variance 'back' in the mean equation. The example above assumes that it is the standard error which is the interesting variable in the mean equation.

6) IGARCH. Integrated GARCH. When the coefficients sum to unity we get a model with extremely long memory (similar to the random walk model). Unlike the cases discussed earlier, the shocks to the variance will not die out. Current information remains important for all future forecasts. We talk about an integrated variance and persistence in variance. A significant constant term in a GARCH process can be understood as a mean reversion of the variance.
But if the variance is not mean-reverting, integrated GARCH is an alternative: in a GARCH(1,1) process one can set the constant to zero and restrict the two parameters to sum to unity.

7) EGARCH. Exponential GARCH and ARCH models. (Exponential due to logs of the variables in the GARCH model.) These models have the interesting characteristic that they allow for different reactions to negative and positive shocks, a phenomenon observed on many financial markets. In the output the first lagged residual indicates the effect of a positive shock, while the second lagged residual (in absolute terms) indicates the effect of a negative shock.

8) FIGARCH. Fractionally Integrated GARCH. This approach builds on the idea of fractional integration and allows for a slow hyperbolic rate of decay for the lagged squared innovations in the conditional variance function. See Baillie, Bollerslev and Mikkelsen (1996).

9) NGARCH and NARCH. Non-linear GARCH and ARCH models.

10) Common Volatility. Introduced by Engle and Isle 1989 (and 1993), it allows you to test for common GARCH structure in different series.

11) Other types of GARCH models. In the literature there exist a number of X-GARCH types of models. It is not possible to keep track of all possible twists here, but 1-10 are the relevant approaches.

17.3 The Estimation of ARCH models

Let us now turn to the estimation of ARCH and GARCH models. The main problem is the distribution of the error terms; in general they are not normally distributed. The most used alternatives are the t-distribution and the gamma distribution. In applications in finance and foreign exchange rates a t-distribution is often motivated by the fact that the empirical distributions of these variables display fatter tails than the normal distribution. If we assume that the residuals of the model follow a normal distribution, the errors are conditionally normal, or $\varepsilon_t \mid I_{t-1} \sim N(0, h_t)$.
Using that assumption the following likelihood function is estimated,

$$\log L = -\frac{T}{2}\log 2\pi - \frac{1}{2}\sum_{t=1}^{T}\left(\log h_t + \frac{\varepsilon_t^2}{h_t}\right). \tag{17.36}$$

Notice that there are two equations involved here, the mean equation and the variance equation. The process is correctly modelled only when both equations are correctly modelled. To estimate ARCH and GARCH processes, non-standard algorithms are generally needed. If $y_{t-i}$ is among the regressors some iterative method is always required. (GAUSS, RATS and SAS provide such facilities.) There are also special programs which deal with ARCH, GARCH and multivariate ARCH.

The research strategy is to begin by testing for ARCH, using standard test procedures. The following LM test for $q$-th order ARCH is an example,

$$\hat{\varepsilon}_t^2 = \gamma_1 \hat{\varepsilon}_{t-1}^2 + \gamma_2 \hat{\varepsilon}_{t-2}^2 + \dots + \gamma_q \hat{\varepsilon}_{t-q}^2 + \theta y_t + v_t, \tag{17.37}$$

where $TR^2 \sim \chi^2(q)$. Notice that this requires that $E(\varepsilon_t) = 0$, and $E(\varepsilon_t \varepsilon_{t-i}) = 0$ for $i \neq 0$. If ARCH is found, or suspected, use standard time series techniques to identify the process. The specification of an ARCH model can be tested by Lagrange multiplier tests or likelihood ratio tests. As in time series modeling, the Box-Ljung test on the estimated residuals from an ARCH equation serves as a misspecification test.

ARCH-type processes are seldom found in low frequency data. High frequency data are generally needed to observe these effects: daily, weekly, sometimes monthly data, but hardly ever quarterly or yearly data.

Finally, remember two things. First, ARCH effects imply thicker tails than the standard normal distribution. It is not obvious that the normal distribution should be used. On the other hand, there is no obvious alternative either. Often the normal distribution is the best approximation, unless there is some other information. One example of other information is that some series are leptokurtic, with a higher peak than the normal, in combination with fat tails. In that case the t-distribution might be an alternative.
Thus using the normal density function is often an approximation. Second, correct inference on ARCH effects builds upon a correct specification of the mean equation. Misspecification tests of the mean equation are therefore necessary.

18. ECONOMETRICS AND RATIONAL EXPECTATIONS

The presence of expectations has consequences for econometric model building. In particular, rational expectations have extremely important consequences. The most pessimistic views following from rational expectations reduce econometric modeling to simple data description, with little or no room for increasing our understanding of the behavior of economic agents.

Muth's (1961) original definition of rational expectations goes very far. It assumes that agents know the true data generating process (DGP) of the complete system. This is in contrast to the econometrician, who must estimate what he/she thinks is the DGP. The econometrician must also test for significant changes in his/her model before he/she can find out whether the process has changed. In econometrics we can only deal with a limited aspect of rational expectations, namely expectations formed conditionally on past (observed) history. In contrast, in the world of Muth and other rational expectations theorists, agents are free to form the best expectation at any time without estimating, or making inference from, historical data.

We can describe the econometric approach to rational expectations as follows. Let $x_t^e$ be the expected future value of the variable $x_t$ held by the agent(s) at time $t$. The expectation held at time $t$ is $x_t^e = E(x_{t+1} \mid I_t)$, where $I_t$ is the information set containing the historical data used to form the expectation.
Under rational expectations, by definition, the information set contains all relevant information for determining the expectation, so that the expected difference between the actual outcome of $x_{t+1}$ and its expectation ($x_t^e$) is zero, $E[(x_{t+1} - x_t^e) \mid I_t] = 0$. This is a weak condition. It allows expectations to be erroneous in individual periods, but requires that they are correct on average. Thus, in applied work the difference between the outcome and the expectation should be a martingale difference process. Assuming that the difference is also a white noise innovation process is generally stronger than necessary.

If the 'ordinary' econometric model, not based on expectations, is formulated as $y_t = \beta x_t + e_t$, the assumption of rational expectations leads to the following model,

$$y_t = \beta E\{x_{t+1} \mid I_t\} + e_t, \quad \text{or} \quad y_t = \beta x_t^e + e_t, \tag{18.1}$$

where $x_t^e$ is the expected value of the variable $x_t$.[1]

[1] This is a generic example where $x_t^e$ can be any variable, including $y_t^e$.

18.0.1 Rational v.s. other Types of Expectations

In earlier literature some researchers used to model other types of expectations than rational expectations, like "myopic" or "static" expectations. These alternatives are generally ad hoc, and not based on any reasonable assumptions about the behavior of economic agents. Expectations other than rational imply that agents might ignore information that would raise their utility. With anything other than rational expectations agents are allowed to make systematic mistakes, implying that they ignore profit opportunities or that they are not, for some unexplained reason, maximizing their utility. Economic science has yet to identify such behavior in the real world. Rational expectations become an equilibrium condition in the sense that the difference between predictions and outcomes cannot be predicted.
A model which allows for predictable differences between the expectations and the outcome is not complete without an economic explanation of what the difference means, and why it occurs. The correct way to approach the modeling of expectations is to assume that agents form expectations so that they do not make systematic mistakes that reduce their welfare. Information used to predict the future will be collected and processed up to the point where the cost of gathering more information balances the revenue from additional information. Based on this type of behavior it might, as a special case, be optimal to use, say, today's value of a variable to predict all future values of that variable. But these are exceptions to the rule.

In general there is a catch-22 situation in the modeling of rational expectations behavior. If the econometrician finds from ex post data that the agents are making systematic mistakes, this is no evidence against the rational expectations hypothesis. Instead, the empirical finding might be the result of conditioning on the wrong information set. Alternatively, the modeling of the expectation might be correct, and be an unbiased and efficient estimate of the expectation held only at a certain point in time. This argument also includes situations where there is a small probability of an event with large consequences, such as devaluations, unpredicted changes in the monetary regime, wars, natural disasters, etc. Examining these situations generally requires further testing of the model, where the outcome will depend to a large extent on assumptions regarding the distributions of the processes, whether they are linear or non-linear, etc.

The discussion about other types of expectations brings us to the concepts of forward looking v.s. backward looking behavior. The difference can be explained as follows. Consumption based on forward looking behavior is determined on the basis of expected future income. Consumption based on actual (existing) income is backward looking.
In practice there might not be a big difference; your present or recent income might be a good approximation to your future income. In some cases rational expectations might be to base decisions on contingent rules, and revise these rules only when the cost of deviating from the optimal/desired consumption is 'too big' (or when the alternative cost of being outside equilibrium is too high).

18.0.2 Typical Errors in the Modeling of Expectations

When the expectation is not directly observed, there are two common mistakes in econometric models of expectations-driven stochastic processes. The first mistake is to substitute $x_t^e$ with the observed value $x_t$. This leads to an errors-in-variables problem, since $x_t = x_t^e + v_t$, where $E(v_t) = 0$. The errors-in-variables problem implies that $\beta$ will not be estimated correctly. OLS is inconsistent for the estimation of the original parameter.

The second mistake is to model the process for $x_t$ and substitute this process into 18.1. Assume that the variable $x_t$ follows an AR(2) process, like $x_t = a_1 x_{t-1} + a_2 x_{t-2} + n_t$, where $n_t \sim NID(0, \sigma^2)$. Estimation of equation 18.1 leads to

$$y_t = \beta a_1 x_{t-1} + \beta a_2 x_{t-2} + e_t = \pi_1 x_{t-1} + \pi_2 x_{t-2} + e_t. \tag{18.2}$$

This estimated model also gives the wrong results if we are interested in estimating the (deep) behavioral parameter $\beta$. The variables $x_{t-1}$ and $x_{t-2}$ are not weakly exogenous for the parameter of interest ($\beta$) in this case. The estimated parameters will be a mixture of the deep behavioral parameter and the parameters of the expectations generating process ($a_1$ and $a_2$). Not only are the estimates biased, but policy conclusions based on this estimated model will also be misleading. If the parameters of the marginal model ($a_1$ and $a_2$) describe some policy reaction function, say a particular type of money supply rule, changing this rule, i.e.
changing $a_1$ and $a_2$, will also change $\pi_1$ and $\pi_2$. This is a typical example of when super exogeneity does not hold, and when an estimated model cannot be used to form policy recommendations.

What is the solution to this dilemma of estimating "deep" behavioral parameters, in order to understand the workings of the economy better?

1. One conclusion is that econometrics will not work. The problems of correctly specifying the expectations process, in combination with short samples, make it impossible to use econometrics to estimate "deep" parameters. A better alternative is to construct micro-based theoretical models and simulate these models (for example, using calibration techniques).

2. Sims' solution was to advocate VAR models, and avoid estimating "deep" parameters. VAR models can then be used to increase our understanding of the economy, and be used to simulate the consequences of unpredictable events, like monetary or fiscal policy shocks, in order to optimize policy.

3. Though the rational expectations critique (Lucas, Sims and others) seems to be devastating for structural econometric modeling, the critique has yet to be proven. In surprisingly many situations, policy changes appear to have small effects on estimated equations, e.g. the effects of the switch in monetary policy in the UK in the early 1980s.

4. Finally, the assumption of rational expectations provides a priori information that can be used to formulate an econometric model from the beginning.

There are, in principle, three ways in which one can approach this problem: i) substitution, ii) system estimation based on the Full Information Maximum Likelihood (FIML) estimator, or iii) the Generalized Method of Moments (GMM) estimator. Substitution means replacing the unobserved expected explanatory variable with a measured or estimated expectation. This expectation could either be a survey expectation or an expectation generated by a forecasting model, e.g. an ARIMA model. The FIML method can be said to build the econometric forecast into an estimated system.
The GMM estimator builds on the assumption that the explanatory variables and the residuals are orthogonal to each other. Since rational expectations imply that the (rationally expected) explanatory variables are orthogonal to the residuals, the GMM estimator is well suited for rational expectations models. Because of this it is the preferred choice when it comes to estimating rational expectations models, especially in finance applications.

18.0.3 Modeling Rational Expectations

(This section is very incomplete - see overheads)

The substitution approach is perhaps the easiest way of modeling rational expectations. The approach is to find an estimate of $E\{x_t \mid I_t\}$. The simplest approach is to let the information set contain only historical values of $x_t$. As an example suppose that $x_t$ is an AR(1) process, so $x_t = \alpha_1 x_{t-1} + v_t$ where $v_t \sim NID(0, \sigma^2)$. The estimated process gives the estimates $\hat{x}_t$ that can be substituted into equation 18.1. The outcome of the substitution is

$$y_t = \beta \hat{x}_t + u_t, \quad \text{where } u_t = e_t - \beta(\hat{x}_t - x_t^e) = e_t - \beta(v_t - \hat{v}_t). \qquad (18.3)$$

OLS will lead to an unbiased estimate $\hat{\beta}$, because $\hat{x}_t$ is weakly exogenous w.r.t. $\beta$.

FIML estimation builds on substituting $x_t^e$ with the actual value $x_t$ and estimating this equation simultaneously with the marginal model for $x_t$, say the AR(1) model assumed in the substitution example above.

GMM and instrumental variables techniques start with substitution of the expected value ($x_t^e$) with the actual observation ($x_t$), and then address the errors-in-variables problem. The key to the solution lies in the assumption that the difference between the expectation and the actual outcome is orthogonal to the information set used, the basic assumption for the method of moments estimator. The variables in the marginal process and the possible exogenous variables in the conditional model can then be used as instruments in the estimation of $\beta$.
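A small simulation can make the difference between these approaches concrete. The sketch below is not taken from the text: it assumes that equation 18.1 has the form $y_t = \beta x_t^e + e_t$, that $x_t$ follows an AR(1) process so that $x_t^e = \alpha_1 x_{t-1}$, and it uses illustrative values $\beta = 1$ and $\alpha_1 = 0.8$. The function names `simulate` and `cross` are hypothetical helpers, and all regressions are run without an intercept because every series has mean zero by construction.

```python
import random


def simulate(n, beta=1.0, alpha=0.8, seed=0):
    """Simulate y_t = beta * E(x_t | x_{t-1}) + e_t with x_t an AR(1) process."""
    rng = random.Random(seed)
    x_prev = 0.0
    x_lag, x, y = [], [], []
    for _ in range(n):
        x_exp = alpha * x_prev                   # rational expectation of x_t
        x_now = x_exp + rng.gauss(0.0, 1.0)      # outcome = expectation + surprise v_t
        y_now = beta * x_exp + rng.gauss(0.0, 1.0)
        x_lag.append(x_prev)
        x.append(x_now)
        y.append(y_now)
        x_prev = x_now
    return x_lag, x, y


def cross(a, b):
    """Sum of cross products, the building block of no-intercept OLS."""
    return sum(ai * bi for ai, bi in zip(a, b))


x_lag, x, y = simulate(20000)

# Mistake 1: regress y_t on the observed x_t (errors-in-variables).
b_ols = cross(x, y) / cross(x, x)

# Substitution: fit the AR(1), then regress y_t on the fitted value x_hat_t.
alpha_hat = cross(x_lag, x) / cross(x_lag, x_lag)
x_hat = [alpha_hat * v for v in x_lag]
b_sub = cross(x_hat, y) / cross(x_hat, x_hat)

# Instrumental variables: use x_{t-1} as an instrument for x_t.
b_iv = cross(x_lag, y) / cross(x_lag, x)

print(f"OLS on observed x_t : {b_ols:.3f}")
print(f"substitution        : {b_sub:.3f}")
print(f"IV with x_(t-1)     : {b_iv:.3f}")
```

Under these assumptions the OLS estimate on the observed $x_t$ converges to $\beta\alpha_1^2 = 0.64$ rather than to $\beta = 1$, while the substitution and IV estimates are close to 1. In this just-identified setup the two coincide algebraically, which is why the text can treat them as variations on the same idea.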
18.0.4 Testing Rational Expectations

(To be completed)

Tests concerning given values of $x_t^e$

Given some values of the expectation process, there are three types of tests that can be performed.

1. Test if the difference between the expectation and the outcome is a martingale difference process, conditional on assumptions regarding risk premiums.

2. Test for "news". Under the assumption of rational expectations the expectations driven variable should react only to unpredictable events ("news") and not to events that can be predicted. These assumptions are directly testable as soon as we have a forecasting model for $x_t^e$.

3. Variance bounds tests. Again, given $x_t^e$, it follows that the variance of $y_t$ in equation 18.1 must be higher than the variance of $x_t^e$.

Encompassing tests

If a model that takes account of assumed rational expectations behavior is "the correct model", it follows that this model should encompass other models which lack this feature. Thus, encompassing tests can be used to discriminate between models based on rational expectations and other models.

Tests of super exogeneity

It follows from the rational expectations assumption that the parameters of the conditional model will change whenever the parameters of the marginal model change. First, if it can be established that the conditional model is stable while the marginal model changes, this would be evidence against the rational expectations assumption, at least in the form of forward looking behavior. In the same way, it is possible to test for joint changes/shifts in the marginal and conditional models.

Is rational expectations important? The answer is that it depends on your problem. If you really want to estimate a stochastic phenomenon derived from theory, especially in finance, it is important to take rational expectations into account.
It has to be at least weakly rational expectations, because nobody has found any solid evidence against weak rational expectations. However, if you want to forecast or do 'standard' structural modelling you can test for super exogeneity, and thereby also for rational expectations. Ericsson and Hendry (1989), Ericsson and Irons (1995), and Ericsson and Hendry (1997) do this for almost all instances of radical economic policy changes and find no evidence of the structural breaks in the econometric models predicted by the rational expectations theory. Thus, in practice it is not a big problem unless you want it to be a big problem.

19. A RESEARCH STRATEGY

This section describes a research strategy for finding a "well-defined statistical model" of the DGP, which also has an economic interpretation.

I. Start from theory! Economic theory gives the parameters of interest and the relevant variables for estimating these parameters. Furthermore, theory suggests interesting long-run equilibria, homogeneity conditions etc. It is important to remember that theories are constructions of the human mind. The available data, on the other hand, is the real world. But there might not be a one to one mapping between the variables of the real world and theory, no matter how "good" the theory might be. Aggregation over time and individual units, adjustment costs, measurement errors etc. will affect the estimated model.

II. Determine the order of integration and the type of non-stationarity among the variables. Are some or all variables non-stationary? What type of non-stationarity? The null should be that a variable is integrated of order one, unless there is sufficient evidence to reject this hypothesis. Once you know the order of integration you know how to organise the variables into meaningful statistical relations.
You can test for cointegration, or co-trending, and with this knowledge formulate stationary relations where standard inference is possible, and where you can separate long-run relations (or alternatively permanent shocks) and short-term relations. The golden rule is that if a variable looks like I(1), treat it like an I(1) variable unless you have clear evidence to reject that hypothesis.

III. Build a VAR and test for cointegration among the integrated variables. Cointegration tests aim at identifying long-run stable (stationary), economically interesting relationships among the variables. This can be done 1) in the form of testing specific relations such as PPP, a consumption function, money demand etc., or 2) in the case of building and modeling systems, in a "complete system" or by dividing your problem into separate blocks such as domestic inflation, money demand, economic growth etc. Remember the (asymptotic) property of co-integrating relations: if you find them, they still exist even if you add more variables to the model.

This requires building a VAR and testing for cointegration. The VAR will be the point of departure for formulating a reduced form VECM and then a structural VECM, or single structural equations. The critical step is to find a suitable order of the VAR (number of lags). The principle is to work from general to specific models, and search for parsimonious models. For cointegration tests a lag order of 2 is the minimum and often optimal. Sometimes identifying extreme outliers and adding impulse or step dummies will help to cure both non-normality and autocorrelation in all equations. If it is not possible to get rid of autocorrelation with a small number of lags (perhaps in combination with dummies and seasonals), the alternative is to focus on the second best.
Autocorrelation in these equations is very bad for modelling, but it might not be possible to both eliminate autocorrelation and get a parsimonious model with sufficient degrees of freedom for inference. In that situation, the relevant question is how much of the variation in the left hand side variables it is optimal to model in order to get a nearly well-identified statistical model.

The second best in VAR modelling is to get rid of autocorrelation in as many equations as possible; hopefully this will mean that the vector test of no error autocorrelation is not rejected. In this case, study the F-test for the significance of each lag across the equations in the model. Look at the LR test for comparing lag orders in the VAR and, most importantly, choose the model with the smallest information criteria and the smallest residual autocorrelation (see footnote 1). And, when you test the lag structure, look at the I(1) test for cointegration and study the estimated $\Pi$ matrix for possible economically interesting co-integrating vectors, $\beta' x_{t-1}$. Quite often you will see a stable vector coming up quite independently of the lag order and of autocorrelation in some residuals.

Once the co-integrating rank is determined it remains to identify the estimated co-integrating vectors. If there is only one vector this is relatively simple. If there is more than one vector, the vectors should fulfill the rank condition for identification of co-integrating vectors. This is explained in the work of Juselius and Johansen, and in more advanced textbooks in econometric time series. The golden rule is that the vectors should be unique (look different from each other) and, through the alpha values, determine a left-hand variable. This is achieved by first choosing a suitable normalization, imposing unit elasticities and/or the same value but opposite signs, and by restricting some parameters to be zero in some vectors. (Remember that the size of the coefficients is not related to their significance.)
If co-integration is not found? Rethink the problem. Have you forgotten some important explanatory variable? Look for outliers and test their effects. Use dummies, trends etc. if they can be motivated. Look for structural breaks and check the sample size. Use first (and/or second) differences instead, to get a model with only stationary I(0) variables that leads to estimated parameters with well-defined distributions. You then have to conclude that your model might not be good for long-run analysis. Continue with the modeling process to get the least bad of all possible models, at least. If possible, show that there is strong a priori information that justifies the model. Add that cointegration is only an asymptotic result, and that your sample is too short. Consider stopping the modelling, and conclude that the absence of cointegration is an interesting conclusion in itself (data problems, wrong theory, missing explanatory factors etc.). Do not waste too much time on a problem where the answers will depend on ad hoc assumptions concerning distributions, or on unstable results which will be totally model dependent.

If you find cointegration. Continue by testing long-run homogeneity assumptions, weak exogeneity and identification. This can be done by using Johansen's multivariate cointegration technique. If there is more than one vector, think about identification of the vectors.

Footnote 1: In PcGive 12 you need to indicate in the "Option" window under Model choice that you want information criteria for each model. Then, when you press "Progress", you will see both the F-tests for lag order and the information criteria for the different VAR models you estimated.

IV. Decide on a single equation or a simultaneous model. There are no good tests for weak exogeneity. Typically a good test of simultaneity requires the specification of 'the complete' model to work. And then the work is already done. If you reduce to a single equation (or very limited systems), can you motivate the weak exogeneity assumptions?
The reduced form VECM gives you ideas about what a system might look like, and not look like, through the estimated (significant) alpha values. Is it possible to test for predictability in the VECM by looking at the estimated alpha values, and argue for reductions of the system? Of course, from the reduced form VECM the logical step is to construct a simultaneous structural model based on testing the order and the rank conditions in the model. However, this can be a bit of a challenge, especially if you are short of time. Furthermore, identification must be done on significant parameters (including lags), not on the underlying theoretical lag structure.

V. Set up the Error Correction Representation. In the following we assume that you have chosen to continue with a single equation. Use the results from Johansen's multivariate cointegration technique, then formulate an ECM model directly. Test for cointegration in the ADL representation of the model (PcGive test). It is necessary to choose lag lengths long enough to get white noise residuals. Test if the residuals are $NID(0, \sigma^2)$, and add a RESET test if possible. Having white noise innovation error terms is a necessary condition.

If not white noise innovations? Add more lags. Did you forget something important? Study outliers. Use dummies and trends to get white noise, but remember that they should be motivated. Or continue to the least bad of all possible models, see above. Rethink the problem or stop. RESET test!! (Perhaps you should try to condition on some other variable instead?)

When white noise is established: Is the equation in line with what you think can be an economically meaningful long-run equilibrium? Check the signs and sizes of the parameters.

VI. Reduce the model. Remove insignificant variables (t-values below 1.0 to begin with). Start at low lags. Go from general to specific. Check misspecification/specification during reductions. Run the test summary after each reduction. In PcGive all reductions are saved under Progress.
It is all about "Data Mining", but done efficiently, building on the empirical approach for ARIMA models introduced by Box-Jenkins and on new developments in statistical theory. Modern mathematical statistical theory explains how you can go about finding a Data Generating Process by 'reversing' the sampling process in classical statistics. Textbooks: Spanos, Mittelhammer.

In the reduction process remember the following identities, $\Delta = 1 - L$ and $1 = \Delta + L$. So if you have, as an example, $\beta_1 x_{t-1} - \beta_2 x_{t-2}$, where $\beta_1 \approx \beta_2$ (or there is no significant difference), that is, roughly equal coefficients with different signs on the lags, this can also be written as $\beta_1 \Delta x_{t-1} + (\beta_1 - \beta_2) x_{t-2}$, and if $|\beta_1| \approx |\beta_2|$ then $\beta_1 x_{t-1} - \beta_2 x_{t-2} \approx \beta_1 \Delta x_{t-1}$. Hence, you save one degree of freedom under these conditions.

VII. Test the stability of the model. Use the recursive estimation methods in PcGive. Remember that this is also useful during the identification of cointegrating vectors. For instance, it will allow you to see if you need to put (restricted) impulse dummies into the co-integrating vectors.

VIII. Test for rival models. Encompassing tests. Does your model explain the results, and the failures, of other rival models? Encompassing tests imply a comparison of the goodness of fit between different models, based on different explanatory variables. The reduction process might lead to several models with white noise residuals. To discriminate between these models they have to be tested against each other.

IX. Test for super exogeneity (rational expectations), if you want. Establish the stability of the conditional model without using ad hoc trends or dummies (= the criterion for stability). Test for instability in the marginal model. If it is unstable while the conditional model is stable you have super exogeneity. If the marginal model is unstable you can go one step further by forcing the marginal model to be stable, imposing trends and dummies in such a way that it becomes stable. Then put these trends and dummies into the conditional model and test if they are significant there.
If not, you have super exogeneity, and can reject parts of the assumptions in the rational expectations theory.

X. STOP when you find a model that is consistent with the data chosen, and where the parameters make "economic sense". In other words, "a well-defined statistical model": a model with white noise innovation residuals and stable parameters, which also encompasses all rival models and has an economic meaning. Encompassing means that your model explains other models and picks up more of the variation in the dependent variables.

XI. Report your results, both parameters and misspecification tests. It is not sufficient to report only $R^2$ and DW-values. Show the (corresponding) test summary and graphs of your data, in levels and first differences, of the error terms, etc. Be open minded and inform the reader of the tests and the problems you have found. Don't try to prove things which one can easily reject by a simple test. The rule is to minimize the number of assumptions behind your model, and remember that the errors are the outcome of the formulation of the model.

20. REFERENCES

Anderson, T.W. (1971) The Statistical Analysis of Time Series, John Wiley & Sons, New York.
Anderson, T.W. (1984) An Introduction to Multivariate Statistical Analysis, John Wiley & Sons, New York.
Banerjee, A., J. Dolado, J.W. Galbraith and D.F. Hendry (1993) Cointegration, Error-Correction and the Econometric Analysis of Non-stationary Data, Oxford University Press, Oxford.
Baillie, Richard T. and Tim Bollerslev (1994) "The Long Memory of the Forward Premium," Journal of International Money and Finance 13 (5), p. 565-571.
Baillie, Richard T., Tim Bollerslev and Hans Ole Mikkelsen (1996) "Fractionally Integrated Generalized Autoregressive Conditional Heteroskedasticity," Journal of Econometrics 74, p. 3-30.
Banerjee, A., R.L. Lumsdaine and J.H. Stock (1992) "Recursive and Sequential Tests of the Unit Root and Trend Break Hypothesis: Theory and International Evidence," Journal of Business and Economic Statistics ?.
Cheung, Y. and K. Lai (1993) "Finite Sample Sizes of Johansen's Likelihood Ratio Tests for Cointegration," Oxford Bulletin of Economics and Statistics 55, p. 313-328.
Cheung, Y. and K. Lai (1995) "A Search for Long Memory in International Stock Market Returns," Journal of International Money and Finance 14 (4), p. 597-615.
Davidson, James (1994) Stochastic Limit Theory, Oxford University Press, Oxford.
Dickey, D. and W.A. Fuller (1979) "Distribution of the Estimators for Autoregressive Time Series with a Unit Root," Journal of the American Statistical Association 74.
Diebold, F.X. and G.D. Rudebusch (1989) "Long Memory and Persistence in Aggregate Output," Journal of Monetary Economics 24 (September), p. 189-209.
Eatwell, J., M. Milgate and P. Newman, eds. (1990) Econometrics, Macmillan, London.
Eatwell, J., M. Milgate and P. Newman, eds. (1990) Time Series and Statistics, Macmillan, London.
Engle, Robert F., ed. (1995) ARCH: Selected Readings, Oxford University Press, Oxford.
Engle, R.F. and C.W.J. Granger, eds. (1991) Long-Run Economic Relationships. Readings in Cointegration, Oxford University Press, Oxford.
Engle, R.F. and B.S. Yoo (1991) "Cointegrated Economic Time Series: An Overview with New Results," in R.F. Engle and C.W.J. Granger, eds., Long-Run Economic Relationships. Readings in Cointegration, Oxford University Press, Oxford.
Ericsson, Neil R. and John S. Irons (1994) Testing Exogeneity, Oxford University Press, Oxford.
Fuller, Wayne A. (1996) Introduction to Statistical Time Series, John Wiley & Sons, New York.
Freund, J.E. (1972) Mathematical Statistics, 2nd ed., Prentice/Hall, London.
Granger and Newbold (1986) Forecasting Economic Time Series, Academic Press, San Diego.
Granger, C.W.J. and T. Lee (1989) "Multicointegration," Advances in Econometrics 8, p. 71-84.
Hamilton, James D. (1994) Time Series Analysis, Princeton University Press, Princeton, New Jersey.
Hargreaves, Colin P., ed. (1994) Nonstationary Time Series Analysis and Cointegration, Oxford University Press, Oxford.
Harvey, A. (1990) The Econometric Analysis of Time Series, Philip Allan, New York.
Hendry, David F. (1995) Dynamic Econometrics, Oxford University Press, Oxford.
Hylleberg, Svend (1992) Modelling Seasonality, Oxford University Press, Oxford.
Johansen, Søren (1995) Likelihood-Based Inference in Cointegrated Vector Autoregressive Models, Oxford University Press, Oxford.
Johnston, J. (1984) Econometric Methods, McGraw-Hill, Singapore.
Kwiatkowski, D., P.C.B. Phillips, P. Schmidt and Y. Shin (1992) "Testing the Null Hypothesis of Stationarity Against the Alternative of a Unit Root," Journal of Econometrics 54, p. 159-178.
Lo, Andrew W. (1991) "Long-Term Memory in Stock Market Prices," Econometrica 59 (5: September), p. 1279-1313.
Maddala, G.S. (1988) Introduction to Econometrics, Macmillan, New York.
Morrison, D.F. (1967) Multivariate Statistical Methods, McGraw-Hill, New York.
Pagan, A.R. and M.R. Wickens (1989) "Econometrics: A Survey," Economic Journal, 1-113.
Park, J.Y. (1990) "Testing for Unit Roots and Cointegration by Variable Addition," in T.B. Fomby and G.F. Rhodes (eds.) Co-integration, Spurious Regressions, and Unit Roots: Advances in Econometrics 8, JAI Press, New York.
Perron, Pierre (1989) "The Great Crash, the Oil Price Shock and the Unit Root Hypothesis," Econometrica 57, p. 1361-1401.
Phillips, P.C.B. (1988) "Reflections on Econometric Methodology," The Economic Record, Symposium on Econometric Methodology, December, p. 344-359.
Sjöö, Boo (2000) Testing for Unit Roots and Cointegration, memo.
Sowell, F.B. (1992) "Modeling Long-Memory Behavior with the Fractional ARMA Model," Journal of Monetary Economics 29 (April), p. 277-302.
Spanos, A. (1986) Statistical Foundations of Econometric Modelling, Cambridge University Press, Cambridge.
Wei, William W.S. (1990) Time Series Analysis. Univariate and Multivariate Methods, Addison-Wesley Publishing Company, Redwood City.

20.1 APPENDIX 1

A1 Smoothing Time Series - Lag Windows

In the discussion of non-stationarity, different ways of removing the trend in a time series were shown. If the trend is removed from, say, GDP we are left with swings in the data that can be identified as business cycles. In time series analysis such cycles are referred to as "low frequency" or periodic components. Applications of smoothing filters arise in empirical studies of real business cycles, and in modelling financial variables such as daily interest rates, where for example news about inflation and other variables occurs only at monthly intervals and might cause monthly cycles in the data (see the footnote to this appendix). Smoothing methods are, of course, closely related to spectral analysis. In this appendix we concentrate on two filters, or lag windows, which represent the "best", or most commonly used, methods for time series in the time domain.

Start from a time series, $r_t$. What we are looking for is some weights $b_i$ such that the filtered series $x_t$ is free of low frequency components,

$$x_t = \sum_{i=-k}^{+k} b_i r_{t+i}. \qquad (20.1)$$

In this formula the window is applied both backwards and forwards, implying a combination of backward and forward looking behavior. Whether this is a good or a bad thing depends totally on the series at hand, and is left to the judgment of the econometrician. The alternative is to let the window end at $i = 0$. The literature is filled with methods for calculating the weights $b_i$; in this appendix we will look at the two most commonly used methods, the Parzen window and the Tukey-Hanning window.
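As a rough illustration of equation (20.1), the sketch below applies a two-sided window to a series using the Parzen and Tukey-Hanning weight functions that this appendix defines. It is a minimal sketch under stated assumptions: the function names (`parzen_weight`, `tukey_hanning_weight`, `smooth`) are hypothetical, the weights are normalized to sum to one, and observations closer than $k$ periods to either end of the sample are simply dropped, which is only one of several possible end-point treatments.

```python
import math


def parzen_weight(i, k):
    """Parzen lag-window weight for offset i and window size k."""
    u = abs(i) / k
    if u <= 0.5:
        return 1.0 - 6.0 * u ** 2 + 6.0 * u ** 3
    if u <= 1.0:
        return 2.0 * (1.0 - u) ** 3
    return 0.0


def tukey_hanning_weight(i, k):
    """Tukey-Hanning lag-window weight for offset i and window size k."""
    if abs(i) <= k:
        return 0.5 * (1.0 + math.cos(math.pi * i / k))
    return 0.0


def smooth(r, k, weight):
    """Apply x_t = sum_{i=-k}^{k} b_i r_{t+i} with normalized weights b_i."""
    raw = [weight(i, k) for i in range(-k, k + 1)]
    total = sum(raw)
    b = [w / total for w in raw]          # normalize so the weights sum to one
    out = []
    for t in range(k, len(r) - k):        # skip points without a full window
        out.append(sum(b[i + k] * r[t + i] for i in range(-k, k + 1)))
    return out


series = [math.sin(t / 3.0) + (0.5 if t % 2 else -0.5) for t in range(40)]
smoothed = smooth(series, 4, parzen_weight)
print(len(series), len(smoothed))
```

Because the weights are normalized, a constant series passes through the filter unchanged, which is a convenient sanity check before applying the window to real data.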
The Parzen window is calculated using the following weights,

$$w_i = \begin{cases} 1 - 6(i/k)^2 + 6(|i|/k)^3, & |i| \le k/2, \\ 2(1 - |i|/k)^3, & k/2 \le |i| \le k, \\ 0, & |i| \ge k, \end{cases}$$

where $k$ is the size of the lag window. The Parzen window tries to fit a third degree polynomial to the original series. An alternative is the so-called Tukey-Hanning window, calculated as,

$$w_i = \begin{cases} \tfrac{1}{2}\left[1 + \cos(\pi i/k)\right], & |i| \le k, \\ 0, & |i| \ge k. \end{cases}$$

Like the Parzen window, the weights need to be normalized. Under optimal conditions, that is with correct identification of the underlying cycles, the difference between $x_t$ and $r_t$ will appear as a normal distribution. The problem is to determine the bandwidth, the size of the window, or $k$ in the formulas above. Unfortunately there is no easy way to determine this in practice. Choosing the size of the lag window involves a trade-off between the bias in the mean and the variance of the smoothed series. The larger the window, the smaller the variance but the higher the bias. In practice, make sure that the weights at the end of the window are close to zero, and then judge the best fit by comparing $x_t - r_t$. As a rule of thumb, choose a bandwidth equal to $N^{2/5}$, the number of observations ($N$) raised to the power of 2 over 5. The alternative rule is to set the bandwidth equal to $N^{1/4}$, or make a decision based on the last significant autocorrelation. Since the choice of window is always ad hoc in some sense, great care is needed if the smoothed series is going to be used to reveal correlations of 'great economic consequence'.

APPENDIX II

Testing the Random Walk Hypothesis using the Variance Ratio Test

For a random walk, $x_t = x_{t-1} + \varepsilon_t$, where $\varepsilon_t \sim NID(0, \sigma^2)$, we have that the variance is $\sigma^2 t$ and that the autocovariance function is $cov(x_t, x_{t-k}) = (t-k)\sigma^2$. It follows that the variance of the first difference is $var(x_t - x_{t-1}) = \sigma_1^2$, and that the variance of the $k$-period difference is $var(x_t - x_{t-k}) = k\sigma_1^2$. Define $\sigma_k^2 = \frac{1}{k} var(x_t - x_{t-k})$.
For a random walk we get that the estimated variance ratio $VR(k) = \hat{\sigma}_k^2 / \hat{\sigma}_1^2$ is not significantly different from one. The (unbiased) variance estimators are given as, for $k = 1$,

$$\hat{\sigma}_1^2 = \frac{1}{T-1} \sum_{t=1}^{T} (x_t - x_{t-1} - \hat{\mu})^2, \qquad (20.2)$$

and, for $k > 1$,

$$\hat{\sigma}_k^2 = \frac{1}{k(T-k+1)\left(1 - \frac{k}{T}\right)} \sum_{t=k}^{T} (x_t - x_{t-k} - k\hat{\mu})^2, \qquad (20.3)$$

where $\hat{\mu} = \frac{1}{T}(x_T - x_0)$, and $T$ is the sample size. Assuming homoscedasticity, the asymptotic variance of the random variable $VR(k)$ is,

$$\phi(k) = \frac{2(2k-1)(k-1)}{3kT}. \qquad (20.4)$$

Under these assumptions a test statistic is given as

$$Z(k) = \left[VR(k) - 1\right]\left[\phi(k)\right]^{-1/2} \to_a N(0,1), \qquad (20.5)$$

where $\to_a$ indicates that the test statistic converges to an asymptotic normal distribution. Since many time series, especially in finance, show time varying heteroscedasticity, the test statistic needs to be modified to take this into account. Lo and MacKinlay (1988) show that a heteroscedasticity consistent estimator of the asymptotic variance is given as,

$$\phi^*(k) = \sum_{j=1}^{k-1} \left[\frac{2(k-j)}{k}\right]^2 \hat{\delta}(j), \qquad (20.6)$$

where

$$\hat{\delta}(j) = \frac{\sum_{t=j+1}^{T} (x_t - x_{t-1} - \hat{\mu})^2 (x_{t-j} - x_{t-j-1} - \hat{\mu})^2}{\left[\sum_{t=1}^{T} (x_t - x_{t-1} - \hat{\mu})^2\right]^2}. \qquad (20.7)$$

The heteroscedasticity consistent test statistic is therefore,

$$Z^*(k) = \left[VR(k) - 1\right]\left[\phi^*(k)\right]^{-1/2} \to_a N(0,1). \qquad (20.8)$$

The test is performed by calculating sequences of $VR(k)$ as $k$ goes from 2 to $n$, where $n$ is some chosen fraction of the total number of observations ($VR(1)$ equals one by construction). Since the test statistics only hold asymptotically, Monte Carlo simulations of limited samples are recommended. Under the null hypothesis of a random walk, it will not be possible to reject the hypothesis that $Z(k)$ or $Z^*(k)$ is equal to zero.

Footnote 1 (to Appendix 1): To be clear, we are not saying that daily interest rates necessarily contain monthly cycles, only that it might be the case. One example is daily observations of the Swedish overnight interbank rate.

20.2 Appendix III Operators

When dealing with random variables and series of data, there are some operators that simplify the work.
This chapter presents the rules of some common operators applied to random variables and series of observations. These are the expectations operator, the variance operator, the covariance operator, the lag operator, the difference operator, and the sum operator (see footnote 2). The formal proofs behind these operators are not given; instead the chapter states the basic rules for using the operators.

All operators serve the basic purpose of simplifying the calculations and communication involving random variables. Take the expectations operator ($E$) as an example. Writing $E(x_t)$ means the same as "I will calculate the mean (or the first moment) of the observations on the random variable $\tilde{X}$ (see footnote 3). But I am not telling exactly which specific estimator I would be using, if I were to estimate the mean from empirical data, because in this context it is not important."

One important use of operators is in investigating the properties of estimators under different assumptions concerning the underlying process. For instance, the properties of the OLS estimator when the explanatory variables are stochastic, when the variables in the model are trending etc.

20.2.1 The Expectations Operator

The first operator is the expectations operator. This is a linear operator and is therefore easy to apply, as shown by the following rules. In the following, let $c$ and $k$ be two non-random constants, $\mu_i$ the mean of variable $i$, and $\sigma_{ij}$ the covariance between variable $i$ and variable $j$. It follows that,

$$E(c) = c,$$
$$E(c\tilde{X}) = cE(\tilde{X}) = c\mu_x,$$
$$E(k + c\tilde{X}) = k + cE(\tilde{X}) = k + c\mu_x,$$
$$E(\tilde{X} + \tilde{Y}) = E(\tilde{X}) + E(\tilde{Y}) = \mu_x + \mu_y,$$
$$E(\tilde{X}\tilde{Y}) = E(\tilde{X})E(\tilde{Y}) + cov(\tilde{X}, \tilde{Y}) = \mu_x \mu_y + \sigma_{xy},$$

where $\sigma_{xy} = 0$ if $\tilde{X}$ and $\tilde{Y}$ are two independent random variables. Compare with the expectation of $\tilde{X}^2$,

$$E(\tilde{X}^2) = E(\tilde{X})E(\tilde{X}) + var(\tilde{X}) = \mu_x^2 + \sigma_x^2.$$

The expectations operator is linear and straightforward to use, with one important exception: the expectation of a ratio.
This is an important exception since it represents a quite common problem: $E(\tilde{Y}/\tilde{X})$ is not equal to $E(\tilde{Y})/E(\tilde{X})$. The problem is that the numerator and the denominator are not necessarily independent and, in any case, $E(1/\tilde{X})$ is in general not equal to $1/E(\tilde{X})$. In this situation it is necessary to use the plim operator, or alternatively let the number of observations go to infinity and use convergence in probability or in distribution to analyze the outcome.

In the derivation of the OLS estimator, the following transformation is often used, when $\tilde{X}$ is viewed as given,

$$E\left[\tilde{Y}/\tilde{X}\right] = E\left[\tilde{X}^{-1}\tilde{Y}\right] = E(\tilde{W}\tilde{Y}).$$

A similar problem occurs in financial economics. If $F$ is the forward foreign exchange rate and $S$ is the spot rate, then $E(F/S) \neq E(F)/E(S)$. However, $E(\ln F - \ln S) = E(\ln F) - E(\ln S)$.

Footnote 2: The probability limit operator is introduced in a later chapter.
Footnote 3: Notice the difference between an estimator and an estimate.

20.2.2 The Variance Operator

For the variance operator, $var(\cdot)$ or $V(\cdot)$, we have the following rules,

$$var(c) = 0,$$
$$var(c\tilde{X}) = c^2 var(\tilde{X}) = c^2 \sigma_x^2,$$
$$var(k + c\tilde{X}) = c^2 var(\tilde{X}) = c^2 \sigma_x^2,$$
$$var(\tilde{Y} + \tilde{X}) = var(\tilde{Y}) + var(\tilde{X}) + 2cov(\tilde{Y}, \tilde{X}) = \sigma_y^2 + \sigma_x^2 + 2\sigma_{yx}.$$

If $\tilde{Y}$ and $\tilde{X}$ are independent the covariance term is zero and we get,

$$var(\tilde{Y} + \tilde{X}) = var(\tilde{Y}) + var(\tilde{X}) = \sigma_y^2 + \sigma_x^2.$$

20.2.3 The Covariance Operator

The covariance operator (cov) has already been used above. It can be thought of as a generalization of the variance operator. Suppose we have two elements of $\tilde{X}$, call them $\tilde{X}_i$ and $\tilde{X}_j$. The elements can be two random variables in a multivariate process, or refer to observations at different times ($i$) and ($j$) of the same univariate time series process. The covariance between $\tilde{X}_i$ and $\tilde{X}_j$ is

$$cov(\tilde{X}_i, \tilde{X}_j) = E\{[\tilde{X}_i - E(\tilde{X}_i)][\tilde{X}_j - E(\tilde{X}_j)]\} = \sigma_{ij}.$$

[To be completed!]
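The rules for the expectations, variance and covariance operators stated in this appendix can be checked numerically. The sketch below is an illustration, not part of the original text: it treats two short lists as the complete, equally likely outcomes of $\tilde{X}$ and $\tilde{Y}$, so that the population moments are simple averages and the rules hold exactly rather than approximately. The data values and helper names (`mean`, `cov`, `var`) are arbitrary choices.

```python
def mean(v):
    """Expectation over equally likely outcomes."""
    return sum(v) / len(v)


def cov(a, b):
    """Population covariance E{[A - E(A)][B - E(B)]}."""
    ma, mb = mean(a), mean(b)
    return mean([(ai - ma) * (bi - mb) for ai, bi in zip(a, b)])


def var(a):
    """Population variance, the covariance of a variable with itself."""
    return cov(a, a)


x = [1.0, 2.0, 4.0, 8.0]   # outcomes of X (illustrative)
y = [3.0, 1.0, 4.0, 1.0]   # matching outcomes of Y (illustrative)

# E(XY) = E(X)E(Y) + cov(X, Y)
lhs_product = mean([a * b for a, b in zip(x, y)])
rhs_product = mean(x) * mean(y) + cov(x, y)

# var(Y + X) = var(Y) + var(X) + 2 cov(Y, X)
lhs_var = var([a + b for a, b in zip(x, y)])
rhs_var = var(y) + var(x) + 2.0 * cov(y, x)

# var(k + cX) = c^2 var(X), here with k = 10 and c = 3
lhs_scaled = var([10.0 + 3.0 * a for a in x])
rhs_scaled = 9.0 * var(x)

# The exception: E(Y/X) differs from E(Y)/E(X)
mean_of_ratio = mean([b / a for a, b in zip(x, y)])
ratio_of_means = mean(y) / mean(x)

print(lhs_product, rhs_product)
print(lhs_var, rhs_var)
print(lhs_scaled, rhs_scaled)
print(mean_of_ratio, ratio_of_means)
```

The first three pairs agree, while the last pair does not, which is the point of the warning about ratios: the linear rules carry over to sums and scaled variables but not to quotients.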
The covariance matrix of a random variable $\tilde{X}$ with $p$ elements can be defined as,

$$E\{[\tilde{X} - E(\tilde{X})][\tilde{X} - E(\tilde{X})]'\} =
\begin{pmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp}
\end{pmatrix}$$

where $\sigma_{ii} = \sigma_i^2$, the variance of the $i$:th element.

As with the expectations and the variance operators, there are some simple rules. If we add constants $a$ and $b$ to $\tilde{X}_i$ and $\tilde{X}_j$,

$$\mathrm{cov}(\tilde{X}_i + a, \tilde{X}_j + b) = \mathrm{cov}(\tilde{X}_i, \tilde{X}_j).$$

If we multiply $\tilde{X}_i$ and $\tilde{X}_j$ with the constants $a$ and $b$ respectively, we get,

$$\mathrm{cov}(a\tilde{X}_i, b\tilde{X}_j) = ab\,\mathrm{cov}(\tilde{X}_i, \tilde{X}_j).$$

The covariance operator is sometimes also written as $C(\cdot)$.

20.2.4 The Sum Operator

In the following, $\sum$ represents the sum operator. The basic definition of the sum operator is,

$$\sum_{i=m}^{n} x_i = x_m + x_{m+1} + x_{m+2} + \dots + x_n, \tag{20.9}$$

where $m$ and $n$ are integers, and $m \leq n$. The important characteristic of the sum operator is that it is linear; all proofs of the following rules of the sum operator build on this fact. If $k$ is a constant,

$$\sum_{i=1}^{n} k x_i = k \sum_{i=1}^{n} x_i. \tag{20.10}$$

Some important rules deal with series of integer numbers, like a deterministic time trend $t = 1, 2, \dots, T$. These are of interest when dealing with integrated variables and determining the 'order of probability', that is the order of convergence, here indicated with $O(\cdot)$;

$$\sum_{t=1}^{T} t = 1 + 2 + \dots + T = \tfrac{1}{2}[T(T+1)] = \tfrac{1}{2}[(T+1)^2 - (T+1)] = O(T^2) \tag{20.11}$$

$$\sum_{t=1}^{T} t^2 = 1^2 + 2^2 + \dots + T^2 = \tfrac{1}{6}[T(T+1)(2T+1)] = \tfrac{1}{3}\left[(T+1)^3 - \tfrac{3}{2}(T+1)^2 + \tfrac{1}{2}(T+1)\right] = O(T^3) \tag{20.12}$$

$$\sum_{t=1}^{T} t^3 = 1^3 + 2^3 + \dots + T^3 = \tfrac{1}{4}[T^2(T+1)^2] = \tfrac{1}{4}\left[(T+1)^4 - 2(T+1)^3 + (T+1)^2\right] = O(T^4). \tag{20.13}$$

20.2.5 The Plim Operator

An estimator should be unbiased, have minimum variance and be consistent. In limited samples these requirements will not always be met. To investigate what happens as the sample size increases towards infinity we use probability limits.
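As a rough numerical preview of what consistency means, the sketch below estimates, by simulation, the probability that a sample mean lands within a small distance of the true mean, for growing sample sizes. Everything specific here is an assumption made for illustration: the normal data, the true mean `mu`, the tolerance `eps`, the number of replications `reps` and the function name `coverage` do not come from the text.

```python
import random

random.seed(2)
mu, eps = 1.0, 0.1   # hypothetical true mean and tolerance
reps = 500           # number of simulated samples per sample size


def coverage(n):
    """Share of replications in which |sample mean - mu| < eps."""
    hits = 0
    for _ in range(reps):
        xbar = sum(random.gauss(mu, 1.0) for _ in range(n)) / n
        if abs(xbar - mu) < eps:
            hits += 1
    return hits / reps


# The share should rise towards one as the sample size grows.
shares = {n: coverage(n) for n in (10, 100, 1000)}
for n, share in shares.items():
    print(f"n = {n:5d}  share within eps: {share:.3f}")
```

The shares are Monte Carlo estimates, so they vary with the seed; only their upward drift towards one is the point.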
If $\hat{\theta}$ is an estimate of the true parameter $\theta$, we say that the estimator $\hat{\theta}$ is consistent if the probability that we estimate $\theta$ as the sample size increases to infinity is equal to one. That is, as the sample size approaches the population size, we should end up with the parameter describing the population and nothing else. Formally this can be stated as: the estimator $\hat{\theta}$ is a consistent estimator of $\theta$ if, for arbitrarily small (positive) numbers $\varepsilon$ and $\delta$, there exists a sample size $n_0$ such that,

$$\mathrm{Prob}[\,|\hat{\theta} - \theta| < \varepsilon\,] > 1 - \delta \quad \text{for } n > n_0. \tag{20.14}$$

This can also be written as

$$\lim_{n \to \infty} \mathrm{Prob}[\,|\hat{\theta} - \theta| < \varepsilon\,] = 1 \tag{20.15}$$

or, in shorthand, as

$$\hat{\theta} \xrightarrow{p} \theta \quad \text{or} \quad \operatorname{plim} \hat{\theta} = \theta. \tag{20.16}$$

Probability limits are useful for examining the asymptotic properties of estimators of stationary processes. There are a few simple rules to follow, valid when the probability limits exist (and are non-zero where they are inverted),

$$\operatorname{plim}(ax + by) = a \operatorname{plim}(x) + b \operatorname{plim}(y), \tag{20.17}$$

$$\operatorname{plim}(xy) = \operatorname{plim}(x) \operatorname{plim}(y), \tag{20.18}$$

$$\operatorname{plim}(x/y) = [\operatorname{plim}(x)]/[\operatorname{plim}(y)], \tag{20.19}$$

$$\operatorname{plim}(x^{-1}) = [\operatorname{plim}(x)]^{-1}, \tag{20.20}$$

$$\operatorname{plim}(x^2) = [\operatorname{plim}(x)]^2. \tag{20.21}$$

These rules can be extended to matrices as,

$$\operatorname{plim}(AB) = \operatorname{plim}(A) \operatorname{plim}(B), \tag{20.22}$$

$$\operatorname{plim}(A^{-1}) = [\operatorname{plim}(A)]^{-1}. \tag{20.23}$$

These rules hold regardless of whether the variables are independent or not.

20.2.6 The Lag and the Difference Operators

The lag operator is defined as

$$L^n x_t = x_{t-n}.$$

It can also be used to move forward in a time series,

$$L^{-n} x_t = x_{t+n}.$$

With the lag operator it becomes possible to write long lag structures in a simpler way. From the lag operator follows the difference operator $\Delta = 1 - L$, such that

$$\Delta x_t = x_t - x_{t-1}.$$

Notice that the difference operator can be used as $x_t = \Delta x_t + x_{t-1}$, or as $x_{t-1} = x_t - \Delta x_t$. Differencing at higher order is done as

$$\Delta^d x_t = (1 - L)^d x_t.$$

Setting $d = 2$ we get,

$$\Delta^2 x_t = (1 - L)^2 x_t = (1 - 2L + L^2) x_t = x_t - 2x_{t-1} + x_{t-2} = \Delta x_t - \Delta x_{t-1},$$

which is not the same as the two-period difference $x_t - x_{t-2} = (1 - L^2) x_t$.

The letter $d$ indicates the order of differencing, which can be an integer such as -2, -1, 0, 1 or 2. It is also possible to use real numbers, typically between -1.5 and +1.5.
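For integer $d$, the lag and difference operators translate directly into list operations. The following sketch is one possible rendering under stated assumptions: the function names `lag` and `diff`, the use of `None` for undefined initial values, and the example series $x_t = t^2$ are hypothetical choices, not taken from the text.

```python
def lag(x, n=1):
    """Apply L^n: position t holds x_{t-n}; the first n values are undefined (None)."""
    return [None] * n + x[:len(x) - n]


def diff(x, d=1):
    """Apply (1 - L)^d for a non-negative integer d by differencing d times."""
    for _ in range(d):
        x = [current - previous for previous, current in zip(x[:-1], x[1:])]
    return x


x = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0]   # hypothetical series x_t = t^2

# Second difference computed by differencing twice ...
second_diff = diff(x, d=2)
# ... and by the expansion x_t - 2 x_{t-1} + x_{t-2}.
expanded = [x[t] - 2.0 * x[t - 1] + x[t - 2] for t in range(2, len(x))]
# The two-period difference x_t - x_{t-2} is a different object.
two_period = [x[t] - x[t - 2] for t in range(2, len(x))]

print(lag(x, 1))
print(second_diff, expanded, two_period)
```

Each round of differencing shortens the series by one observation, which is why `second_diff` has two fewer elements than `x`.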
With non-integer differencing we come to fractional integration and so-called long-memory series.

If variables are expressed in logs, which is typical in time series, the first difference will be a close approximation to per cent growth.

The lag operator is sometimes called the backward shift operator and is then indicated with the symbol $B^n$. The difference operator, defined with the backward shift operator, is written as $\nabla^d = (1 - B)^d$. Econometricians use the terms lag operator and difference operator with the symbols above. Time series statisticians often use the backward shift notation.
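The claim that the first difference of a logged series approximates per cent growth can be made concrete with a few numbers. This is a small sketch with hypothetical growth rates chosen for illustration; it uses the fact that if $x_t = (1 + g)x_{t-1}$ then $\ln x_t - \ln x_{t-1} = \ln(1 + g)$.

```python
import math

growth_rates = [0.01, 0.05, 0.20]   # hypothetical one-period growth rates g

# Gap between the true growth rate g and the log difference ln(1 + g).
gaps = {}
for g in growth_rates:
    log_diff = math.log(1.0 + g)
    gaps[g] = g - log_diff
    print(f"growth {g:.2%}  log difference {log_diff:.4f}  gap {gaps[g]:.4f}")
```

The gap is negligible for growth of a few per cent and becomes noticeable for large changes, so the approximation is best suited to series that move by small amounts per period.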