MA Econometrics I
Dr. Arnaud Chevalier
Department of Economics
University College Dublin
September 2003
Table of Contents
INTRODUCTION
LECTURE 2: THE CLASSICAL LINEAR REGRESSION MODEL
LECTURE 3: MULTIVARIATE MODEL AND HYPOTHESIS TESTING
1 Introduction to Econometrics and the Classical Linear Regression Model
Wooldridge, Chapter 2,3
1.1 Introduction to Econometrics
What is econometrics and what do econometricians do? Basically, they try to answer questions as diverse as making forecasts, assessing the safety of nuclear power plants, testing theories to make extra profits on the stock exchange, or evaluating the efficiency of policies. So you may need econometrics not only to get your Master's but also in various jobs (industry, banking, consultancy, academia…).
Compared with mathematical statistics, the difference is that econometricians rely mostly on observational rather than experimental data. This creates specific problems, both theoretical and empirical. So for this course, we will put the emphasis on how to conduct an empirical econometric analysis, and then introduce the necessary theory.
Because we use data to answer quantitative questions, our answers will vary if we use a different set of data. Thus, not only should we provide an answer but also a measure of how precise this answer is. (Unlike in the Hitchhiker's Guide to the Galaxy, where the answer is always 42!). Generally, an econometric analysis is conducted to test some hypotheses, so after presenting the simple model, we will move on to the basics of testing.
Data specificity means that we will have to depart from the simple model in more than one way, which is what the various remaining chapters are all about. As Malinvaud (1966) put it: 'The art of the econometrician consists in finding the set of assumptions which are both sufficiently specific and sufficiently realistic to allow him to take the best possible advantage of the data available to him'. Hendry says not to confuse econometrics with economic-tricks or economystics!!!
Whilst experimental data would greatly simplify our task, such data are rarely available in social sciences (for ethical reasons). In this course, we will be concerned with two types of data: cross section, where typically a large (random) sample of the population of interest is surveyed at one point in time, and time series, where the evolution of a few variables over time is recorded. Cross sections can also be pooled across time, in order to increase the sample size and, more importantly, to assess how a key relationship changes over time (for example after a policy change). Cross section data is typically used in micro-economic analysis while time series are more associated
with macro-analysis and finance. The analysis of time series is complicated by issues of
seasonality, trend and persistence, but we will see how to deal with these issues.
Other types of data exist, such as panel (longitudinal) data, where observations on the same units are recorded repeatedly over time (for example, a yearly survey of the same households), and duration data, where the econometrician is interested in the lapse of time between two events (e.g. between a treatment and death, or the length of an unemployment spell). Each data type allows you to answer specific questions but has its own difficulties. So the first task in any project is to determine what the appropriate data are to answer your question of choice. And remember, bad data will always make the life of the econometrician much harder.
1.2 The Paradigm of Econometrics
We posit the existence of an underlying "structure", a "true" model of an economic phenomenon. We have long studied economic theories of optimization, models of labor supply, demand equations, etc. Moving on to econometrics allows a number of other issues to be formally assessed, such as understanding covariation, predicting outcomes based on knowledge and partial forecasts, and controlling future outcomes using knowledge of relationships.
The other debate, of course, concerns the overall merits of econometrics as a means of testing economic theory - a number of issues arise in the econometrician's approach (often as a means to simplify estimation and maintain statistical validity) that cause problems. Looking back as far as the 1930s, Keynes entered a debate with Tinbergen (the father of the econometric discipline and its first Nobel prize winner) in the Economic Journal which covers many of the issues we will deal with in this course.
(i) Omitted Variables
Keynes (p.559) quotes from Tinbergen's book that "The part which the statistician can play in the process of the analysis must not be misunderstood. The theories which he submits to examination are handed over to him by the economist; and with the economist the responsibility for them must remain; for no statistical test can prove a theory to be correct". Tinbergen does go on to state, again reported by Keynes (p.559), that "It can, indeed, prove that theory to be incorrect, or at least incomplete, by showing that it does not cover a particular set of facts."
Keynes claims that this implies that the statistician therefore requires “the economist having
furnished ... a complete list” to be able to assign statistical properties to the estimators.
(ii) Unobservable Variables
Tinbergen (quoted by Keynes (p.561)) states that “The inquiry is, by its nature, restricted to the
examination of the measurable phenomena. Non-measurable phenomena may, of course, at times
exercise an important influence on the course of events; and the result of the present analysis
must be supplemented by such information about the extent of that influence as can be obtained
from other sources". Keynes lists as unmeasurable and potentially important variables "political, social and psychological, including such things as government policy, the progress of invention and the state of expectation".
(iii) Linearity
Tinbergen (quoted by Keynes (p.563)) notes that “As a rule, curvilinear relations are considered
in the following studies only in so far as strong evidence exists. A rough way of introducing the
most important features of curvilinear relations is to use changing coefficients ... Another way ..
is to take squares of variates or still other functions, among the ‘explanatory series’ ”. However,
Keynes is sceptical arguing that (p. 564) “it is a very drastic and indeed usually improbable
postulate to suppose that all economic forces are of this character, producing independent changes
in the phenomenon under investigation which are directly proportional to the changes themselves;
indeed this is ridiculous”.
(iv) Specification
Keynes notes that (p.155): “It will be remembered that the seventy translators of the Septuagint
were shut up in seventy separate rooms with the Hebrew text and brought out with them, when
they emerged, seventy identical translations. Would the same miracle be vouchsafed if seventy
multiple correlators were shut up with the same statistical material?”
(v) Structural Change
Keynes believes that to draw any use from the statistical analysis it is important that the model be
stable (p.567) “The first step, therefore, is to break up the period under investigation into a series
of sub-periods , with a view to discovering whether the results of applying our method to the
various sub-periods taken separately are reasonably uniform.”
In fact he argues that economic problems are sufficiently difficult and unstable that (p.567) the main objection to "the application of the method of multiple correlation to complex economic problems lies in the apparent lack of any adequate degree of uniformity in the environment".
In summary, therefore Keynes believes a lot more thought is required and he says of Tinbergen
(and of statisticians, in general) (p.559) that “he is much more interested in getting on with the
job than in spending time in deciding whether the job is worth getting on with”.
In Hendry's 'Alchemy or Science' we get some definitions. Science deals with material phenomena and is mainly based on observation, experimentation and induction (induction: inferring a general law from particular instances); alchemy is the idea of being able to extract gold or silver from base metals via a chemical process. According to Hendry the definition of econometrics is (p.388) "An analysis of the relationship between economic variables ... by abstracting the main phenomena of interest and stating theories thereof in mathematical form."
Hendry adds to the list of Keynes points the pressing need to improve the quality of the data,
which has seriously lagged behind the increasingly complex and sophisticated techniques
available to the applied researcher for analysing such data. In fact, the techniques have only
become so sophisticated in an attempt to address the issue of data deficiencies.
However, econometrics has a poor reputation. Worswick - Econometricians are not "engaged in foraging for tools to arrange and gather facts, so much as making a marvellous array of pretend-tools" (1972); Brown - "running regressions between time series is only likely to deceive" (1972); Leontief - Econometrics is "an attempt to compensate for glaring weaknesses of the data available to us by the widest possible use of more and more sophisticated statistical techniques" (1971); Coase - "If you torture the data long enough, nature will confess" (p.37); Leamer - "Econometricians, like artists, tend to fall in love with their models".
Leamer draws a distinction between econometrics and science: science is where controlled experiments are undertaken; while there is clearly some uncertainty associated with these experiments, the error is small and the conclusions are therefore "tight". In economics no experimentation is practical, the possibilities/uncertainties are boundless and one is only constrained in the variables used by one's imagination; even here there may be "influential monsters lurking beyond our immediate field of vision", and consequently the errors are potentially enormous. This implies that in economics the researcher would either:
(i) look at the data first; however, to look at the data might bias ones judgement, as theories based
on the data are difficult to reject based on looking at the data. To illustrate this point he quotes the
applied researcher (p.40) who thinks “that a certain coefficient should be positive, and their
reaction to the anomalous result of a negative coefficient is to find another variable to include in
the equation so that the estimate is positive. Have they found evidence that the coefficient is
positive?”
This is particularly concerning given that the researcher’s art (p.36) “is practised at
the computer terminal (and) involves fitting many, perhaps thousands, of statistical models. One
or several that the researcher finds pleasing are selected for reporting purposes” - selective
reporting is clearly problematic.
or
(ii) require an infinitely wide field of vision in order to make discoveries, or as Leamer (p.40)
notes “the great human discoveries are made when the horizon is extended for reasons that cannot
be predicted in advance and cannot be computerised. If you wish to make such discoveries, you
will have to poke at the horizon, and poke again.”
To illustrate the problem of undertaking applied research with a strong prior view of the model, Leamer uses the example of the effects of execution (denoted PX) on murder rates (denoted M) in 44 states in the U.S. (of which 35 have executions and 9 are non-executing). He assumes there are 5 researchers, each with a strong prior described below:
No   Individual            Important (key) variables
1    Right winger          Punishment works (PC, PX, T)
2    Rational maximiser    Economic return to crime (PC, PX, T, W, X, U, LF)
3    Eye-for-an-eye        Probability of execution (PC, PX)
4    Bleeding Heart        Economic hardship (W, X, U, LF)
5    Crime of Passion      Punishment doubtful (W, X, U, LF, NW, AGE, URB, MALE, FAMHO, SOUTH)
For example, the right winger simply believes the main determinants of murder rates are the punishment variables (PC - probability of conviction, PX - probability of execution, and T - median sentence served for murder), while the Bleeding Heart is solely interested in the economic deprivation variables (W - median income, X - % of families below 0.5 of W, U - unemployment rate, LF - labour force participation). Each researcher views the listed variables as the important variables (denoted I) to be included in any model and is prepared to take any linear combination of the other, doubtful (denoted D) variables (see Table 2). Leamer obtains Table 3, which reports the sensitivity of the coefficient on PX under various alternative models. Only researchers 1 and 2 obtain a consistent coefficient on PX (<0). All others do not find a consistent coefficient and could therefore find any result they please.
McAleer argues that the problem with investigating the fragility of estimates can be seen in the following example. If the true model is

y_t = α + β₁x_t + β₂z_t + u_t      (T)

and you estimate

y_t = α* + β₁*x_t + v_t      (E)

then the OLS estimate b₁* is a biased estimate of β₁, with the bias depending on the sign of β₂ and cov(x_t, z_t) - and this bias could be substantial. Thus there is no reason to believe that if z_t is a doubtful variable (D in Leamer's terminology) the coefficient estimate on x_t should be insensitive to its exclusion. McAleer et al. suggest a 5-step methodology for careful analysis:
(i) Consistency with theory
(ii) Significance, both statistical and economic
(iii) Indexes of adequacy ("test, test, test" of Hendry)
(iv) Fragility or sensitivity (to new data rather than EBA)
(v) Encompassing (should dominate all competing models)
McAleer explains the similarity between applied econometric analysis and criminal deduction by
noting that “Both criminal investigation and econometric analysis involve determining the
importance of and collection of data, and a final explanation of the data after previous
explanations (if any) have been rejected against the available evidence”. However, the two
disciplines depart in that “Like econometricians, Sherlock Holmes is in search of the truth that
generated the data. Confronted with a crime or problem, Holmes assiduously gathers data which
are needed for a suitable explanation. Unlike econometrics, however, his searches will frequently
yield the truth and the culprit will be apprehended”.
The difference is that somebody committed the murder (crime), while in economics the truth, the process that generated the data, does not exist and at best you might get an adequate approximation of the process that is not obviously incorrect. However it is still useful to consider how Holmes undertakes a criminal investigation, and this is divided into 5 sections.
(i) Theory
“I have no data yet. It is a capital mistake to theorise before one has data. Insensibly one begins to
twist facts to suit theories, instead of theories to suit facts”. (Sherlock Holmes to Dr. Watson in A
Scandal in Bohemia). Holmes relied solely upon data in formulating his theories. He had no prior
beliefs before starting a case as this would necessarily limit the number of possible suspects. This
idea contrasts markedly with that of the classical procedure for conducting statistical inference,
whereby formulation of a theory always precedes examination of the data. However, one should,
in the final analysis, ensure that “Data … be given the last word in deciding the validity of a
theory” (p.322).
(ii) Quality of data
"It is of the highest importance in the art of detection to be able to recognise out of a number of facts which are incidental and which vital" (Sherlock Holmes to Colonel Hayter in The Adventure of the Reigate Squire). Holmes, like econometricians, did not have the possibility of conducting experiments, but would always be prepared to test his theories against new data. Irrelevant data would always be likely to be rejected. Holmes (like econometricians) was interested in true (not spurious) relationships that explained how crimes (the dependent variable) were perpetrated (explained).
(iii) Truth
Unlike in economics, the idea of truth exists for Holmes, in that somebody committed the crime: "We must fall back upon the old axiom that when all other contingencies fail, whatever remains, however improbable, must be the truth" (Sherlock Holmes to Dr. Watson in The Adventure of the Bruce-Partington Plans). However, the applied econometrician seeks to explain an unknown, and frequently unobservable, relation between numerous interdependent factors - economic puzzles are far more complex than criminal problems. Even if a relationship exists, there is no reason why it should remain constant over time.
(iv) Reconciliation with data
"What do you think of my theory? … When new facts come to our knowledge which cannot be covered by it, it will be time enough to reconsider it" (Sherlock Holmes to Dr. Watson in The Adventure of the Yellow Face). Holmes was frequently willing to change his position to examine the effects on his theory. This is known as statistical robustness in economics, which requires that the model be robust to new data and be reconciled with competing theories.
(v) Testing
"… it is well to test everything" (Sherlock Holmes to Dr. Watson in The Adventure of the Reigate Squire). For Holmes, like econometricians, it is important to test all parts of any theory for weak links. It is to the credit of Holmes (and some econometricians) that he is prepared to abandon a cherished theory in the light of new data which contradict it. However, Holmes makes these decisions in the face of virtual certainty, rather than the very hazy and uncertain world of economics. In fact this uncertainty has ensured some models have survived well beyond their use-by date.
Topic II: The Classical Linear Regression Model
2.1 The linear regression model
We believe there is a causal link between class size and achievement. However, there is a tradeoff: more teachers mean a higher wage bill. To justify her case, the local headteacher asks you to estimate the effect of a change in class size on test scores.
Basically, you are thinking of the following relationship:

β₁ = ΔTest / ΔSize      (2.1)

(2.1) can be seen as the definition of a slope coefficient; thus a straight line relating test score to class size can be written as:

Test = β₀ + β₁·Size      (2.2)

where β₀ is the intercept (the test score with a class size of 0!! In some applications, the intercept is not meaningful).
All smug, you go back to the headteacher, who tells you off for not including various other characteristics of a school that will affect test performance. She is right (and we will see in the next chapter how to deal with her criticism), but for the moment all we want to say is that (2.2) is true on average; all these factors (some of them unobservable) are grouped into an error term.
So in general, we believe there is a linear relationship between an independent variable (X) and a
dependent variable (Y). This relationship holds on average, thus for each observation (i) there
exists an error term (ui). Thus, we have the generic equation:
Y_i = β₀ + β₁X_i + u_i      (2.3)

[insert Slide 6, from Dougherty, CD1] (1)

β₀ + β₁X is called the population regression line. β₀ and β₁ are the coefficients (parameters) to be estimated using the available data.
2.2 Estimating the coefficients of the linear regression model

As in statistics, we do not know the value of β₁ in the population, but we estimate it from a sample of data. For example, looking at the data on test scores and the pupil teacher ratio, how do we get to estimate β₀ and β₁? An eyeball option is not the solution.
Figure: Distribution of test score and pupil teacher ratio
[Figure: scatter of test score (roughly 600-700) against student teacher ratio (roughly 15-25).]

Variable   Obs   Mean       Std. Dev.   Min      Max
str        420   19.64043   1.891812    14       25.8
testscr    420   654.1565   19.05335    605.55   706.75

correlation str/testscr = -0.23
So let’s first start with a simplistic example, where we only have three observation points.
Insert slide 2 from Dougherty CD2 (2)
The Ordinary Least Squares (OLS) estimator chooses the regression coefficients so that the estimated regression line is "as close as possible" to the observed data. How do we measure closeness?
To measure closeness, we rely on the residuals: The difference between the predicted and
observed value of the outcome. So let’s pretend that we have estimates of  0 and  1 , say b0 and
b1. We can define the fitted (predicted) value of Yi as:
Ŷ_i = b₀ + b₁X_i      (2.4)

Then the residuals are:

e_i = Y_i − Ŷ_i      (2.5)

(! do not confuse the error, u_i = Y_i − β₀ − β₁X_i, with the residual, e_i = Y_i − b₀ − b₁X_i …. Expand on this)
Insert slide 10 from Dougherty CD1 (3)
The OLS estimates of β₀ and β₁, b₀ and b₁, minimise the sum of squared residuals. Why minimise the sum of squared residuals?
- Minimising the sum of the residuals does not work: positive and negative residuals cancel out.
- Minimising the sum of the absolute values of the residuals leads to more complicated calculations.
So how do we calculate b₁ and b₂?
Insert Dougherty 4 CD2, (4)
So we want to minimise the Residual Sum of Squares (RSS).

RSS = e₁² + e₂² + e₃² = (3 − b₁ − b₂)² + (5 − b₁ − 2b₂)² + (6 − b₁ − 3b₂)²

    = 9 + b₁² + b₂² − 6b₁ − 6b₂ + 2b₁b₂
    + 25 + b₁² + 4b₂² − 10b₁ − 20b₂ + 4b₁b₂
    + 36 + b₁² + 9b₂² − 12b₁ − 36b₂ + 6b₁b₂

    = 70 + 3b₁² + 14b₂² − 28b₁ − 62b₂ + 12b₁b₂

To minimise RSS, the partial derivatives of RSS with respect to b₁ and b₂ should be equal to 0. (For a minimum, the second derivatives should also be positive.)

So we have:

∂RSS/∂b₁ = 0:  6b₁ + 12b₂ − 28 = 0
∂RSS/∂b₂ = 0:  12b₁ + 28b₂ − 62 = 0

=> b₁ = 1.67, b₂ = 1.50
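A minimal numerical check of this example (Python/numpy sketch; the data points (X, Y) = (1, 3), (2, 5), (3, 6) are those implied by the RSS expression above, not stated explicitly in the slide):

import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([3.0, 5.0, 6.0])

# Solve the two first-order conditions 6*b1 + 12*b2 = 28 and 12*b1 + 28*b2 = 62
b1, b2 = np.linalg.solve(np.array([[6.0, 12.0], [12.0, 28.0]]),
                         np.array([28.0, 62.0]))
print(b1, b2)                     # 1.666..., 1.5

# Same answer from least squares on the design matrix [1, X]
coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(3), X]), Y, rcond=None)
print(coef)                       # [1.666..., 1.5]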
In a more general case, we will have more than 3 observations, so RSS will be defined as:

RSS = e₁² + ... + e_n² = (Y₁ − b₁ − b₂X₁)² + ... + (Y_n − b₁ − b₂X_n)²

Expanding each square and summing over the n observations:

RSS = ΣY_i² + nb₁² + b₂²ΣX_i² − 2b₁ΣY_i − 2b₂ΣX_iY_i + 2b₁b₂ΣX_i

∂RSS/∂b₁ = 0:  2nb₁ − 2ΣY_i + 2b₂ΣX_i = 0
=>  nb₁ = ΣY_i − b₂ΣX_i
=>  b₁ = Ȳ − b₂X̄

∂RSS/∂b₂ = 0:  2b₂ΣX_i² − 2ΣX_iY_i + 2b₁ΣX_i = 0
=>  b₂ΣX_i² − ΣX_iY_i + b₁ΣX_i = 0

Substituting b₁ for its value, we get:

b₂ΣX_i² − ΣX_iY_i + (Ȳ − b₂X̄)ΣX_i = 0

So we get:

b₂ΣX_i² − ΣX_iY_i + (Ȳ − b₂X̄)nX̄ = 0
b₂(ΣX_i² − nX̄²) = ΣX_iY_i − nX̄Ȳ
b₂[(1/n)ΣX_i² − X̄²] = (1/n)ΣX_iY_i − X̄Ȳ
b₂ Var(X) = Cov(X, Y)

b₂ = Cov(X, Y) / Var(X)

where X̄ = ΣX_i / n.
Alternatively, b₂ can be written:

b₂ = [(1/n)Σ(X_i − X̄)(Y_i − Ȳ)] / [(1/n)Σ(X_i − X̄)²] = Σ(X_i − X̄)(Y_i − Ȳ) / Σ(X_i − X̄)²

or

b₂ = [(1/n)ΣX_iY_i − X̄Ȳ] / [(1/n)ΣX_i² − X̄²] = [ΣX_iY_i − nX̄Ȳ] / [ΣX_i² − nX̄²]

The OLS estimates are given by:

b₁ = Ȳ − b₂X̄
b₂ = Cov(X, Y) / Var(X)      (2.6)
Which are all equivalent. In the next lecture, we will introduce matrix notations.
So back to our pupil teacher example. Recall that:

Corr(X, Y) = Cov(X, Y) / √(Var(X)·Var(Y))

So using the summary statistics and the correlation coefficient, we can calculate:

b₂ = −2.28

and

b₁ = 654.1565 − (−2.28)·19.64043 = 698.9

An increase in the number of students per class by 1 is associated on average with a reduction in test score of 2.28 points. Alternatively, we can predict that in a school where the pupil teacher ratio averages 20, the average test score will be 653.3.
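A short Python sketch reproducing these numbers from the summary statistics above (the correlation is taken as the rounded -0.23, so the results differ very slightly from the text):

sd_str, sd_test = 1.891812, 19.05335
mean_str, mean_test = 19.64043, 654.1565
corr = -0.23

cov_xy = corr * sd_str * sd_test        # Cov(X, Y)
b2 = cov_xy / sd_str**2                 # Cov(X, Y) / Var(X), about -2.3
b1 = mean_test - b2 * mean_str          # about 699

print(b1, b2)
print(b1 + b2 * 20)                     # predicted score for STR = 20, about 653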
[Figure: scatter of test score against student teacher ratio with the fitted OLS regression line.]
Econometrics is all about providing answers to practical problems, so what should our advice to the head teacher be? Let's assume that the school has the median characteristics of our sample: str = 19.7, test = 654.5.
Table 2: Distribution of student teacher ratio and test score

Percentile   STR        Test
1%           15.13898   612.65
5%           16.41658   623.15
10%          17.34573   630.375
25%          18.58179   640
50%          19.72321   654.45
75%          20.87183   666.675
90%          21.87561   679.1
95%          22.64514   685.5
99%          24.88889   698.45
Reducing the str by 2 pupils will move the school close to the best 10% on the STR, and will increase the average test score of the school to about 659 points (just short of the 60th percentile), so the school will almost become one of the top 40% performers in the country, with close to the best 10% student teacher ratio. Depending on the cost of extra teachers and how much parents value extra test points, the decision to hire new teachers may or may not be cost-beneficial.
What if the head teacher had more radical plans and wanted to cut the STR to 10? Here we cannot say anything, because we do not observe schools with such a small ratio. Our inference would be based solely on the linearity assumption of OLS (like our prediction of the test score in a class with no pupils). This is an identification problem (Manski, 1995) - draw an example. Identification problems cannot be solved by collecting more of the same data. Inference can only be safely made for values for which we have some data; out-of-sample predictions will be unreliable. The relationship between STR and Test may be really different for very small or very large values of the STR, but with the available data we have no way of knowing.
So we have solved our first econometric problem; let's now review what we have learnt and state clearly the assumptions that are necessary to make this inference.
2.3 Assumptions behind OLS

Here are the assumptions needed for OLS to provide an appropriate estimator of the unknown regression coefficients β₀ and β₁.

- Assumption 1: The conditional mean of u is 0

E(u_i | X_i) = E(u_i) = 0      (A2.1)
E(u) = 0: the unobservable characteristics affecting the outcome of interest have a mean of 0 on average. This is not really restrictive and can be obtained by normalisation: for example, teacher quality affects test scores, but within our sample the mean effect of teacher quality is normalised to 0.
E(u|X) = 0: this is the most important part of the assumption; it states that the average value of u does not depend on the value of X. The observed characteristics and the error terms are uncorrelated. If X_i and u_i are correlated then the conditional mean assumption is violated, and OLS estimates are biased.
Figure 1: Conditional probability distribution and population regression function
[Figure: densities f(u) of Y at X₁, X₂ and X₃, centred on the population regression line.]
The conditional mean assumption is also crucial to derive that:

E(Y | X) = β₀ + β₁X

The population regression function is equal to the conditional mean of Y: E(Y|X) is a linear function of X. Thus, we can make statements such as: an increase of X of 1 unit leads to a change of Y of β₁.
- Assumption 2: (X_i, Y_i) are independently and identically distributed.

Observations have been drawn at random from the population and are independent of each other. This assumption is usually broken in time series: interest rates today are not independent of their value yesterday.

- Assumption 3: The population variance of u is constant for all i.

Formally, this condition can be written as: Var(u_i) = σ² for all i.

Of course σ² is unknown. This property is known as homoskedasticity (constant variance). Draw heteroskedasticity on Figure 1.
These assumptions are sometimes referred to as the Gauss-Markov conditions.
- Assumption 4: Normality

One usually assumes that the error term is normally distributed. This will be especially useful for hypothesis testing.
2.4 Properties of OLS
- Why, among the numerous estimators created by econometricians, is OLS the most popular?
Let's define a few more concepts:
- Total Sum of Squares (SST): SST = Σᵢ₌₁ⁿ (y_i − ȳ)²

SST measures the dispersion of the outcome of interest around its mean.

- Explained Sum of Squares (SSE): SSE = Σᵢ₌₁ⁿ (ŷ_i − ȳ)²

SSE measures the sample variation in the fitted values ŷ_i (note that the mean of the fitted values equals ȳ).

- Residual Sum of Squares (SSR): SSR = Σᵢ₌₁ⁿ e_i²

SSR measures the sample variation in the residuals.

We have SST = SSE + SSR.

To measure how well OLS fits the data, we can look at the ratio of the explained and total variance; this ratio is called the R².

R² = SSE / SST = 1 − SSR / SST

If the fit is perfect, SSR = 0 and R² = 1.
If the OLS fit is bad, SSE = 0 and R² = 0.
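A minimal Python sketch of the variance decomposition and R² on simulated data (the data and coefficient values here are purely illustrative, not from the notes):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 0.5 * x + rng.normal(size=200)

b2 = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # Cov(X,Y)/Var(X)
b1 = y.mean() - b2 * x.mean()
y_hat = b1 + b2 * x
e = y - y_hat

SST = np.sum((y - y.mean())**2)
SSE = np.sum((y_hat - y.mean())**2)
SSR = np.sum(e**2)

print(np.isclose(SST, SSE + SSR))     # True: SST = SSE + SSR
print(SSE / SST, 1 - SSR / SST)       # both equal R-squared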
OLS provides unbiased estimates of the regression coefficients.

Proof (using b₁ for the intercept and b₂ for the slope, as in the derivation above):

b₂ = Cov(X, Y) / Var(X) = Cov(X, β₁ + β₂X + u) / Var(X)
   = [Cov(X, β₁) + Cov(X, β₂X) + Cov(X, u)] / Var(X)
   = [0 + β₂Cov(X, X) + Cov(X, u)] / Var(X)
   = β₂ + Cov(X, u) / Var(X)

E(b₂) = E[β₂ + Cov(X, u)/Var(X)] = E(β₂) + E[Cov(X, u)]/Var(X) = β₂

b₁ = Ȳ − b₂X̄ = β₁ + β₂X̄ + ū − b₂X̄

E(b₁) = E(β₁) + E(β₂X̄) + E(ū) − E(b₂X̄)
      = β₁ + β₂X̄ + 0 − X̄E(b₂) = β₁ + β₂X̄ − β₂X̄ = β₁

So the OLS estimates are unbiased estimates of the intercept and the slope. In fact it can be proven that under the Gauss-Markov assumptions, OLS is the Best Linear Unbiased Estimator (BLUE).

Remember that the values of the estimates are specific to the sample used. If we have by chance used a non-representative sample, our point estimate will be far from the true value. If our sample is representative, then the larger the sample, the closer to the true value we are likely to be (Central Limit Theorem).
Estimated variance:

b₀ and b₁ are random variables (they depend on the sample), so they have a distribution. Here, we only state the expression of their variance:

Var(b₀) = (σ²/n)·[1 + X̄²/Var(X)]

Var(b₁) = σ² / (n·Var(X))

These formulae are only valid in the presence of homoskedasticity and in the hypothetical case where σ² is known. For all purposes, we are mostly interested in Var(b₁): i) the larger the error variance, the larger the variance of our estimate; ii) the more variability in the independent variable, the more precise our estimate.

[slide 10, in Dougherty, C3D3] (5)

σ̂² = (1/(n−2)) Σᵢ₌₁ⁿ e_i² = SSR / (n−2)

σ̂ is interchangeably called the standard error of the regression or the root mean squared error (in Stata). Standard errors of our estimates can now be produced.
Lecture 3: Multivariate model and hypothesis testing
Last week, we studied the relationship between class size and test score, but we put a cautionary note on our results: the hypothesis that class size (X_i) and the error term (u_i) were uncorrelated appeared dubious. If it is untrue, we know that the OLS estimates are biased; in the notation of the previous derivation, the slope estimate satisfies:

b₂ = β₂ + Cov(X, u) / Var(X)
This problem was due to omitted factors, which we think affect class size and student scores. For
example, richer parents may put their children in schools with smaller class size, and pay for extra
tuition. To limit omitted variable bias, we use multi-variate regression. By including more
regressors, we can estimate the effect of class size on score, holding constant these other
variables.
In the second part of this lecture, we build confidence intervals for our estimates and review
various tests.
3.1 A simple example
Say that we are interested in the effect of education on earnings but we are concerned that ability
also affects education and earnings, so in order to estimate the unbiased effect of education on
earnings we want to control for ability. If ability is not included in the model, then the error term
will be correlated with education, and lead to biased estimate.
Thus, we specify the following model:

ln Y = β₀ + β₁S + β₂AS + u      (3.1)

where S is years of schooling and AS is an ability score.
[Figure: effect of education (S) and ability (AS) on log earnings - the combined effect β₀ + β₁S + β₂AS + u decomposes into the pure effect of education and the pure effect of ability.]
To estimate the coefficients we, as before, minimise the residual sum of squares (RSS):

RSS = Σᵢ₌₁ⁿ e_i²

where

e_i = Y_i − Ŷ_i = Y_i − b₁ − b₂X_2i − b₃X_3i
The first order conditions for a minimum are:

∂RSS/∂b₁ = −2 Σᵢ₌₁ⁿ (Y_i − b₁ − b₂X_2i − b₃X_3i) = 0

∂RSS/∂b₂ = −2 Σᵢ₌₁ⁿ X_2i (Y_i − b₁ − b₂X_2i − b₃X_3i) = 0

∂RSS/∂b₃ = −2 Σᵢ₌₁ⁿ X_3i (Y_i − b₁ − b₂X_2i − b₃X_3i) = 0
hence:

b₁ = Ȳ − b₂X̄₂ − b₃X̄₃

b₂ = [Cov(X₂,Y)Var(X₃) − Cov(X₃,Y)Cov(X₂,X₃)] / [Var(X₂)Var(X₃) − Cov(X₂,X₃)²]

b₃ = [Cov(X₃,Y)Var(X₂) − Cov(X₂,Y)Cov(X₂,X₃)] / [Var(X₂)Var(X₃) − Cov(X₂,X₃)²]      (3.2)
Multiple regression analysis allows one to discriminate between the effects of the explanatory variables, making allowance for their possible correlation. The coefficient of each X variable provides an estimate of its influence on Y controlling for all other X variables. We can demonstrate this and see it in a simple example.

As in the simple model, b₁ is the intercept, while b₂ and b₃ are the slope coefficients of X₂ and X₃ respectively. The interpretation of b₂ is now the effect on Y of a unit change in X₂ holding X₃ constant. X₃ is often called a control variable.

We are interested in the change in Y for a change in X₂ and no change in X₃:

Y = β₁ + β₂X₂ + β₃X₃ + u
Y + ΔY = β₁ + β₂(X₂ + ΔX₂) + β₃X₃ + u

=>  β₂ = ΔY / ΔX₂
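A quick numerical check of the covariance formulas in (3.2) (a Python/numpy sketch on simulated, purely illustrative data; the coefficients match a direct least-squares fit):

import numpy as np

rng = np.random.default_rng(1)
n = 500
x2 = rng.normal(size=n)
x3 = 0.6 * x2 + rng.normal(size=n)               # correlated regressors
y = 1.0 + 2.0 * x2 - 1.0 * x3 + rng.normal(size=n)

v2, v3 = np.var(x2), np.var(x3)
c2y = np.cov(x2, y, bias=True)[0, 1]
c3y = np.cov(x3, y, bias=True)[0, 1]
c23 = np.cov(x2, x3, bias=True)[0, 1]
den = v2 * v3 - c23**2

b2 = (c2y * v3 - c3y * c23) / den
b3 = (c3y * v2 - c2y * c23) / den
b1 = y.mean() - b2 * x2.mean() - b3 * x3.mean()

X = np.column_stack([np.ones(n), x2, x3])
print(np.linalg.lstsq(X, y, rcond=None)[0])      # matches [b1, b2, b3]
print(b1, b2, b3)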
. reg linc school

Table 3.1

      Source |       SS       df       MS           Number of obs =    4061
-------------+------------------------------        F(  1,  4059) =  697.20
       Model |  187.090253     1  187.090253        Prob > F      =  0.0000
    Residual |  1089.20575  4059  .268343373        R-squared     =  0.1466
-------------+------------------------------        Adj R-squared =  0.1464
       Total |    1276.296  4060  .314358622        Root MSE      =  .51802

------------------------------------------------------------------------------
        linc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      school |   .1713449   .0064892    26.40   0.000     .1586225    .1840672
       _cons |   1.760449    .015451   113.94   0.000     1.730156    1.790741
------------------------------------------------------------------------------
To get a true estimate of the returns to education, we want to hold ability constant. This is done in a multivariate regression.
. reg linc school ability

Table 3.2

      Source |       SS       df       MS           Number of obs =    3966
-------------+------------------------------        F(  2,  3963) =  382.03
       Model |  197.859286     2  98.9296432        Prob > F      =  0.0000
    Residual |  1026.25968  3963  .258960302        R-squared     =  0.1616
-------------+------------------------------        Adj R-squared =  0.1612
       Total |  1224.11896  3965  .308731138        Root MSE      =  .50888

------------------------------------------------------------------------------
        linc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      school |   .1525172   .0068646    22.22   0.000     .1390588    .1659757
     ability |   .0702532   .0089544     7.85   0.000     .0526975     .087809
       _cons |   1.793294   .0158411   113.21   0.000     1.762236    1.824351
------------------------------------------------------------------------------
Alternatively, we could purge income and schooling of the effect of ability. To do so, we estimate the auxiliary regressions

linc = c₁ + c₂·Abil  and  S = d₁ + d₂·Abil

and compute the fitted values. We then calculate the residuals Rlinc = linc − fitted(linc) and RS = S − Ŝ. If we now regress Rlinc on RS, we estimate the effect of education on income, accounting for ability.
. reg rlinc rs

Table 3.3

      Source |       SS       df       MS           Number of obs =    3966
-------------+------------------------------        F(  1,  3964) =  493.76
       Model |  127.832486     1  127.832486        Prob > F      =  0.0000
    Residual |  1026.25967  3964  .258894972        R-squared     =  0.1108
-------------+------------------------------        Adj R-squared =  0.1105
       Total |  1154.09216  3965  .291069901        Root MSE      =  .50882

------------------------------------------------------------------------------
       rlinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          rs |   .1525172   .0068637    22.22   0.000     .1390605     .165974
       _cons |   8.56e-09   .0080795     0.00   1.000    -.0158404    .0158404
------------------------------------------------------------------------------
Note that the two methods provide the same estimate of the returns to education. So a multivariate model allows us to estimate the effect of a variable holding other characteristics constant. This method of partialling out is due to Frisch and Waugh and is important when estimating fixed effect regressions (for example with multiple observations on the same individual/family/firm).
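A small Python sketch of this partialling-out (Frisch-Waugh) result on simulated data (the variable names linc, school, abil mimic the Stata example but the numbers are made up):

import numpy as np

rng = np.random.default_rng(2)
n = 4000
abil = rng.normal(size=n)
school = 12 + 2 * abil + rng.normal(size=n)          # schooling depends on ability
linc = 1.8 + 0.15 * school + 0.07 * abil + 0.5 * rng.normal(size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# (a) multivariate regression of linc on school and ability
b_multi = ols(np.column_stack([np.ones(n), school, abil]), linc)

# (b) purge linc and school of ability, then regress residual on residual
Za = np.column_stack([np.ones(n), abil])
r_linc = linc - Za @ ols(Za, linc)
r_school = school - Za @ ols(Za, school)
b_fwl = ols(np.column_stack([np.ones(n), r_school]), r_linc)

print(b_multi[1], b_fwl[1])     # the two schooling coefficients coincide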
A simple regression of earnings on schooling would lead to biased estimates of the returns to education, since schooling and ability are correlated and ability has a positive effect on earnings. This is the omitted variable bias. Omitted variable bias will appear if:
- the regressor is correlated with the omitted variable
- the omitted variable determines the dependent variable
If an omitted variable determines the dependent variable, then it is included in the error term. If this variable is also correlated with X, then X_i and u_i are correlated. Omitted variable bias means that the OLS assumption that E(u_i|X_i) = 0 is broken: the correlation of X_i and u_i means that the conditional expectation of u is not zero.
From last week, we know that:

b₂ = β₂ + Cov(X, u) / Var(X)

or, equivalently,

b₂ = β₂ + ρ_{X,u} · (σ_u / σ_X)      (3.2)

The second term on the RHS of (3.2) is the bias due to the omitted variable. This bias does not depend on the sample size (increasing the sample size does not reduce omitted variable bias), and in this case b₂ is not a consistent estimator of β₂. The larger the correlation between X_i and u_i, the larger the bias. The direction of the bias depends on the sign of the correlation between X_i and u_i.
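A short simulation sketch of this point (illustrative Python; the true coefficients and the correlation between the included and omitted regressor are invented for the example):

import numpy as np

rng = np.random.default_rng(3)
# True model: y = 1 + 1*x + 1*z + u, but z is omitted and corr(x, z) > 0
for n in (100, 10_000, 1_000_000):
    x = rng.normal(size=n)
    z = 0.5 * x + rng.normal(size=n)        # omitted variable, correlated with x
    y = 1.0 + 1.0 * x + 1.0 * z + rng.normal(size=n)
    b2 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    print(n, b2)    # stays near 1 + Cov(x, z)/Var(x) = 1.5 however large n gets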
Now consider the more general k-variable model, where

Y_t = β₁ + β₂X_2t + ... + β_kX_kt + u_t      (3.3)

Then, upon finding the parameter values β₁, β₂, ..., β_k that minimise the RSS, we get k conditions like the equations above, which are awkward to solve for exact expressions for β₁, β₂, ..., β_k. Matrices do make our life easier, therefore…
Consider now the matrix form equivalent, in which the basic problem is written as:

Y = Xβ + ε

where

Y = (Y₁, Y₂, ..., Y_T)' ,   X = [1 X₁; 1 X₂; ... ; 1 X_T] ,   β = (β₁, β₂)' ,   ε = (ε₁, ε₂, ..., ε_T)'

So the first line has

Y₁ = β₁ + β₂X₁ + ε₁

and the t-th line has

Y_t = β₁ + β₂X_t + ε_t ,   t = 1, 2, ..., T
The estimates of β are obtained by minimising the RSS,

Σᵢ₌₁ⁿ ε_i² = Σᵢ₌₁ⁿ (y_i − x_i'β)²

So this is a basic problem of minimisation. Expanding the sum of squares:

S(β) = (y − Xβ)'(y − Xβ)
     = y'y − β'X'y − y'Xβ + β'X'Xβ
     = y'y − 2β'X'y + β'X'Xβ      (3.4)
Equation (3.4) is the statement of the RSS equivalent to the equation in scalar form. To obtain our OLS estimates we must find the β vector that minimises the RSS (in keeping with previous notation we will call the sample estimate b). Differentiating with respect to β and setting equal to zero we get

∂S(β)/∂β = −2X'y + 2X'Xβ = 0      (3.5)

Solving equation (3.5) for β we have:

(X'X)b = X'y

where b is the solution value of β. Solving explicitly for b yields

b = (X'X)⁻¹X'y      (3.6)

which is the matrix equivalent of (2.6).
Note that (X’X)-1 must exist (the full rank condition).
We know this solution is a minimum because if we take the second derivative of the RSS with respect to β (again referring to the solution estimates as b) we get:

∂²S(b)/∂b∂b' = 2X'X

Being a quadratic form, this matrix is at least positive semi-definite; since the inverse (X'X)⁻¹ exists, it is in fact positive definite, and hence the solution is a minimum.
How does this compare with the simple model coefficients? For the bivariate model,

X'X = [ T        ΣX_t  ]
      [ ΣX_t     ΣX_t² ]

so that

(X'X)⁻¹ = 1/(TΣX_t² − (ΣX_t)²) · [ ΣX_t²   −ΣX_t ]
                                  [ −ΣX_t     T   ]

        = 1/(TΣ(X_t − X̄)²) · [ ΣX_t²   −ΣX_t ]
                              [ −ΣX_t     T   ]

and

X'Y = [ ΣY_t    ]
      [ ΣX_tY_t ]

Therefore, combining the two expressions, we have:

b = (X'X)⁻¹X'Y = 1/(TΣ(X_t − X̄)²) · [ ΣX_t²ΣY_t − ΣX_tΣX_tY_t ]
                                     [ TΣX_tY_t − ΣX_tΣY_t     ]

Taking the two elements separately, the slope is

b₂ = [TΣX_tY_t − ΣX_tΣY_t] / [TΣ(X_t − X̄)²] = Σ(X_t − X̄)(Y_t − Ȳ) / Σ(X_t − X̄)²

and, rearranging the first element, it can be shown that

b₁ = [ΣX_t²ΣY_t − ΣX_tΣX_tY_t] / [TΣ(X_t − X̄)²] = Ȳ − b₂X̄

which are equivalent to (2.6).
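A compact Python sketch of the matrix formulae b = (X'X)⁻¹X'y and V(b) = s²(X'X)⁻¹ on simulated data (illustrative only; the slope is also checked against the Cov/Var formula):

import numpy as np

rng = np.random.default_rng(4)
T = 200
x = rng.normal(size=T)
y = 5.0 + 2.0 * x + rng.normal(size=T)

X = np.column_stack([np.ones(T), x])             # T x 2 design matrix
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                            # (X'X)^{-1} X'y

e = y - X @ b
s2 = e @ e / (T - 2)                             # s^2 = e'e/(n - K)
se = np.sqrt(np.diag(s2 * XtX_inv))              # standard errors of b1, b2

print(b)                                                    # close to [5, 2]
print(b[1], np.cov(x, y, bias=True)[0, 1] / np.var(x))      # same slope as Cov/Var
print(se)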
3.2 Least squares assumptions in multiple regression
The first 4 assumptions are identical to those imposed in the simple linear model.
- Assumption 1: The conditional mean of u is 0

E(u_i | X_1i, ..., X_ki) = E(u_i) = 0      (A3.1)

On average over the population, Y_i falls on the regression line (no systematic error). As we will see below, this is the key assumption that makes OLS unbiased.

- Assumption 2: (X_i, Y_i) are independently and identically distributed.      (A3.2)

- Assumption 3: The population variance of u is constant for all i.

Formally, this condition can be written as: Var(u_i) = σ² for all i.      (A3.3)
Of course σ² is unknown. This property is known as homoskedasticity (constant variance). This is used below to show that OLS is the Best Linear Unbiased Estimator. In the presence of heteroskedasticity, the standard errors of OLS will be biased, but not the estimates.
- Assumption 4: Normality      (A3.4)

- Assumption 5: No perfect multicollinearity

Regressors are multicollinear if one of the regressors is a linear function of the others. Algebraically, the no-multicollinearity condition can be stated as:

rank(X'X) = k      (A3.5)

i.e. the X'X matrix has full rank.
Say we have 3 regressors (plus the constant X0) and 4 observations such that

id   X0   X1   X2   X3
1    1    7    2    16
2    1    5    3    13
3    1    3    4    10
4    1    1    5    7

You can see that X3 = 2*X1 + X2.
Multicollinearity makes it impossible to calculate OLS since it leads to a division by 0.
Statistical packages will therefore not allow you to calculate OLS in these circumstances, and you
need to respecify your model.
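A minimal Python/numpy sketch of why: with the perfectly collinear column, X'X does not have full rank, so its inverse does not exist (the data are those of the small table above):

import numpy as np

X0 = np.ones(4)
X1 = np.array([7.0, 5.0, 3.0, 1.0])
X2 = np.array([2.0, 3.0, 4.0, 5.0])
X3 = 2 * X1 + X2                          # perfectly collinear regressor
X = np.column_stack([X0, X1, X2, X3])

print(np.linalg.matrix_rank(X.T @ X))     # 3, not 4: X'X is not of full rank
print(np.linalg.det(X.T @ X))             # (numerically) zero determinant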
3.3 Properties of the OLS Estimator

What are the virtues of OLS on a statistical basis? Here the discussion must address the issues of unbiasedness, efficiency, etc. Equation (3.6) is the formula for the OLS estimator. From it, it is possible to work out the expectation and variance of the estimator, b.
3.3.1 Unbiasedness

b = (X'X)⁻¹X'Y
  = (X'X)⁻¹X'(Xβ + ε)
  = (X'X)⁻¹X'Xβ + (X'X)⁻¹X'ε

As (X'X)⁻¹X'X = I, we can write

b = β + (X'X)⁻¹X'ε

Therefore, taking expectations, we have

E(b) = β + (X'X)⁻¹X'E(ε) = β      (3.7)

as E(ε) = 0 (A3.1). Equation (3.7) shows that the OLS estimator is unbiased.
3.3.2 Variance of the Estimator

V(b) = E[(b − β)(b − β)']

This is the k×k matrix whose diagonal elements are the variances V(b₁), ..., V(b_k) and whose off-diagonal elements are the covariances cov(b_i, b_j).

We know that

b − β = (X'X)⁻¹X'ε

Therefore,

(b − β)(b − β)' = (X'X)⁻¹X'εε'X(X'X)⁻¹

Taking expectations,

E[(b − β)(b − β)'] = (X'X)⁻¹X'E(εε')X(X'X)⁻¹

and as V(ε) = σ²I by assumption (A3.3), we can show

E[(b − β)(b − β)'] = (X'X)⁻¹X'(σ²I_T)X(X'X)⁻¹ = σ²(X'X)⁻¹X'X(X'X)⁻¹

or

V(b) = σ²(X'X)⁻¹      (3.8)

Hence lim V(b) = 0 as n → ∞.

This formula for the variance of the OLS estimates is valid only in the case of homoskedasticity.
Since σ² is unknown, we use an unbiased estimator s², where s² = e'e / (n − K).
* Adjusted R2

The coefficient of determination is the proportion of the variation in the dependent variable explained by the model, and is calculated as

R² = ESS/TSS = 1 − RSS/TSS ,   0 ≤ R² ≤ 1
This must increase as more variables are added to the equation, regardless of their relevance. An alternative, therefore, is the adjusted R², which is useful for comparing the fit of specifications that differ in the addition or deletion of explanatory variables. In contrast,

R̄² = 1 − [RSS/(n − k)] / [TSS/(n − 1)]      (3.9)

only increases if the t-ratio on the included variable is greater than unity. We can see the relationship between the two measures as

R̄² = 1 − (n − 1)/(n − k)·(1 − R²) = (1 − k)/(n − k) + (n − 1)/(n − k)·R²      (3.10)
i) An increase in R̄² does not necessarily mean that the added variable is significant.
ii) A high R̄² does not mean that the regressors are a true cause of the dependent variable.
iii) A high R̄² does not mean there is no omitted variable bias.
iv) A high R̄² does not mean your specification is the most appropriate.
The R̄² tells you only that the regressors are good at explaining the values of the dependent variable in your sample.
3.5 Hypothesis testing
3.5.1 Hypothesis testing for a single coefficient
After estimating a coefficient (slope of the regression), you want to assess whether this estimate is
statistically different from 0. More generally, you may want to test whether your coefficient b1 is
different from b1,0.
* Two sided tests
H0: b₁ = b₁,₀ vs H1: b₁ ≠ b₁,₀
This is known in statistics as a t-test.
t = (b₁ − b₁,₀) / SE(b₁)
Comparing t with critical values, you can assess whether H0 is rejected or not.
If |t| < tc, we cannot reject H0: b₁ is not statistically different from b₁,₀. Traditionally, econometricians rely on the 95% confidence level, so the critical value is tc = 1.96 (since we have assumed normality). Confidence levels of 90% and 99% are also used, for which the critical values are respectively 1.645 and 2.576.
Alternatively, rather than t-values, some econometricians prefer to present p-values. The p-value gives the probability of obtaining a t-statistic at least as large (in absolute value) as the one observed purely by chance, if the null hypothesis is true. So whilst when testing whether your coefficient is different from 0 you want a large t-value (greater than 1.96), you want a p-value lower than 0.05. The two methods provide the same information, but p-values are easier to interpret, as the p-value gives the probability of a Type I error (wrongly rejecting H0).
                 Decision
True             H0                 H1
H0               ok                 Type I error
H1               Type II error      ok
Type I error: Wrongly reject the true null hypothesis
Type II error: Not reject Null when false.
The higher your confidence level (99% vs 95%), the lower the risk of a Type I error (1% vs 5%), but the higher the risk of a Type II error. So at the 99% confidence level, when you accept H1 you are wrong in only 1% of cases; however, this means that you will conclude that your coefficient is not different from 0 in a larger number of cases (Type II error).
Going back to the output of Table 3.2, we want to know whether our estimates are significantly different from 0 at the 95% confidence level (probability of a Type I error of less than 5%). The school estimate has a standard error of 0.0068, so

t = (0.1525 − 0) / 0.0068 = 22.22

t > tc = 1.96, so we accept that the schooling coefficient is different from 0 at the 95% confidence level. Since t > 2.576 we also accept it at the 99% confidence level. In fact our estimate is so precise that it is significant at any conventional level.
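A tiny Python check of this calculation, using the Table 3.2 numbers and the normal critical values assumed in the notes:

b, se, b_null = 0.1525172, 0.0068646, 0.0
t = (b - b_null) / se
print(t)                          # about 22.2

for level, tc in [(0.90, 1.645), (0.95, 1.96), (0.99, 2.576)]:
    print(level, abs(t) > tc)     # rejected at every conventional level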
Table 3.2 (repeated)

      Source |       SS       df       MS           Number of obs =    3966
-------------+------------------------------        F(  2,  3963) =  382.03
       Model |  197.859286     2  98.9296432        Prob > F      =  0.0000
    Residual |  1026.25968  3963  .258960302        R-squared     =  0.1616
-------------+------------------------------        Adj R-squared =  0.1612
       Total |  1224.11896  3965  .308731138        Root MSE      =  .50888

------------------------------------------------------------------------------
        linc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      school |   .1525172   .0068646    22.22   0.000     .1390588    .1659757
     ability |   .0702532   .0089544     7.85   0.000     .0526975     .087809
       _cons |   1.793294   .0158411   113.21   0.000     1.762236    1.824351
------------------------------------------------------------------------------
* One sided t-test

Sometimes it makes more sense to test whether a coefficient is greater (smaller) than a given value rather than whether it is different from it (as in the two sided t-test).

H0: b₁ = b₁,₀ vs H1: b₁ < b₁,₀

The t-statistic is the same as in the two sided test but the critical values are different. Here is an example demonstrating the logic of a one-sided test. Suppose you received some product, which you know can be of normal or extra quality; however, it is not documented which quality you got. Each quality has a given distribution, so you decide to take a sample and compare your sample mean with the normal quality product mean (your prior here is that you were sent normal quality rather than extra).
[Figure: one-tailed tests. Probability density function of b₂ under the null hypothesis H0: μ₂ = μ₂⁰ against the alternative H1: μ₂ = μ₂¹, with 2.5% rejection regions in each tail.]

The sample mean falls in the critical region, so we reject μ₂⁰; however, it would make little sense to accept μ₂¹, since the sample mean contradicts μ₂¹ even more strongly. This is when a one-sided test is needed. So basically, we want to get rid of the left hand side critical region. By doing so, we need to increase the right hand side critical region for our confidence level to remain at 95%. So we stack all the Type I error in the right tail of the distribution. The critical values for a one-sided test at the 95% and 99% confidence levels are 1.645 and 2.326 respectively.
* Test of joint hypotheses

We want to test that all of our estimated coefficients are jointly different from 0.

H0: b₂ = 0 & b₃ = 0 & … & b_k = 0 vs H1: at least one coefficient is different from 0

The F-test is then defined as:

F(k − 1, n − k) = [SSE/(k − 1)] / [SSR/(n − k)]

where the Explained Sum of Squares is SSE = Σᵢ₌₁ⁿ (ŷ_i − ȳ)² and SSR = Σᵢ₌₁ⁿ e_i². Since

R² = SSE/SST = 1 − SSR/SST

the F-test can also be written as:

F(k − 1, n − k) = [R²/(k − 1)] / [(1 − R²)/(n − k)]

Using the Table 3.2 results, F(2, 3963) = (0.1616/2) / [(1 − 0.1616)/3963] ≈ 382 > Fc(2, 3963) ≈ 3.00 (the 5% critical value), so we reject the hypothesis that all of our estimates are null.
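A short Python sketch of this joint F-test computed from the R² (Table 3.2 values: R² = 0.1616, k = 3 estimated coefficients including the constant, n = 3966; scipy is used only for the critical value):

from scipy.stats import f

R2, k, n = 0.1616, 3, 3966
F = (R2 / (k - 1)) / ((1 - R2) / (n - k))
print(F)                               # about 382, as in the Stata output

print(f.ppf(0.95, k - 1, n - k))       # 5% critical value, about 3.00
print(f.sf(F, k - 1, n - k))           # p-value, effectively zero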
* F-test on a subset of coefficients

It is also possible to conduct hypothesis testing on a subset of coefficients.

- 2 restriction case:

H0: b₂ = 0 & b₃ = 0 vs H1: b₂ or b₃ ≠ 0

In this case

F = (1/2) · (t₂² + t₃² − 2ρ̂_{t₂,t₃}·t₂t₃) / (1 − ρ̂²_{t₂,t₃})

where t₂ and t₃ are the individual t-statistics and ρ̂_{t₂,t₃} is an estimate of their correlation. In the case of more than two restrictions the formula becomes more complicated, but this test can be done on any econometric package.
* Test of a single restriction involving multiple coefficients

H0: β₁ = β₂ vs H1: β₁ ≠ β₂

Suppose your regression has the following form:

Y_i = β₀ + β₁X₁ + β₂X₂ + u_i

To test β₁ = β₂, we add and subtract β₂X₁; the model then becomes:

Y_i = β₀ + (β₁ − β₂)X₁ + β₂(X₁ + X₂) + u_i

which can be rewritten as:

Y_i = β₀ + γ₁X₁ + β₂W + u_i ,   with γ₁ = β₁ − β₂ and W = X₁ + X₂.

Then conducting a t-test of γ₁ = 0 is equivalent to the desired test of β₁ = β₂.
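A small Python sketch of this reparameterisation on simulated data where β₁ = β₂ holds by construction (all numbers are illustrative):

import numpy as np

rng = np.random.default_rng(5)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 0.5 + 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)   # beta1 = beta2 holds

w = x1 + x2                                # regress y on x1 and w = x1 + x2
X = np.column_stack([np.ones(n), x1, w])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
s2 = e @ e / (n - 3)
se = np.sqrt(np.diag(s2 * XtX_inv))

print(b[1] / se[1])   # t-statistic on gamma1 = beta1 - beta2; small when beta1 = beta2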
Going back to our pupil teacher ratio and test score example, we assess the effect of the specification on our conclusions; this highlights the problem of omitted variable bias. For omitted variable bias to occur, we must have:
- at least one regressor is correlated with the omitted variable
- the omitted variable is a determinant of the outcome
                         (1)          (2)          (3)          (4)          (5)
                         test score   test score   test score   test score   test score
Student teacher ratio    -2.280       -1.101       -0.998       -2.165       -1.014
                         (0.519)**    (0.433)*     (0.270)**    (0.384)**    (0.269)**
% english learner                     -0.650       -0.122                    -0.130
                                      (0.031)**    (0.033)**                 (0.036)**
% reduced meal                                     -0.547                    -0.529
                                                   (0.024)**                 (0.038)**
% income assistance                                             -1.036       -0.048
                                                                (0.076)**    (0.059)
Constant                 698.933      686.032      700.150      710.406      700.392
                         (10.364)**   (8.728)**    (5.568)**    (7.819)**    (5.537)**
Observations             420          420          420          420          420
R-squared                0.05         0.43         0.77         0.44         0.77
Robust standard errors in parentheses. * significant at 5%; ** significant at 1%.
Previously we estimated model (1). However, we think that the proportion of non-English-speaking pupils will affect both the PTR and the score. Model (2) confirms this hunch, since the coefficient on the PTR is halved; both covariates are statistically significant. Note also that the R² increases substantially, so this specification better fits the data. Adding the proportion of pupils on subsidised meals further reduces the PTR coefficient; all covariates are significant and the R² reaches 0.77. We then think that all these new covariates may be related to poverty. Adding income assistance to the base model confirms that income, PTR and test scores are correlated. However, in the complete model, % on income assistance has no significant effect; note that the R² does not improve when including this extra variable. So, due to omitted variable bias, last week's recommendation to the headteacher would have to be revised. Note also that it does not really matter which control we use for students' characteristics, since the coefficient on the PTR does not change much between models (2) and (5). This could indicate that we have efficiently dealt with the omitted variable bias problem.
Chapter 4: Non linear regression functions

So far we have assumed linearity of the regression function, so that the slope of the population regression function is constant: a unit change in X produces the same effect on Y at all points of the X distribution. This is a stringent hypothesis. What if the effect of X is not constant? We must then think of a non-linear relationship. In fact, there are two types of problems:
1) The effect of a change in X1 on Y depends on the value of X1.
2) The effect of a change in X1 on Y depends on the value of another covariate X2.
[Figure: three panels of Y against X1 - a constant slope model; a model where the slope depends on X1; a model where the slope depends on X2 (X2 = 0 vs X2 = 1).]
4.1 General strategy for modelling non-linear relationships

4.1.1 Polynomial functions

We saw last week that test scores were correlated with some measures of poverty (% of families qualifying for income support, % of non-English speakers). We now use the district median income as a measure of family background. The median income has a mean of $13,700 and ranges from $5,300 to $55,300. The correlation between the two variables is 0.71.
Regression with robust standard errors               Number of obs =     420
                                                      F(  1,   418) =  273.29
                                                      Prob > F      =  0.0000
                                                      R-squared     =  0.5076
                                                      Root MSE      =  13.387

------------------------------------------------------------------------------
             |               Robust
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      avginc |    1.87855   .1136349    16.53   0.000     1.655183    2.101917
       _cons |   625.3836   1.867872   334.81   0.000      621.712    629.0552
------------------------------------------------------------------------------
[Figure: scatter of test score against median district income with the fitted linear regression line.]
When income is very low, say under $10,000, most points lie below the OLS line; similarly, for high income (>$40,000) all points are below the OLS line. This is because, by imposing linearity, we are missing the curvature of the relationship. The relationship looks much more like a quadratic function.
Regression with robust standard errors               Number of obs =     420
                                                      F(  2,   417) =  428.52
                                                      Prob > F      =  0.0000
                                                      R-squared     =  0.5562
                                                      Root MSE      =  12.724

------------------------------------------------------------------------------
             |               Robust
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      avginc |   3.850995   .2680941    14.36   0.000      3.32401    4.377979
     avginc2 |  -.0423085   .0047803    -8.85   0.000     -.051705   -.0329119
       _cons |   607.3017   2.901754   209.29   0.000     601.5978    613.0056
------------------------------------------------------------------------------
[Figure: scatter of test score against median district income with the linear and quadratic fitted lines.]
While it looks like we have done a better job at fitting a regression line using a quadratic function, we can test this more formally. If we believe that the relationship is linear, then the coefficient on income squared (avginc2) should not be significantly different from 0.

H0: b₂ = 0 vs H1: b₂ ≠ 0

This is a two sided t-test: t = −0.0423085/0.0047803 ≈ −8.85. We reject H0; thus the quadratic is a better fit than the linear model.
What is the effect on test scores of a change in income of $1,000?

* Move from $10,000 to $11,000:

ΔŶ = Ŷ(11) − Ŷ(10) = (b₀ + b₁·11 + b₂·11²) − (b₀ + b₁·10 + b₂·10²) = 644.53 − 641.57 = 2.96

The predicted difference in the test score achieved by pupils in a district where the median income is $11,000 rather than $10,000 is 2.96 points.

* Move from $40,000 to $41,000:

Similarly, we can calculate that the difference in test score between a district where the median income is $41,000 rather than $40,000 is 0.42 points.

If we had believed in the linear model, we would have concluded that at all points of the income distribution, an increase of $1,000 leads to a point increase of 1.87.
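A quick Python check of these predicted effects, using the coefficients from the quadratic Stata output above (income measured in $1,000s, as in the regression):

b0, b1, b2 = 607.3017, 3.850995, -0.0423085

def yhat(inc):
    # predicted test score at median district income 'inc' (in $1,000s)
    return b0 + b1 * inc + b2 * inc**2

print(yhat(11) - yhat(10))     # about 2.96 points
print(yhat(41) - yhat(40))     # about 0.42 points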
- Standard error of the estimated effects

Say that we want to make a recommendation about the potential effect on test scores of a change of median income from $10,000 to $11,000; we need to compute the confidence interval around our estimated effect of 2.96 (the estimated effect ± 1.96 times its standard error). Since

ΔŶ = b₁·(11 − 10) + b₂·(11² − 10²) = b₁ + 21b₂

the required standard error is σ_ΔŶ = SE(b₁ + 21b₂).

Statistical digression: we should all know that, for a single restriction, the F statistic is the square of the t-statistic. Hence

F = t² = (b/σ_b)²   =>   σ_b = |b| / √F

so the standard error of the linear combination can be recovered from the F test of the corresponding restriction.
Quadratic functions can easily be extended to a more general polynomial function.
Say we have the following function: Y_i = b0 + b1 X_i + b2 X_i² + … + br X_i^r
First, we want to test whether a simple linear model would be appropriate. We do so by implementing an F-test:
H0: b2 = 0 & b3 = 0 & … & br = 0     vs     H1: at least one coefficient is different from 0
To do so, we first estimate the restricted model, in which we force the null hypothesis to be true. So we estimate Y = b0 + b1 X and calculate SSR_r (the residual sum of squares of the restricted model).
Then we estimate the unrestricted model, under which the alternative hypothesis is allowed: Y = b0 + b1 X + b2 X² + … + br X^r, and calculate SSR_u.
If the sum of squared residuals is sufficiently smaller in the unrestricted model than in the restricted model, the test rejects the null hypothesis.
F(q, n − k_u − 1) = [(SSR_r − SSR_u)/q] / [SSR_u/(n − k_u − 1)] = [(R²_u − R²_r)/q] / [(1 − R²_u)/(n − k_u − 1)]
where q is the number of restrictions tested (here r − 1) and k_u is the number of regressors in the unrestricted model.
Say that we think the relationship between median income and test score may be cubic. We first estimate the restricted model:
Restricted model

Regression with robust standard errors
Number of obs = 420;  F(1, 418) = 273.29;  Prob > F = 0.0000;  R-squared = 0.5076;  Root MSE = 13.387

     testscr |      Coef.   Robust Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+-----------------------------------------------------------------------
      avginc |    1.87855          .1136349   16.53    0.000     1.655183    2.101917
       _cons |   625.3836          1.867872  334.81    0.000      621.712    629.0552
Unrestricted model

Regression with robust standard errors
Number of obs = 420;  F(3, 416) = 270.18;  Prob > F = 0.0000;  R-squared = 0.5584;  Root MSE = 12.707

     testscr |      Coef.   Robust Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+-----------------------------------------------------------------------
      avginc |   5.018677          .7073504    7.10    0.000     3.628251    6.409104
     avginc2 |  -.0958052          .0289537   -3.31    0.001     -.152719   -.0388913
     avginc3 |   .0006855          .0003471    1.98    0.049     3.27e-06    .0013677
       _cons |    600.079          5.102062  117.61    0.000     590.0499     610.108
F(2,416) = [(.5584 − .5076)/2] / [(1 − .5584)/(420 − 4 − 1)] = 23.87 > Fc ≈ 3.0,
so we reject the null hypothesis that the model is linear, and carry on estimating the cubic model.
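The same joint test can be carried out directly in Stata after estimating the unrestricted (cubic) model; a minimal sketch, with avginc2 and avginc3 the squared and cubed income variables used above:
. gen avginc3 = avginc^3
. reg testscr avginc avginc2 avginc3, robust
. test avginc2 avginc3         // joint test that the quadratic and cubic terms are zero
Note that after a regression with the robust option, test reports a heteroskedasticity-robust F, so its value will differ slightly from the R²-based hand calculation above.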
What order should the polynomial have?
There is a trade-off between flexibility (higher order) and statistical precision. One solution is to start with a polynomial of order r and test whether the coefficient on the r-th power is significantly different from 0; if it is not, drop it and re-estimate. In practice, economists seldom use polynomials of order greater than 4.
4.1.2 Logarithmic function
For variables that are always positive, we can transform them into logarithms (ln x). This provides a non-linear relationship. The logarithm has some useful properties:
ln(1/x) = −ln(x)
ln(ax) = ln(a) + ln(x)
ln(x^a) = a ln(x)
Furthermore, for interpretation purposes, we often rely on the following approximation:
ln(x + Δx) ≈ ln(x) + Δx/x     when Δx/x is small
This leads to the following interpretations, depending on the model specification.
1) Y_i = b0 + b1 ln(X_i) + ε_i : a 1% change in X is associated with a change in Y of 0.01 b1.
2) ln(Y_i) = b0 + b1 X_i + ε_i : a change in X by one unit is associated with a 100 b1 % change in Y.
3) ln(Y_i) = b0 + b1 ln(X_i) + ε_i : a 1% change in X is associated with a b1 % change in Y. This is the elasticity of Y with respect to X.
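A minimal Stata sketch of the three specifications for the Californian data (lninc and lntest are the logs of median income and of the test score, the names used in the output below):
. gen lninc = ln(avginc)
. gen lntest = ln(testscr)
. reg testscr lninc, robust      // linear-log
. reg lntest avginc, robust      // log-linear
. reg lntest lninc, robust       // log-log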
Example 1: linear-log model
Y_i = β0 + β1 ln(X_i) + ε_i
[Figure: scatter of test score against median district income with the linear-log fitted values.]
Regression with robust standard errors
Number of obs = 420;  F(1, 418) = 679.70;  Prob > F = 0.0000;  R-squared = 0.5625;  Root MSE = 12.618

     testscr |      Coef.   Robust Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+-----------------------------------------------------------------------
       lninc |   36.41968          1.396943   26.07    0.000     33.67378    39.16559
       _cons |   557.8323           3.83994  145.27    0.000     550.2843    565.3803
ΔŶ = (b0 + b1 ln(X + ΔX)) − (b0 + b1 ln(X)) = b1 [ln(X + ΔX) − ln(X)] ≈ b1 (ΔX/X)
A 1% increase in income (ΔX/X = 0.01) is associated with a gain in test score of 0.36 points.
What is the predicted gain in scores from an increase in median income from $10,000 to $11,000?
ΔŶ = [557.8 + 36.42 ln(11)] − [557.8 + 36.42 ln(10)] = 3.47. Similarly, moving from $40,000 to $41,000 leads to an increase in test score of 0.90 points.
Example 2: log-linear model
ln(Y_i) = β0 + β1 X_i + ε_i
[Figure: scatter of lntest against median district income with the log-linear fitted values; lntest ranges from about 6.41 to 6.60.]
The log-linear model does not fit the data really well: at the two tails of the income distribution all observations lie below the fitted curve.
Regression with robust standard errors
Number of obs = 420;  F(1, 418) = 263.86;  Prob > F = 0.0000;  R-squared = 0.4982;  Root MSE = .02065

      lntest |      Coef.   Robust Std. Err.        t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------------
      avginc |   .0028441          .0001751    16.24    0.000     .0024999    .0031882
       _cons |   6.439362          .0028938  2225.21    0.000     6.433674    6.445051
ln(Ŷ + ΔŶ) − ln(Ŷ) = (b0 + b1 (X + ΔX)) − (b0 + b1 X) = b1 ΔX
⇒ ΔŶ/Ŷ ≈ b1 ΔX
A one-unit change in X (ΔX = 1) is associated with a 100 b1 % change in Y. Here, each additional $1,000 of median income increases the test score by about 0.28%. This model does not capture the curvature of the test score relationship: the predicted effect on test scores of moving from $10,000 to $11,000 is the same as that of moving from $40,000 to $41,000.
Example 3: log-log model
ln(Y_i) = β0 + β1 ln(X_i) + ε_i
[Figure: scatter of lntest against median district income with the log-log fitted values.]
The log-log model fits the data a bit better (higher R²) than the log-linear model, but there are still some problems at the tails of the distribution.
Regression with robust standard errors
Number of obs = 420;  F(1, 418) = 667.78;  Prob > F = 0.0000;  R-squared = 0.5578;  Root MSE = .01938

      lntest |      Coef.   Robust Std. Err.        t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------------
       lninc |    .055419          .0021446    25.84    0.000     .0512035    .0596345
       _cons |   6.336349          .0059246  1069.50    0.000     6.324704    6.347995
ln(Ŷ + ΔŶ) − ln(Ŷ) = (b0 + b1 ln(X + ΔX)) − (b0 + b1 ln(X))
⇒ ΔŶ/Ŷ ≈ b1 ΔX/X
so that
b1 = (ΔY/Y) / (ΔX/X) = (100 × ΔY/Y) / (100 × ΔX/X) = % change in Y / % change in X
b1 is the percentage change in Y associated with a 1% change in X; this is therefore the elasticity of Y with respect to X.
A 1% increase in income is associated with a 0.0554% increase in test score.
! R² can be used to compare the goodness of fit of the linear vs. linear-log models and of the log-linear vs. log-log models, but it cannot be used to compare models where the dependent variable is different.
When estimating a model where the dependent variable is in log format, it is not straightforward to predict values of Y:
Y_i = exp(β0 + β1 X_i + ε_i) = exp(β0 + β1 X_i) × exp(ε_i)
The problem is that even if E(ε_i) = 0, E(exp(ε_i)) ≠ 1. Thus, it is better to leave the predicted values in logarithmic format.
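If predictions in levels are nevertheless needed, one rough possibility (a sketch, not the only correction) is to rescale the naive prediction exp(ln Ŷ) by the sample mean of exp(residual), which corrects for E(exp(ε)) exceeding 1:
. reg lntest avginc
. predict lnyhat, xb
. predict ehat, resid
. gen expe = exp(ehat)
. su expe
. gen yhat = exp(lnyhat)*r(mean)    // "smearing"-type correction for E[exp(e)] > 1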
4.2 Interactions between variables
4.2.1 Dummy variable
It often happens that some of the factors in a regression are qualitative in nature and therefore not measurable in numerical terms:
- gender differences
- country differences
- cohort effects, …
Say we are interested in measuring the earnings differential (pay gap) between men and women. One solution would be to run regressions separately for men and women and then compare the coefficients (see section xx). Alternatively, it is possible to run a pooled regression with a gender indicator (dummy variable).
A dummy variable takes the value 1 for one category and 0 for all the others. In this simple case, we would create a dummy for men, which takes the value 1 for all men (alternatively, we could have decided: woman = 1, man = 0).
If the characteristic we are interested in has more than two possible categories (say region of a country, or quarter of the year), we can create a set of dummy variables.
! Creation of dummies can easily lead to problems of multicollinearity: in the table below, Q1 + Q2 + Q3 + Q4 equals the constant, so the full set of quarter dummies cannot be included together with the constant.
   Q1   Q2   Q3   Q4   cst
    1    0    0    0    1
    0    1    0    0    1
    0    0    1    0    1
    0    0    0    1    1
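A minimal Stata sketch (quarter, y and x are illustrative names): tabulate with the generate option creates the full set of dummies, and including all of them together with the constant reproduces the multicollinearity shown in the table, so one category must be dropped:
. tab quarter, gen(Q)        // creates Q1-Q4
. reg y x Q1 Q2 Q3 Q4        // Q1+Q2+Q3+Q4 equals the constant: Stata drops one dummy
. reg y x Q2 Q3 Q4           // equivalent specification, with quarter 1 as the reference category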
As determinants of lnwage, we rely on experience and gender.
* Pooled population
. reg lnpay emthemp

Number of obs = 299;  F(1, 297) = 53.20;  Prob > F = 0.0000;  R-squared = 0.1519;  Adj R-squared = 0.1491;  Root MSE = .50664
Model SS = 13.6560893 (df 1);  Residual SS = 76.2337868 (df 297);  Total SS = 89.8898761 (df 298)

       lnpay |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------
     emthemp |   .0061503    .0008432    7.29    0.000     .0044909    .0078097
       _cons |   9.335957    .0796927  117.15    0.000     9.179123    9.492791
Each month of experience on the labour market adds 0.6% to the wage of graduates. This coefficient is significantly different from 0. Now we are interested in gender differences in wages. If we think that women are discriminated against at the point of entry to the labour market and that employers offer them lower starting wages, but that after this initial discrimination experience is rewarded similarly for men and women, then we want to fit a model where the relationship between experience and wages differs only by the intercept across genders.
The gender dummy estimates this shift in the intercept.
. reg lnpay emthemp p_gender

Number of obs = 299;  F(2, 296) = 36.17;  Prob > F = 0.0000;  R-squared = 0.1964;  Adj R-squared = 0.1910;  Root MSE = .494
Model SS = 17.6557409 (df 2);  Residual SS = 72.2341352 (df 296);  Total SS = 89.8898761 (df 298)

       lnpay |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------
     emthemp |   .0055561    .0008352    6.65    0.000     .0039125    .0071997
         men |   .2352692    .0581138    4.05    0.000     .1209006    .3496378
       _cons |   9.264647    .0796763  116.28    0.000     9.107844    9.421451
Men are paid 23% more than women. This coefficient is significantly different from 0 (t = 4.05).
[Figure: scatter of lnpay against months employed, with the two parallel fitted lines (intercept shift by gender).]
Alternatively, we could have separated the sample by gender and run separate regressions for men and women.
What is the advantage of using a dummy variable?
- Pooling the sample reduces the variance of the estimated coefficients, hence smaller standard errors.
- The coefficient on experience is easily interpretable.
- The difference between the two populations is easily estimated.
What is the drawback of using a dummy variable?
- We assume that the returns to experience are the same for men and women, and that the two populations differ only by a shift in the intercept, i.e. we assume bm = bf.
This can easily be extended to models where the categorical variable takes on more than two values. One needs to define a reference category (to which the basic intercept applies) and a set of dummies for the other categories. It is often good practice to define the reference category as the dominant one (most observations). Whichever category is chosen as the reference, the R², the coefficients and standard errors on the other variables, and the F-statistic will be the same. The only changes are to the coefficients and standard errors of the dummies, and the final interpretation is not affected.
When estimating a model with a group of dummies for a categorical variable, the t-statistic for the significance of each individual dummy can be calculated, but more importantly it is useful to conduct a test of the joint explanatory power of the dummies.
As in the case of determining the order of a polynomial function, this is conducted with an F-test:
H0: b2 = 0 & b3 = 0 & … & bd = 0     vs     H1: at least one coefficient is different from 0
First, we estimate a model without the set of dummies for the categorical variable; this is the restricted model. Then we estimate the unrestricted model, with the full set of dummies.
We reject H0 if F(d, n − k_u − 1) > Fc, where
F(d, n − k_u − 1) = [(R²_u − R²_r)/d] / [(1 − R²_u)/(n − k_u − 1)]
and d is the number of dummies in the unrestricted model (number of categories − 1).
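In Stata the joint test can be performed with testparm after the unrestricted regression; a hedged sketch in which region is an illustrative categorical variable (the xi prefix expands it into _Iregion_* dummies):
. xi: reg lnpay emthemp i.region
. testparm _Iregion_*          // joint F-test that all region dummies are zero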
4.2.2 Interaction between dummy variables
Consider that we also want to control for the choice of subject at university (science vs humanities):
ln Y_i = β0 + β1 Exp_i + β2 Male_i + β3 Science_i + ε_i
So we can estimate the effect of having a scientific degree on wages, holding experience and gender constant. But we are concerned that the effect of the choice of subject on wages may differ by gender. So we estimate the following model, which includes the interaction term between sex and type of subject:
ln Y_i = β0 + β1 Exp_i + β2 Male_i + β3 Science_i + β4 (Male_i × Science_i) + ε_i
The coefficient on the interaction term estimates how the effect of a science degree on earnings differs for men relative to women. To demonstrate this mathematically, let's simplify the model to:
Y_i = β0 + β1 D1_i + β2 D2_i + β3 (D1_i × D2_i) + ε_i
E(Y_i | D1_i = d1, D2_i = 0) = b0 + b1 d1
E(Y_i | D1_i = d1, D2_i = 1) = b0 + b1 d1 + b2 + b3 d1
hence:
E(Y_i | D1_i = d1, D2_i = 1) − E(Y_i | D1_i = d1, D2_i = 0) = b2 + b3 d1
Thus the effect of D2 depends on the value of D1:
If for individual i, D1 = d1 = 0, then the effect of D2 on Y is b2.
If D1 = d1 = 1, then the effect of D2 on Y is b2 + b3.
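A hedged Stata sketch of this specification (male and science are illustrative 0/1 variables; lnpay and emthemp follow the notes):
. gen malesci = male*science              // interaction dummy
. reg lnpay emthemp male science malesci
. lincom science + malesci                // effect of a science degree for men (b3 + b4)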
4.2.3 Interactions between a continuous and a binary (dummy) variable
Say that we are concerned that gender discrimination does not stop at the recruitment process, but that the return to experience also varies by gender. By including the gender dummy G, we allowed the two regression lines to have different intercepts but forced them to remain parallel. By adding an interaction term, we allow the two slopes to differ. So we want to estimate:
ln Y_i = β0 + β1 X_i + β2 G_i + β3 (X_i × G_i) + ε_i
For women, G_i = 0, and the regression line is given by: ln Y_i = b0 + b1 X_i
For men, G_i = 1, and the regression line is: ln Y_i = (b0 + b2) + (b1 + b3) X_i
[Figure: scatter of lnpay against months employed, with gender-specific fitted lines (different intercepts and slopes).]
. reg lnpay emthemp p_gender empmal

Number of obs = 299;  F(3, 295) = 25.49;  Prob > F = 0.0000;  R-squared = 0.2058;  Adj R-squared = 0.1978;  Root MSE = .49192
Model SS = 18.5032689 (df 3);  Residual SS = 71.3866072 (df 295);  Total SS = 89.8898761 (df 298)

       lnpay |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------
     emthemp |   .0038518    .0012333    3.12    0.002     .0014246     .006279
        male |  -.0367646    .1564554   -0.23    0.814    -.3446747    .2711454
      empmal |   .0031257    .0016702    1.87    0.062    -.0001613    .0064126
       _cons |   9.403501    .1086282   86.57    0.000     9.189717    9.617286
The model predicts that men in fact have lower starting wages than women: the coefficient on male is negative (though insignificant) and represents the shift in the intercept, which in this simple model is the log wage of an individual with 0 months of experience. However, for women each month of experience adds about 0.39% to the wage, whilst for men it adds about 0.70% (.00385 + .00313). The interaction term is significantly different from 0 at the 10% level, so we reject the assumption that the slope of the returns to experience is the same for men and women. Alternatively, we could estimate a model where we impose that the two regression lines have different slopes but the same intercept.
Note that we obtain the same coefficients when we estimate the model separately by gender:
. reg lnpay emthemp p_gender empmal
(output identical to the regression reported above)

. reg lnpay emthemp if p_gender==0

Number of obs = 142;  F(1, 140) = 8.94;  Prob > F = 0.0033;  R-squared = 0.0600;  Adj R-squared = 0.0533;  Root MSE = .51388
Model SS = 2.36041929 (df 1);  Residual SS = 36.9701933 (df 140);  Total SS = 39.3306126 (df 141)

       lnpay |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------
     emthemp |   .0038518    .0012883    2.99    0.003     .0013047    .0063989
       _cons |   9.403501    .1134768   82.87    0.000     9.179152    9.627851

. reg lnpay emthemp if p_gender==1

Number of obs = 157;  F(1, 155) = 41.83;  Prob > F = 0.0000;  R-squared = 0.2125;  Adj R-squared = 0.2074;  Root MSE = .47121
Model SS = 9.28773401 (df 1);  Residual SS = 34.4164139 (df 155);  Total SS = 43.7041479 (df 156)

       lnpay |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------
     emthemp |   .0069774    .0010788    6.47    0.000     .0048463    .0091086
       _cons |   9.366737     .107857   86.84    0.000     9.153677    9.579796
Model specification summary:
1) Different intercept, same slope:      Y_i = β0 + β1 X_i + β2 D_i + ε_i
2) Different intercept, different slope: Y_i = β0 + β1 X_i + β2 D_i + β3 (X_i × D_i) + ε_i
3) Same intercept, different slope:      Y_i = β0 + β1 X_i + β3 (X_i × D_i) + ε_i
Some econometricians do not recommend estimating model 3 and advocate that the primary elements of the interaction term should always be included in the model.
4.2.4 Interaction between 2 continuous variables
Nothing limits us to interactions with dummy variables only. For example, going back to our Californian schools example, we may believe that the effect of the pupil-teacher ratio (PTR) on test scores differs depending on the percentage of non-English speakers in the school. We previously estimated a model controlling for the percentage of English learners, but we still imposed that the effect of the PTR was independent of this proportion. To drop this assumption, we interact the PTR with the percentage of English learners.
Regression with robust standard errors
Number of obs = 420;  F(3, 416) = 155.05;  Prob > F = 0.0000;  R-squared = 0.4264;  Root MSE = 14.482

     testscr |      Coef.   Robust Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+-----------------------------------------------------------------------
         str |  -1.117018          .5875135   -1.90    0.058    -2.271884    .0378468
      el_pct |  -.6729116          .3741231   -1.80    0.073    -1.408319    .0624958
      streng |   .0011618          .0185357    0.06    0.950    -.0352736    .0375971
       _cons |   686.3385          11.75935   58.37    0.000     663.2234    709.4537
When the percentage of English learners is at the median (el_pct = 8.85), the slope of the line relating test scores to str is −1.11 (= −1.12 + .0012 × 8.85). When the percentage of English learners is at the 75th percentile (el_pct = 23), the slope is −1.09. For a district with 8.85% English learners, reducing the str by one unit improves test scores by 1.11 points, but the same change in str improves test scores by only 1.09 points in a district with 23% English learners.
4.3 Chow test
If your data consist of two or more subsamples (by gender, cohort, …), you may want to test whether it is appropriate to conduct the analysis on the pooled sample (P) or separately for the different subsamples (1 & 2). This is just a special case of an F-test. The unrestricted model estimates the coefficients separately for the two subsamples. Since each subsample regression minimises the RSS for its own observations, the separate regressions must fit the data at least as well as the pooled (restricted) regression. There is a price to pay for this improvement in fit: since twice as many coefficients need to be estimated, k extra degrees of freedom are used up in the separate estimation.
Chow test: H0: b1 = b2   vs   H1: b1 ≠ b2
F(k, n − 2k) = [(RSS_p − (RSS_1 + RSS_2))/k] / [(RSS_1 + RSS_2)/(n − 2k)]
If F(k, n − 2k) > Fc, we reject H0 and conclude that it is more appropriate to estimate the model for the two populations separately (different slope, different intercept).
Application to the gender wage gap example:
F(2, 299 − 4) = [(76.23 − (36.97 + 34.41))/2] / [(36.97 + 34.41)/(299 − 4)] = [(76.23 − 71.38)/2] / [71.38/295] = 2.425/0.242 ≈ 10.0 > Fc ≈ 3
So we reject H0 and estimate the model separately by gender.
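The three residual sums of squares are stored by Stata after each regression in e(rss), so the Chow statistic can be computed by hand; a minimal sketch using the regressions above:
. reg lnpay emthemp
. scalar rssp = e(rss)
. reg lnpay emthemp if p_gender==0
. scalar rss1 = e(rss)
. reg lnpay emthemp if p_gender==1
. scalar rss2 = e(rss)
. display "Chow F(2,295) = " ((rssp-(rss1+rss2))/2) / ((rss1+rss2)/295)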
/*
A Chow test is simply a test of whether the coefficients estimated over one group of the data are
equal to the coefficients estimated over another, and you would be better off to forget the word
Chow and remember that definition.
Let's pretend you have some model and two or more groups of data. Your model predicts
something about the behaviour within the group based on certain characteristics that vary within
the group. Under the assumption that each group's behavior is unique, you have
y1 = X1*b1 + u1      (equation for group 1)
y2 = X2*b2 + u2      (equation for group 2)
Now, you want to test whether the behavior for one group is the same as for another, which
means you want to test
H0: b1 = b2 = …   vs   H1: b1 ≠ b2
Testing coefficients across separately estimated models is difficult to impossible, depending on
things we need not go into right now. A trick is to "pool" the data to convert the multiple
equations into one giant equation, so that we can use the tools that we know about:
y = d1*(X1*b1 + u1) + d2*(X2*b2 + u2)
where y is the set of all outcomes (y_1, y_2), and d1 is a variable that is 1 when the data are for
group 1 and 0 otherwise, d2 is 1 when the data are for group 2 and 0 otherwise, ....
Rewriting the model a little bit:
y =d1*X1*b1 + d2*X2*b2 + d1*u1 + d2*u2
= (X1*d1)*b1 + (X2*d2)*b2 + d1*u1 + d2*u2
By stacking the data, I can get back estimates of b1, b2, ...
I include not X_1 in my model, but X_1*d1 (a set of variables equal to X_1 when
group is 1 and 0 otherwise); I include not X_2 in my model, but X_2*d2 (a set of
variables equal to X_2 when group is 2 and 0 otherwise); and so on.
. regress y group1 attitude1 price1 group2 attitude2 price2, nocons
What is this nocons option? We must remember that when we estimate the separate models, each
has its own intercept. There was an intercept in X_1, X_2, and so on. What I have done above is
literally translate
y = (X_1*d1)*b1 + (X_2*d2)*b2 + d1*u1 + d2*u2
and so included the variables group1 and group2 (variables equal to 1 for their respective groups)
and told Stata to omit the overall intercept.
I do not recommend you estimate the model the way I have just illustrated because
of numerical concerns -- we'll get to that -- but I do recommend you try it.
Estimate the models separately or jointly, and you will get the same estimates for
b_1 and b_2.
Now we can test whether the coefficients are the same for the two groups:
. test _b[attitude1]==_b[attitude2]
. test _b[price1]==_b[price2], accum
That is the Chow test. Notice that in the Chow test something was omitted: the intercept. If we
really wanted to test whether the two groups were exactly the same, we would type
. test _b[attitude1]==_b[attitude2]
. test _b[price1]==_b[price2], accum
. test _b[group1]==_b[group2], accum
Using this approach, however, we are not tied down by what the "Chow test" is able to test. We can
formulate any hypothesis we want. We might think that price works with same way in both groups,
but that attitude works differently, and each group has its own intercept. In that case, we could test
. test _b[attitude1]==_b[attitude2]
by itself. If we had more variables, we could test any subset of variables.
Is "pooling the data justified"? Of course it is: we just established that pooling the
data is just another way of estimating separate models and that estimating separate
models is certainly justified -- note that we got the same coefficients. That's why I
told you to forget the phrase about whether pooling the data is justified. People
who ask that don't really mean to ask what they are saying: they mean to ask
whether the coefficients are the same. In that case, they should say that. Pooling is
always justified, and it corresponds to nothing more than the mathematical trick of
writing separate equations,
y_1 = X_1*b_1 + u_1      (equation for group 1)
y_2 = X_2*b_2 + u_2      (equation for group 2)
as one equation
y = (X_1*d1)*b1 + (X_2*d2)*b2 + d1*u1 + d2*u2
There are a large number of ways I can write the above equation, and I want to write it a little
differently because of numerical concerns. Starting with
y = (X_1*d1)*b1 + (X_2*d2)*b2 + d1*u1 + d2*u2
let's do some algebra to obtain
y = X*b1 + (X_2*d2)*(b2-b1) + d1*u1 + d2*u2
where X = (X_1, X_2). In this formulation, I measure not b1 and b2, but b1 and (b2-b1). This is
numerically more stable, and I can still test that b2==b1 by testing whether (b2-b1)=0. To estimate
this model, I type
. regress y attitude price attitude2 price2 group2
and, if I want to test whether the coefficients are the same, I type
. test _b[attitude2]=0
. test _b[price2]=0, accum
and that will give the same answer yet again. Try it. If I want to test whether *ALL* the coefficients
are the same (including the intercept), I type
. test _b[attitude2]=0
. test _b[price2]=0, accum
. test _b[group2]=0, accum
Just as before, I can test any subset.
Using this difference formulation, if I had three groups, starting with
y = (X_1*d1)*b1 + (X_2*d2)*b2 + (X_3*d3)*b3 + d1*u1 + d2*u2 + d3*u3
I would write it as
y = X*b1 + (X_2*d2)*(b2-b1) + (X_3*d3)*(b3-b1) + d1*u1 + d2*u2 + d3*u3
and so estimate
. regress y attitude price attitude2 price2 group2 /*
*/ attitude3 price3 group3
and then if I wanted to test whether the three groups were the same in the Wald-test sense, I would
type
. test _b[attitude2]=0
. test _b[price2]=0, accum
. test _b[group2]=0, accum
. test _b[attitude3]=0, accum
. test _b[price3]=0, accum
. test _b[group3]=0, accum
which I could more easily type as
. testparm attitude2 price2 group2 attitude3 price3 group3
*/
4.4 Oaxaca decomposition
Estimating the model separately for the two populations, we now want to compare the results. Formally, a Mincer equation is estimated separately for each gender:
ln w_ig = X_ig β_g + ε_ig      (A1)
The left-hand side of (A1) is the log wage of individual i of gender g, whose determinants are included in a vector X_ig; β_g is the estimated vector of returns to the characteristics X_ig, and ε_ig is an individual error term. The average gender gap in earnings is decomposed into the mean difference in observed characteristics and the difference in the returns to these characteristics:
Δ = ln w̄_m − ln w̄_f = (X̄_m − X̄_f) β_f + (β_m − β_f) X̄_m      (A2)
The decomposition (A2) can be expressed at the mean characteristics of men (m) or of women (f).^a1 The first term of (A2) is the part of the gender pay gap that can be explained by the differences in the observed characteristics of the two groups. The second term, the unexplained component, is the portion of the gap that is due to differences in the returns to characteristics between the two groups. If all the determinants of earnings were observed in (A1), this would be equivalent to a discrimination effect^a2, i.e. gender differences in the returns to observed characteristics would be due to the discriminatory behaviour of employers. As typically not all the determinants in (A1) are observable, we refer to the second term of (A2) as the unexplained component. Introducing extra variables in the vector X reduces the unexplained part of the gender wage gap.

a1: The choice of gender with which to evaluate the decomposition (A2) is not without effect on the results, and alternative decompositions avoiding the bias of choosing one group rather than the other have been proposed (see Cotton, 1988). This debate is beyond the scope of this course, and we only report results evaluated at the mean characteristics of the female population.
In our simple example of the gender wage gap:
. su lnpay emthemp

    Variable |     Obs        Mean    Std. Dev.        Min        Max
-------------+---------------------------------------------------------
       lnpay |     299    9.876527    .5492212   7.600903    11.0021
     emthemp |     299    87.89298     34.8063          0        135

. su lnpay emthemp if p_gender==0

       lnpay |     142    9.717314    .5281482   7.600903    11.0021
     emthemp |     142    81.47183    33.59093         19        133

. su lnpay emthemp if p_gender==1

       lnpay |     157    10.02053    .5292965   7.600903    11.0021
     emthemp |     157    93.70064    34.97004          0        135

Δ = ln w̄_m − ln w̄_f = 10.02 − 9.72 = 0.30
Explained component:   (X̄_m − X̄_f) β_f = (93.70 − 81.47) × 0.00385 ≈ 0.047
Unexplained component: (β_m − β_f) X̄_m = (0.0070 − 0.0039) × 93.70 + (9.37 − 9.40) ≈ 0.26
so Δ ≈ 0.047 + 0.26 ≈ 0.30: most of the gender pay gap is not explained by the difference in labour market experience.
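A minimal Stata sketch of this decomposition, computed by hand from the regressions and means above (scalar arithmetic only; dedicated decomposition routines exist but are not needed here):
. reg lnpay emthemp if p_gender==0
. scalar bf = _b[emthemp]
. scalar af = _b[_cons]
. reg lnpay emthemp if p_gender==1
. scalar bm = _b[emthemp]
. scalar am = _b[_cons]
. su emthemp if p_gender==0
. scalar xf = r(mean)
. su emthemp if p_gender==1
. scalar xm = r(mean)
. display "explained   = " (xm - xf)*bf
. display "unexplained = " (bm - bf)*xm + (am - af)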
a2: Omitted variables in (A1) bias the estimated β_g differently for the two genders, leading to a perceived differential between genders in the returns to the observed characteristics.
Chapter 5: Heteroskedasticity and data problems
We have highlighted so far that the properties of the OLS estimates depend on the properties of the disturbance term in the regression analysis. So what happens to our estimates when the conditions imposed on the disturbance term are not satisfied?
5.1 Heteroskedasticity
So far we have assumed that:
∀i,  σ²_ui = σ²_u
This is known as homoskedasticity.
How can an observation have a variance? Since the observation only takes one value, we mean its potential behaviour before the sample is generated.
When we write the model
Y = β0 + β1 X + u
we state that the disturbance terms u1, …, un are drawn from probability distributions that have 0 mean and the same variance, although their actual values are random. Homoskedasticity states that the probability of the disturbance reaching any given positive (or negative) value is the same for all observations.
[Figures: homoskedasticity and heteroskedasticity. In each diagram the regression line Y = β1 + β2 X is drawn, with the distribution of the disturbance term shown at X1, …, X5; under homoskedasticity the distributions have equal variance, under heteroskedasticity the spread grows with X.]
If there were no disturbance term in the model, all observations would lie on the regression line. The effect of the disturbance is to shift each observation upwards or downwards. The potential distribution of the disturbance term is shown by the normal distributions drawn at each point. In the first case we have homoskedasticity: each observation has a disturbance term drawn from a distribution with the same variance.
However, in some cases we may think that the variance of the disturbance term is a function of a covariate X, so that the observations tend to lie close to the regression line for small values of X and far away from it for large values of X.
Example: the relationship between manufacturing value added and GDP (1994). When GDP is large, a 1% variation in manufacturing value added makes a great deal more difference in dollar terms than when GDP is small; hence variations in manufacturing output (expressed in $) are larger for large values of GDP. (We should really have fitted a log-log model.)
[Figure: manufacturing value added against GDP, 1994; the dispersion of the observations around the fitted line increases with GDP.]
Heteroskedasticity does:
- not bias the OLS estimates (the proof of unbiasedness does not rely on the homoskedasticity assumption);
- not bias R² (R² = 1 − σ̂²_u/σ̂²_y): since both variances are estimated unconditionally, they are not affected by bias in the conditional variance;
- bias the standard errors of the OLS estimates (so there is no valid testing).
Reminder:
b = (X'X)⁻¹ X'Y
b − β = (X'X)⁻¹ X'ε
Therefore,
(b − β)(b − β)' = (X'X)⁻¹ X' εε' X (X'X)⁻¹
and taking expectations,
E[(b − β)(b − β)'] = (X'X)⁻¹ X' E(εε') X (X'X)⁻¹
Assuming homoskedasticity we had V(ε) = σ²I, so that
E[(b − β)(b − β)'] = (X'X)⁻¹ X' σ²I X (X'X)⁻¹ = σ²(X'X)⁻¹
With heteroskedasticity, E(εε') = Σ ≠ σ²I, and
E[(b − β)(b − β)'] = (X'X)⁻¹ X' Σ X (X'X)⁻¹
which is no longer equal to σ²(X'X)⁻¹: the usual standard error formula is wrong.
5.2 Detection of heteroskedasticity
Numerous tests have been devised; we concentrate here on four that hypothesise a relationship between the variance of the disturbance term and the size of an explanatory variable.
5.2.1 Spearman rank correlation test
This test looks for a correlation between the absolute size of the residuals and the size of X in an OLS regression. Both X and e are ranked and we define D_i as the difference between the rank of X and the rank of e for observation i. The rank correlation coefficient is
r_Xe = 1 − 6 Σ_{i=1..n} D_i² / [n(n² − 1)]
Under H0 (homoskedasticity) the population correlation coefficient is 0, and r_Xe is distributed N(0, 1/(n−1)). The null hypothesis of homoskedasticity is rejected at the 5% level if |r_Xe √(n−1)| > 1.96.
From the above example, we get Σ D² = 1608, so r_Xe = 1 − (6 × 1608)/(28 × 783) = .56.
The test statistic is then t = .56 × √27 = 2.91.
t > tc so we reject H0 (homoskedasticity); we would also reject at the 1% level (tc = 2.58).
If there is more than one explanatory variable in the model, the test may be performed with any one of them.
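In Stata the test amounts to computing the rank correlation between the absolute residuals and the regressor; a hedged sketch with illustrative names y and x:
. reg y x
. predict e, resid
. gen abse = abs(e)
. spearman abse x          // rank correlation between |residual| and x, with its p-value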
5.2.2 Goldfeld-Quandt test
The idea is to split the sample and run regressions using only the tails of the distribution of X. If there is heteroskedasticity (with the variance increasing in X), the RSS in the second subsample will be larger than in the first one.
Formally, the observations are ranked by X and two subsamples (the first n' and the last n' observations) are created. The ratio RSS2/RSS1 is distributed as an F(n'−k, n'−k). Goldfeld and Quandt recommend using n' = 3n/8.
H0: RSS2 not greater than RSS1     vs.     H1: RSS2 > RSS1
If F = RSS2/RSS1 > Fc, reject H0 and accept heteroskedasticity.
Remark: if the standard deviation of the disturbance term is inversely related to X, the test statistic becomes RSS1/RSS2.
In the previous example, RSS1 = 157 and RSS2 = 13,518, so F = 86.1 > F95(9,9) = 3.18. Reject H0.
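A hedged sketch of the mechanics in Stata (y and x are illustrative names; n' is taken to be 3n/8 as recommended, and e(rss) stores the residual sum of squares after each regression):
. sort x
. count
. scalar nprime = floor(3*r(N)/8)
. reg y x if _n <= nprime
. scalar rss1 = e(rss)
. reg y x if _n > _N - nprime
. scalar rss2 = e(rss)
. display "Goldfeld-Quandt F = " rss2/rss1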
5.2.3 Glejser test
Here we do not assume that σ_ui is simply proportional to X_i; instead we investigate various alternative functional forms:
|ê_i| = γ0 + γ1 X_i^h
H0: γ1 = 0 (homoskedasticity)     vs.     H1: γ1 ≠ 0
Say you try values of the power h between −2 and 2. You then pick the specification with the best fit (highest R²) and test H0.
5.2.4 Breusch-Pagan test
Regress the squared residuals on the explanatory variables: û² = δ0 + δ1 X1 + … + δk Xk + ν.
The F statistic of this regression has an F(k, n−k−1) distribution under the null hypothesis of homoskedasticity.
H0: homoskedasticity     vs.     H1: heteroskedasticity
5.3 Robust standard errors
It is possible to correct the standard errors so that they are robust to heteroskedasticity of unknown form (White, 1980).
As seen in Lecture 2,
σ²_b1 = Σ_{i=1..n} (x_i − x̄)² σ_i² / (n σ̂²_x)²     (5.1)
White (1980) shows that the squared residual of observation i can be used as an estimator of σ_i². Thus the White (also called Huber-White, or simply robust) variance is given by
σ̂²_b1 = Σ_{i=1..n} (x_i − x̄)² û_i² / (n σ̂²_x)²     (5.2)
The argument hinges on the fact that, asymptotically, (5.2) converges towards (5.1) (by the central limit theorem).
One may wonder whether to use heteroskedasticity-consistent (robust) standard errors all the time. In fact, readers of Stock and Watson are advised to do so. One reason you may want to be cautious about using robust standard errors in all cases is that the robust standard error formula converges towards the true standard error only in large samples; in small samples, the bias induced by robust standard errors may be quite large.
Alternatively, you may decide to report both standard errors, and let the reader decide whether your conclusions are sensitive to the standard error used (Wooldridge).
The best strategy is to test for heteroskedasticity and, if it is present, use robust standard errors.
Californian school case:
* Goldfeld-Quandt test:
. sort str
. reg testscr str avginc avginc2 if _n<=157

Number of obs = 157;  F(3, 153) = 71.59;  Prob > F = 0.0000;  R-squared = 0.5840;  Adj R-squared = 0.5758;  Root MSE = 13.177
Model SS = 37291.1283 (df 3);  Residual SS = 26564.4464 (df 153);  Total SS = 63855.5747 (df 156)

     testscr |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------
         str |  -.3953029    .9133449   -0.43    0.666    -2.199698    1.409092
      avginc |   3.227624    .4699859    6.87    0.000     2.299125    4.156124
     avginc2 |  -.0327779    .0088873   -3.69    0.000    -.0503356   -.0152203
       _cons |    623.442    16.63393   37.48    0.000     590.5802    656.3039

. reg testscr str avginc avginc2 if _n>=263

Number of obs = 158;  F(3, 154) = 47.02;  Prob > F = 0.0000;  R-squared = 0.4781;  Adj R-squared = 0.4679;  Root MSE = 13.004
Model SS = 23852.7352 (df 3);  Residual SS = 26041.0929 (df 154);  Total SS = 49893.828 (df 157)

     testscr |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------
         str |   .6105634    .9286054    0.66    0.512    -1.223885    2.445012
      avginc |   4.252809    1.309941    3.25    0.001     1.665037    6.840581
     avginc2 |  -.0459702    .0420159   -1.09    0.276    -.1289721    .0370317
       _cons |   587.3276     23.1539   25.37    0.000     541.5874    633.0679
F(157-4,157-4)=RSS1/RSS2=1.02
Fc(120,120)=1.35, Fc(inf,inf)=1
No strong evidence of heteroskedasticity.
* Breusch-Pagan test
. reg testscr str avginc avginc2

Number of obs = 420;  F(3, 416) = 179.23;  Prob > F = 0.0000;  R-squared = 0.5638;  Adj R-squared = 0.5607;  Root MSE = 12.629
Model SS = 85759.8879 (df 3);  Residual SS = 66349.7057 (df 416);  Total SS = 152109.594 (df 419)

     testscr |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------
         str |  -.9099512    .3373245   -2.70    0.007    -1.573024   -.2468782
      avginc |   3.881859     .302214   12.84    0.000     3.287802    4.475916
     avginc2 |   -.044157    .0062511   -7.06    0.000    -.0564448   -.0318692
       _cons |   625.2308    7.301822   85.63    0.000     610.8777    639.5839
. predict temp, resid
. gen e2 = temp^2
. reg e2 str avginc avginc2

Number of obs = 420;  F(3, 416) = 2.12;  Prob > F = 0.0974;  R-squared = 0.0150;  Adj R-squared = 0.0079;  Root MSE = 233.34
Model SS = 345803.306 (df 3);  Residual SS = 22650470.8 (df 416);  Total SS = 22996274.1 (df 419)

          e2 |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------
         str |   3.403189    6.232568    0.55    0.585    -8.848063    15.65444
      avginc |  -4.656416    5.583849   -0.83    0.405    -15.63249     6.31966
     avginc2 |   .0211701    .1154992    0.18    0.855    -.2058646    .2482049
       _cons |   156.3866    134.9119    1.16    0.247    -108.8074    421.5807
F(3,416) = 2.12 with a p-value of 0.097, so F < Fc at the 5% level and we cannot reject the null hypothesis of homoskedasticity.
Stata can directly implement the chi-squared version of this test: just type hettest after the initial regression (need to check H0 and the p-value there…)
. reg testscr str avginc avginc2

Number of obs = 420;  F(3, 416) = 179.23;  Prob > F = 0.0000;  R-squared = 0.5638;  Adj R-squared = 0.5607;  Root MSE = 12.629
Model SS = 85759.8879 (df 3);  Residual SS = 66349.7057 (df 416);  Total SS = 152109.594 (df 419)

     testscr |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------
         str |  -.9099512    .3373245   -2.70    0.007    -1.573024   -.2468782
      avginc |   3.881859     .302214   12.84    0.000     3.287802    4.475916
     avginc2 |   -.044157    .0062511   -7.06    0.000    -.0564448   -.0318692
       _cons |   625.2308    7.301822   85.63    0.000     610.8777    639.5839
. reg testscr str avginc avginc2, robust

Regression with robust standard errors
Number of obs = 420;  F(3, 416) = 286.55;  Prob > F = 0.0000;  R-squared = 0.5638;  Root MSE = 12.629

     testscr |      Coef.   Robust Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+-----------------------------------------------------------------------
         str |  -.9099512          .3545374   -2.57    0.011    -1.606859   -.2130432
      avginc |   3.881859          .2709564   14.33    0.000     3.349245    4.414474
     avginc2 |   -.044157          .0049606   -8.90    0.000     -.053908     -.034406
       _cons |   625.2308          7.087793   88.21    0.000     611.2984    639.1631
Not surprisingly, in this case the difference between the robust and the non-robust standard errors is not very large, since there is not much heteroskedasticity (check the data as well).
[Figure: test score against student-teacher ratio.]
5.4 Stochastic regressors
So far we have assumed that the regressors are non-stochastic, i.e. they do not have a random component: their values are fixed and are not affected by the way the sample is selected.
Remember that we know:
b2 = β2 + cov(X, u)/var(X)
E(b2) = β2 + E[cov(X, u)/var(X)]
If X is non-stochastic, this equals β2 + E[cov(X, u)]/var(X) = β2.
However, if X is stochastic, we no longer have E(var(X)) = var(X). Can we still prove that OLS is an unbiased estimator?
cov(X, u)/var(X) = [(1/n) Σ_{i=1..n} (X_i − X̄)(u_i − ū)] / var(X) = (1/n) Σ_{i=1..n} [(X_i − X̄)/var(X)] (u_i − ū)
E[cov(X, u)/var(X)] = (1/n) Σ_{i=1..n} E[(X_i − X̄)/var(X)] E(u_i − ū) = 0
since E(u_i − ū) = 0 and, provided X is distributed independently of u, the expectation of the product factorises.
So b2 is still an unbiased estimator.
5.5 Omitted variables and proxy variables
5.5.1 Omitted variable bias
Say that the true model is:
Y = β0 + β1 X1 + β2 X2 + u
However, you estimate:
Ŷ = b0 + b1 X1      (5.3)
where b1 = cov(X1, Y)/var(X1), rather than the correct expression
b1 = [cov(X1, Y) var(X2) − cov(X2, Y) cov(X1, X2)] / [var(X1) var(X2) − cov(X1, X2)²]
The OLS estimate b1 is then biased:
b1 = cov(X1, β0 + β1 X1 + β2 X2 + u)/var(X1)
   = [cov(X1, β0) + β1 var(X1) + β2 cov(X1, X2) + cov(X1, u)] / var(X1)
   = β1 + β2 cov(X1, X2)/var(X1) + cov(X1, u)/var(X1)
so if X1 and X2 are non-stochastic: E(b1) = β1 + β2 cov(X1, X2)/var(X1)
In case of an omitted variable, OLS will be biased unless cov(X1, X2) = 0 (or β2 = 0).
In a multivariate case, it is more difficult to predict the impact of omitted variable bias mathematically.
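A small simulation sketch illustrating the formula (all numbers are illustrative): with cov(X1,X2) = 0.5·var(X1) and β2 = 1, the slope of the short regression should converge to β1 + 0.5:
. clear
. set seed 12345
. drawnorm x1 v e, n(5000)          // independent standard normals
. gen x2 = 0.5*x1 + v               // so cov(x1,x2) = 0.5*var(x1)
. gen y = 1 + x1 + x2 + e           // true model: beta1 = beta2 = 1
. reg y x1 x2                       // both slopes close to 1
. reg y x1                          // omitted variable bias: slope close to 1.5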
5.5.2 Adding non-necessary variables
Suppose the true model is Y = β0 + β1 X1 + u but you estimate Y = β0 + β1 X1 + β2 X2 + u.
The OLS estimate will still be unbiased, but it is no longer efficient since
σ²_b1 = [σ²_u / (n var(X1))] × 1/(1 − r²_{x1,x2})     rather than     σ²_b1 = σ²_u / (n var(X1))
Furthermore, adding more covariates may lead to problems of multicollinearity: if the covariates are highly correlated, the variance of the estimates will be large.
5.5.3 Proxy variables
If a variable that you would like to introduce in your analysis is missing, rather than ignoring it altogether (omitted variable bias), it may be possible to use a proxy variable. A proxy variable is correlated with the unobserved variable; say the relationship is of the following form:
X2 = λ0 + λ1 Z
So the estimated model becomes:
Y = β0 + β1 X1 + β2 (λ0 + λ1 Z)
- b1 will be the same as if X2 had been used in the regression.
- R² will be the same as if X2 had been used in the regression.
- bZ is an estimate of β2 λ1, not of β2.
- The t-statistic for Z is the same as the one that would have been obtained for X2.
- The intercept is an estimate of β0 + β2 λ0, not of β0.
For example, socioeconomic status and income are proxy variables for each other.
5.6 Measurement error
5.6.1 Measurement error in the explanatory variables
Suppose the true relationship is
Y_i = β0 + β1 Z_i + ε_i
however, Z cannot be measured accurately; all that is observable is X:
X_i = Z_i + w_i
We suppose that w ~ N(0, σ²_w) and that Z has variance σ²_Z.
Then we have Y_i = β0 + β1 X_i + ε_i − β1 w_i; let's define u_i = ε_i − β1 w_i.
So the OLS estimate is:
b1 = cov(X, Y)/var(X) = β1 + cov(X, u)/var(X)
But we know that cov(X, u) ≠ 0, since X_i and u_i are correlated (both contain w_i). Even in the case of an infinite sample, this estimate would still be inconsistent. The bias can be calculated:
cov(X, u) = cov(Z + w, ε − β1 w)
          = cov(Z, ε) − β1 cov(Z, w) + cov(w, ε) − β1 cov(w, w)
          = 0 − β1·0 + 0 − β1 σ²_w = −β1 σ²_w
thus, in large samples, the bias of the OLS estimate is
b1 − β1 → − [σ²_w / (σ²_Z + σ²_w)] β1
i.e. the estimate is biased towards zero (attenuation).
A solution to deal with measurement error is to use an alternative estimator (instrumental variables).
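A small simulation sketch of the attenuation bias (illustrative numbers): with σ²_Z = σ²_w = 1 and β1 = 2, the slope on the mismeasured regressor should converge to 2 × 1/(1+1) = 1:
. clear
. set seed 12345
. drawnorm z w e, n(5000)        // independent standard normals
. gen y = 1 + 2*z + e            // true model in terms of z
. gen x = z + w                  // observed regressor, measured with error
. reg y z                        // consistent: slope near 2
. reg y x                        // attenuated: slope near 1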
5.6.2 Measurement error in the dependent variable
Measurement error in the dependent variable does not matter as much, as it can be thought of as contributing to the error term. However, by increasing the amount of noise in the model, it will tend to increase the standard errors.
Say that the true relationship is: Q_i = β0 + β1 X_1i + ε_i
However, we only observe: Y_i = Q_i + r_i
Hence, we model: Y_i = β0 + β1 X_1i + ε_i + r_i
The population variance of the slope estimator becomes
σ²_b1 = (σ²_ε + σ²_r) / (n σ²_x1)
instead of σ²_ε / (n σ²_x1).
5.7 Sample selectivity
Sample selection occurs when the availability of the data is influenced by a selection process that is related to the value of the dependent variable. This selection introduces correlation between the error term and the regressors and therefore leads to biased OLS estimates.
Example: you are interested in the returns to education and want to regress log wage on schooling. However, only working individuals have a wage, and the probability of working is not independent of schooling, nor of the error term. The simple fact that someone has a job, and thus appears in the dataset, provides information that the error term in the regression is positive (on average) and could be correlated with the regressors; this leads to a biased OLS estimator.
Solutions to sample selectivity will be seen at the end of this course (or in Econometrics II).
5.8 Reversed causality
So far we have assumed that causality runs from the regressors to the dependent variable (X causes Y). But what if causality also runs the other way (Y causes X)? This is called reversed causality; in such cases the OLS estimator is biased and inconsistent.
Suppose we have the simultaneous system:
Y_i = β0 + β1 X_i + u_i
X_i = γ0 + γ1 Y_i + ε_i
The OLS estimate is: b1 = β1 + cov(X_i, u_i)/var(X_i), and
cov(X_i, u_i) = cov(γ0 + γ1 Y_i + ε_i, u_i)
              = γ1 cov(Y_i, u_i) + cov(ε_i, u_i)
              = γ1 cov(β0 + β1 X_i + u_i, u_i) + 0
              = γ1 β1 cov(X_i, u_i) + γ1 σ²_u
so that
cov(X_i, u_i) = γ1 σ²_u / (1 − γ1 β1)
Thus, in case of reverse causality, OLS will be an inconsistent estimator.
To deal with simultaneity bias, we can rely either on IV or on experimental data, in order to shut down one of the directions of causality.
Chapter 6: Instrumental variables regression
As seen previously, in the case of omitted variables, reversed causality or measurement error, OLS estimates can be biased. IV is a general way to obtain consistent estimators of the unknown population coefficients when the regressor X is correlated with the error term u.
6.1 Model and assumptions
To understand IV, assume that X is composed of two parts: one that is correlated with u, and one that is orthogonal to the error. IV allows you to isolate the part of X that is uncorrelated with u. The information about the exogenous component of X is gathered thanks to a (group of) variable(s) called the instrument(s).
Y_i = β0 + β1 X_i + u_i     where corr(X, u) ≠ 0
Due to omitted variables, X and u are correlated, so OLS is inconsistent. Instrumental variables estimation uses an additional variable Z (the instrumental variable) to isolate the part of X that is uncorrelated with u_i.
Variables correlated with the error term are called endogenous, while those uncorrelated with it are called exogenous.
Conditions for a valid instrument:
- corr(Z_i, X_i) ≠ 0     (the instrument is relevant)
- corr(Z_i, u_i) = 0     (the instrument is exogenous)
If an instrument is relevant, then variation in the instrument is related to variation in X_i. If in addition the instrument is exogenous, then the part of the variation of X_i captured by the instrumental variable is exogenous. Thus the instrument captures movements in X_i that are exogenous, and can therefore be used to estimate β1.
6.2 Two-stage least squares (2SLS) estimator
The first stage decomposes X into two components:
X_i = π0 + π1 Z_i + ν_i
ν_i is the component of X correlated with u; π0 + π1 Z_i is the component of X independent of u.
In the second stage, the component of X independent of u is used to estimate β1, so we regress:
Y_i = β0 + β1 X̂_i + u_i
The second stage is complicated by the fact that π̂0 and π̂1 are estimates, and thus the standard errors need to be corrected.
In large samples, the 2SLS estimator is consistent and normally distributed.
Example 1: despite all of our care, the estimate of the effect of the pupil-teacher ratio on test scores may still be biased if it suffers from omitted variable bias. As an instrument, we would like a variable that is correlated with the PTR but not with any other variables affecting student performance; it would then be exogenous, since it is uncorrelated with the error term.
California is affected by earthquakes, which may damage some schools but not others. An affected school will close down, and other schools in the area will have to double up to accommodate the extra students. Hence, distance to the earthquake epicentre may be considered a valid instrument. (More examples to follow.)
In the simple case of a one-variable model, the 2SLS estimator is given by a simple formula.
The second stage is estimated by OLS, so we know that:
β̂1 = cov(X̂, Y)/var(X̂)     where X̂ = π̂0 + π̂1 Z
Hence,
cov(X̂, Y) = π̂1 cov(Z, Y)
var(X̂) = π̂1² var(Z)
Since π̂1 is estimated by OLS in the first stage of the 2SLS, we know that π̂1 = cov(Z, X)/var(Z).
Hence: β̂1^2SLS = cov(Z, Y)/cov(Z, X)
Is the 2SLS estimator consistent?
cov(Z, Y) = 1/(n−1) Σ_{i=1..n} (Z_i − Z̄)(Y_i − Ȳ)
          = 1/(n−1) Σ_{i=1..n} (Z_i − Z̄)[β1 (X_i − X̄) + (u_i − ū)]
          = β1 cov(Z, X) + 1/(n−1) Σ_{i=1..n} (Z_i − Z̄)(u_i − ū)
          = β1 cov(Z, X) + 1/(n−1) Σ_{i=1..n} (Z_i − Z̄) u_i
hence we have:
β̂1^2SLS = β1 + [(1/n) Σ_{i=1..n} (Z_i − Z̄) u_i] / [(1/n) Σ_{i=1..n} (Z_i − Z̄)(X_i − X̄)]
(multiplying numerator and denominator by (n−1)/n).
When the sample is large, Z̄ ≈ μ_Z, i.e. the sample mean is equivalent to the population mean.
Let's define q_i = (Z_i − μ_Z) u_i; since Z is exogenous, E(q_i) = 0.
The variance of q_i is σ²_q = var[(Z_i − μ_Z) u_i], and var(q̄) = σ²_q̄ = σ²_q / n.
The CLT implies that q̄ / σ_q̄ ~ N(0, 1).
Hence β̂1^2SLS ≈ β1 + q̄ / cov(Z_i, X_i), and in large samples it is distributed approximately like a normal, N(β1, σ²_2SLS).
The CLT therefore implies that the 2SLS estimator is normally distributed, so t-tests can be used to test the significance of the estimate. The variance of the estimator is given by:
σ²_2SLS = (1/n) var[(Z_i − μ_Z) u_i] / [cov(Z_i, X_i)]²
Application: the demand for cigarettes.
Taxing cigarettes should reduce consumption and discourage smokers; how much a tax affects consumption depends on the price elasticity of the demand for cigarettes. If the elasticity is −0.5, price must rise by 40% to reduce consumption by 20%. Using data by state for 1985-95, we try to estimate this elasticity. We believe that price is endogenous (reverse causality between P and Q), hence we need to find an instrument. We rely on the sales tax. The tax is correlated with the total price, hence it satisfies the first condition for an instrument, but is it independent of the error term? The argument here centres on the fact that sales taxes vary across states for policy reasons (differences in the mix between general taxation, differences in social choices, …), so we think that the sales tax is uncorrelated with the error term.
1st stage:
. reg lnprice saletax if year==1995

Number of obs = 48;  F(1, 46) = 40.96;  Prob > F = 0.0000;  R-squared = 0.4710;  Adj R-squared = 0.4595;  Root MSE = .09394
Model SS = .361461579 (df 1);  Residual SS = .40597912 (df 46);  Total SS = .7674407 (df 47)

     lnprice |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------
     saletax |   .0201633    .0031507    6.40    0.000     .0138213    .0265053
       _cons |   5.037885    .0291078  173.08    0.000     4.979294    5.096476

 ( 1)  saletax = 0
       F(  1,    46) =   40.96
            Prob > F =  0.0000
Variation in the sales tax explains 47% of the variance of cigarette prices across states.
2nd stage:

Number of obs = 48;  F(1, 46) = 8.28;  Prob > F = 0.0061;  R-squared = 0.1525;  Adj R-squared = 0.1341;  Root MSE = .22645
Model SS = .424413989 (df 1);  Residual SS = 2.35880879 (df 46);  Total SS = 2.78322278 (df 47)

         lnq |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------
    lnpriceh |  -1.083587    .3766486   -2.88    0.006    -1.841741   -.3254326
       _cons |   10.17643    1.959869    5.19    0.000     6.231423    14.12145
If we believe our instrument, the elasticity of demand for cigarettes is surprisingly high. Our estimate may also be affected by omitted variable bias if, for example, the sales tax depends on income (states with higher average income may rely less on sales taxes and more on income taxes), and we know that consumption of cigarettes is correlated with income. We need to move to a more general IV model.
How does this compare with the OLS estimate:
. reg lnq lnprice if year==1995

      Source |       SS       df       MS          Number of obs =      48
-------------+----------------------------         F(  1,    46) =   31.41
       Model |  1.12929422     1  1.12929422       Prob > F      =  0.0000
    Residual |  1.65392856    46  .035954969       R-squared     =  0.4058
-------------+----------------------------         Adj R-squared =  0.3928
       Total |  2.78322278    47  .059217506       Root MSE      =  .18962

         lnq |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+--------------------------------------------------------------
     lnprice |  -1.213057   .2164497    -5.60   0.000    -1.648748   -.7773661
       _cons |   10.85003   1.126459     9.63   0.000     8.582585    13.11748
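For reference, a minimal sketch of how the two stages above can be reproduced in Stata, together with the equivalent one-line 2SLS command (variable names as in the output above; the underlying dataset itself is not reproduced in these notes):

* first stage: regress the endogenous price on the instrument
reg lnprice saletax if year==1995
predict lnpriceh if e(sample)                 // fitted values of log price
* second stage: regress quantity on the fitted price (standard errors not corrected)
reg lnq lnpriceh if year==1995
* 2SLS in one step, with correctly computed standard errors
ivreg lnq (lnprice = saletax) if year==1995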
5.3 General IV model
In general, a model will include a set of exogenous variables (W) that are not correlated with u and an endogenous variable (X) that needs to be instrumented by (Z). If there is more than one endogenous variable, we need at least as many instruments as there are endogenous variables. The regression is exactly identified if there are as many instruments as endogenous variables, and over-identified if more instruments are available. An under-identified model cannot be estimated.
The model of interest is therefore:
Y_i = \beta_0 + \beta_1 X_1 + \beta_2 W_2 + \ldots + \beta_r W_r + u
The first stage is modelled as:
X_1 = \pi_0 + \pi_1 Z_1 + \ldots + \pi_m Z_m + \pi_{m+1} W_2 + \ldots + \pi_{m+r} W_r + \nu
Note that all exogenous variables (W) are included in the first-stage regression. If there is more than one endogenous variable, each endogenous variable is estimated in a similar first-stage regression, and all predicted values are included in the second stage.
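As an illustration, a minimal sketch of this general model in Stata; all variable names here (y, x1, z1, z2, w2, w3) are hypothetical:

* 2SLS with one endogenous regressor, two instruments and two exogenous controls
ivreg y w2 w3 (x1 = z1 z2)
* equivalently, by hand: the first stage must include ALL the exogenous variables
reg x1 z1 z2 w2 w3
predict x1hat
reg y x1hat w2 w3        // second stage; these standard errors are not the correct 2SLS ones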
5.4 Instruments' validity
Invalid instruments produce meaningless results; it is therefore crucial to test the instruments' validity.
5.4.1 Relevance
The more relevant the instrument, the more exogenous variation in X is captured. A more relevant instrument produces a more accurate estimator.
Remember:
\hat{\beta}_1^{2SLS} = \beta_1 + \frac{\frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})u_i}{\frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})(X_i - \bar{X})}
Inference with IV hinges on the 2SLS estimator having an approximately normal distribution. As with a larger sample size, the higher the correlation between Z and X, the better the normal approximation. Instruments that explain little of the variation in X are called weak instruments. If the instrument is weak, the normal approximation is poor even if the sample is large, and the 2SLS estimator is biased.
In large samples, \bar{Z} and \bar{X} are close to the population means \mu_Z and \mu_X. As previously, let us define:
\bar{q} = \frac{1}{n}\sum_{i=1}^{n} q_i, \quad q_i = (Z_i - \mu_Z)u_i, \qquad \bar{r} = \frac{1}{n}\sum_{i=1}^{n} r_i, \quad r_i = (Z_i - \mu_Z)(X_i - \mu_X)
Hence:
\hat{\beta}_1^{2SLS} - \beta_1 \cong \frac{\bar{q}}{\bar{r}} = \frac{\bar{q}/\sigma_{\bar{r}}}{\bar{r}/\sigma_{\bar{r}}}
since \sigma^2_{\bar{r}} = \frac{1}{n}\sigma^2_r.
If the instrument is irrelevant, E(r_i) = cov(Z_i, X_i) = 0, and the central limit theorem applies to the denominator, so that \bar{r}/\sigma_{\bar{r}} \sim N(0,1).
When the instrument is irrelevant, the bias term is therefore a function of the ratio of two normally distributed random variables, which are correlated (since X and u are correlated).
When the instrument is weak, the distribution of the 2SLS estimator is still non-normal.
It can be shown that:
\hat{\beta}_1^{2SLS} - \beta_1 \cong \frac{\hat{\beta}_1^{OLS} - \beta_1}{E(F) - 1}
where E(F) is the expectation of the first-stage F statistic.
If E(F) = 10, then the large-sample bias of 2SLS relative to the large-sample bias of OLS is 1/9, or just over 10%.
As a rule of thumb, instruments are called weak when the F-statistic on the coefficients of the instruments in the first-stage regression is less than 10.
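A minimal sketch of this check in Stata, using the same hypothetical variable names as above:

reg x1 z1 z2 w2 w3          // first-stage regression
test z1 z2                  // joint F-test on the instruments only
* worry about weak instruments if the resulting F-statistic is below 10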
Problems with weak instruments: an example.
Angrist and Krueger (1991) rely on quarter of birth as an instrument for education (via the school leaving age, SLA). However, Bound, Jaeger and Baker (1995) noticed that in some regressions the instruments were weak (F-test = 2). They then ran the same regression using a randomly generated quarter of birth and found results similar to those of Angrist and Krueger. Since the sample size used was extremely large, the estimates were statistically significant...
If you find yourself with weak instruments, it is possible to use the Limited Information Maximum Likelihood estimator, which under these circumstances performs better than 2SLS (see Hayashi, 2000, for example).
5.4.2 Instrument exogeneity
If the instrument is not exogenous, it is related to the error term and therefore cannot capture the exogenous component of X.
Unfortunately, there is no statistical test of the exogeneity of the instrument, so you have to rely on judgement and argument.
If you have more instruments than endogenous variables, you can run an overidentifying restrictions test.
Imagine that you have 2 instruments and 1 endogenous variable. You could run 2SLS using instrument 1 or instrument 2; if the instruments are both valid (exogenous), the estimates should not be too different (since they both capture the exogenous component of X). If they are very different, then at least one of the instruments is not exogenous.
In fact, rather than estimating the various combinations of the instruments, the overidentification test simply consists of estimating the following regression and testing that the coefficients on the instruments are jointly insignificant:
\hat{u}_i^{2SLS} = \delta_0 + \delta_1 Z_{1i} + \ldots + \delta_m Z_{mi} + \delta_{m+1} W_{1i} + \ldots + \delta_{m+r} W_{ri} + e_i
H_0: \delta_1 = \ldots = \delta_m = 0
The statistic J = mF is, in large samples, distributed as a \chi^2_{m-k}, where m - k is the degree of overidentification (m instruments, k endogenous regressors).
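A minimal sketch of this test in Stata, continuing with the hypothetical names used above (two instruments, one endogenous regressor, so m - k = 1):

ivreg y w2 w3 (x1 = z1 z2)
predict u2sls, resid
* regress the 2SLS residuals on all instruments and exogenous variables
reg u2sls z1 z2 w2 w3
test z1 z2                         // joint F-test on the m = 2 instruments
display "J = " 2*r(F)              // compare with the critical value of a chi2(1)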
5.5 Examples of IV
5.5.1 Example from the returns to education literature
Measurement error: b is biased downward (measurement error is around 10%).
Omitted variable bias: b is biased upward.
There are two avenues to find a good instrument:
- Economic theory (for example, differences in cost that are independent of ability or earnings)
  - college proximity (Kane and Rouse, 1993)
- Institutional constraints (natural experiments)
  - quarter of birth (Angrist and Krueger, 1991)
  - change in compulsory schooling law (Harmon and Walker, 1995)
  - war draft (Angrist and Krueger, 1992)
[table 4, Hdbook, p1835-36]
Econometric theory had predicted that, due to omitted ability, OLS estimates were biased upwards; however, all IV estimates are at least 30% higher than OLS. How can we explain this puzzle?
- The omitted variable bias is small compared to the measurement error bias, so OLS is biased downwards. However, since measurement error is around 10%, how can the IV estimates be 30% higher than OLS?
- Publication bias: Ashenfelter and Harmon (1998). Researchers and publishers prefer high point estimates (there is evidence of a positive relationship between (b_IV - b_OLS) and \sigma^2_{IV}).
- Heterogeneity in the returns to education: if the individuals affected by the instruments have higher than average returns to education, then IV returns will be higher than OLS returns, which are based on the mean. A condition for this to happen is that the marginal rate of return to education is negatively correlated with educational attainment.
5.5.2 A more detailed example
[Table not reproduced in these notes: 2SLS estimates of the demand for cigarettes, columns (1)-(3), using the general sales tax and the cigarette-specific tax as instruments.]
To avoid state-specific effects, we use differences over a 10-year period rather than annual variables. This strategy allows us to eliminate any state fixed effects. We therefore measure the long-term elasticity.
The first stage reveals that the instruments are significant.
Are the instruments exogenous? In column (3), the J-stat = 4.93 > critical chi2 value = 3.84, so we reject H0 and conclude that at least one instrument is endogenous. The estimates from columns (1) and (2) are too dissimilar to accept that both instruments have captured the exogenous component of X. One may argue that the general sales tax is more likely to be exogenous than the cigarette-specific tax (for example, if consumption is reduced, the pro-smoking lobby will lose power relative to the anti-smoking lobby, and politicians may give in and increase the cigarette tax).
5.6 LATE
So far we have assumed that causal effects are the same for all individuals. Imbens and Angrist (1994) and Angrist, Imbens and Rubin (1996) have suggested that the IV estimate can be interpreted as the effect of the treatment on the subpopulation of treated individuals who changed behaviour because of the instrument, that is:
\beta = E(Y_{i1} \mid D_i = 1, Z_i = 1) - E(Y_{i0} \mid D_i = 0, Z_i = 0) \qquad (2.6)
They call this interpretation the Local Average Treatment Effect (LATE). LATE requires some assumptions.
Assumption 1: Stable Unit Treatment Value
If Zi=Zi’ then Di(Z)=Di(Z’)
If Zi=Zi’ and Di=Di’ then Yi (Z,D)=Yi (Z’,D’)
For individual i, the decision to obtain treatment and the outcome are independent of the
instrument, treatment and outcome of other individuals.
Assumption 2: Random Assignment
Pr(Z_i = z) = Pr(Z_i = z')
Assumption 3: Exclusion Restriction
\forall Z, Z', D: \; Y(Z, D) = Y(Z', D)
The instrument affects the outcome only through the treatment; Z has no direct effect on Y.
Assumption 4: Nonzero average causal effect of Z on D
E[D_i(Z = 1) - D_i(Z = 0)] \neq 0
Imbens and Angrist's (1994) contribution is the final assumption:
Assumption 5: Monotonicity
\forall i, \; D_i(1) \geq D_i(0)
The population can be divided into four groups:
D(0)=0, D(1)=0  =>  Y(0,0)=Y(1,0)   Never taker
D(0)=0, D(1)=1  =>  Y(0,0)<Y(1,1)   Complier
D(0)=1, D(1)=0  =>  Y(0,1)>Y(1,0)   Defier
D(0)=1, D(1)=1  =>  Y(0,1)=Y(1,1)   Always taker
The monotonicity assumption rules out the defier category. LATE is therefore identified by compliers only (similar to a fixed effects model).
Proof of LATE as in Angrist and Krueger (1999):
E[Y_i \mid Z_i = 1] = E[Y_{i0} + (Y_{i1} - Y_{i0})D_i \mid Z_i = 1] = E[Y_{i0} + (Y_{i1} - Y_{i0})D_{i1}] \quad \text{by independence}
Similarly,
E[Y_i \mid Z_i = 0] = E[Y_{i0} + (Y_{i1} - Y_{i0})D_{i0}]
Hence,
E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0] = E[(Y_{i1} - Y_{i0})(D_{i1} - D_{i0})] = E[(Y_{i1} - Y_{i0}) \mid D_{i1} > D_{i0}] \cdot P(D_{i1} > D_{i0}) \quad \text{by monotonicity}
Similarly,
E[D_i \mid Z_i = 1] - E[D_i \mid Z_i = 0] = E[D_{i1} - D_{i0}] = P(D_{i1} > D_{i0})
Thus,
\frac{E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]}{E[D_i \mid Z_i = 1] - E[D_i \mid Z_i = 0]} = E[Y_{i1} - Y_{i0} \mid D_{i1} > D_{i0}]
The treatment effect identified is an average for those who can be induced to change participation status by a change in the instrument.
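With a single binary instrument and no covariates, this ratio is just the Wald estimator, which 2SLS reproduces. A minimal sketch in Stata with hypothetical variable names (y outcome, d binary treatment, z binary instrument):

reg y z              // numerator: E[Y|Z=1] - E[Y|Z=0] is the slope on z
reg d z              // denominator: E[D|Z=1] - E[D|Z=0] is the slope on z
ivreg y (d = z)      // the 2SLS coefficient on d equals the ratio of the two slopes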
Dilemma:
If the instrument is specific, then there is little scope for extrapolating the returns to the treatment to the general population (Heckman's argument).
If the instrument is generic, the monotonicity assumption is likely to be rejected.
Chapter 6
Simultaneous equations estimation
Most of the material for this chapter has been covered in the last two weeks, so this is only a reminder of why OLS estimates are biased in a system of equations and how to use IV to get unbiased estimates.
6.1 Simultaneous equations
So far we have assumed that causality runs from the regressors to the dependent variable (X causes Y), but what if causality also runs the other way (Y causes X)? This is called reverse causality; in such cases the OLS estimator is biased and inconsistent.
Suppose we have the simultaneous system:
Y_i = \alpha_0 + \alpha_1 X_i + \alpha_2 W_i + u_i \qquad (6.1)
X_i = \beta_0 + \beta_1 Y_i + \beta_2 K_i + \varepsilon_i
Endogenous variables are determined within the system of equations whilst exogenous variables are determined outside it; here X and Y are endogenous variables, and W and K are exogenous variables. The exogenous variables and the disturbance terms eventually determine the values of the endogenous variables when the system is solved; this is usually referred to as the reduced form of the system.
The reduced forms of (6.1) are obtained by substituting X by its value in the first equation and similarly Y by its value in the second equation:
Y_i = \alpha_0 + \alpha_1(\beta_0 + \beta_1 Y_i + \beta_2 K_i + \varepsilon_i) + \alpha_2 W_i + u_i
Y_i = \frac{1}{1 - \alpha_1\beta_1}\left(\alpha_0 + \alpha_1\beta_0 + \alpha_1\beta_2 K_i + \alpha_2 W_i + \alpha_1\varepsilon_i + u_i\right) \qquad (6.2)
X_i = \beta_0 + \beta_1(\alpha_0 + \alpha_1 X_i + \alpha_2 W_i + u_i) + \beta_2 K_i + \varepsilon_i
X_i = \frac{1}{1 - \alpha_1\beta_1}\left(\beta_0 + \beta_1\alpha_0 + \beta_1\alpha_2 W_i + \beta_2 K_i + \beta_1 u_i + \varepsilon_i\right) \qquad (6.3)
In the reduced form equations, Y and X are expressed as functions of exogenous variables and error terms only. Since X is a function of u, the first equation of (6.1) clearly breaks the independence assumption (cov(X, u) = 0). Thus, if we were to estimate this equation by OLS, b_1 would be a biased estimate of \alpha_1. Similarly, Y and \varepsilon are correlated, and the OLS estimate of \beta_1 would also be biased.
The OLS estimate is:
b_1 = \alpha_1 + \frac{cov(X_i, u_i)}{var(X_i)}
cov(X_i, u_i) = cov(\beta_0 + \beta_1 Y_i + \beta_2 K_i + \varepsilon_i,\; u_i) = \beta_1 cov(Y_i, u_i) + cov(\varepsilon_i, u_i)
 = \beta_1 cov(\alpha_0 + \alpha_1 X_i + \alpha_2 W_i + u_i,\; u_i) + 0 = \alpha_1\beta_1 cov(X_i, u_i) + \beta_1\sigma^2_u
\Rightarrow cov(X_i, u_i) = \frac{\beta_1 \sigma^2_u}{1 - \alpha_1\beta_1}
var(X_i) = var\left[\frac{1}{1 - \alpha_1\beta_1}\left(\beta_0 + \beta_1\alpha_0 + \beta_1\alpha_2 W_i + \beta_2 K_i + \beta_1 u_i + \varepsilon_i\right)\right]
 = \frac{1}{(1 - \alpha_1\beta_1)^2}\, var\left(\beta_1\alpha_2 W_i + \beta_2 K_i + \beta_1 u_i + \varepsilon_i\right)
Since W, K, u and \varepsilon are exogenous and mutually uncorrelated, all their covariances are 0, hence:
var(X_i) = \frac{1}{(1 - \alpha_1\beta_1)^2}\left(\beta_1^2\alpha_2^2\sigma^2_W + \beta_2^2\sigma^2_K + \beta_1^2\sigma^2_u + \sigma^2_\varepsilon\right)
Thus,
b_1 = \alpha_1 + \frac{(1 - \alpha_1\beta_1)\,\beta_1\sigma^2_u}{\beta_1^2\alpha_2^2\sigma^2_W + \beta_2^2\sigma^2_K + \beta_1^2\sigma^2_u + \sigma^2_\varepsilon}
b_1 is an inconsistent estimator of \alpha_1.
A Monte Carlo Experiment
Say that we have the following true model:
p = 1.5 + 0.5 w + u_p
w = 2.5 + 0.5 p + 0.4 U + u_w
(with \alpha_0 = 1.5, \alpha_1 = 0.5, \beta_0 = 2.5, \beta_1 = 0.5 and \beta_2 = 0.4 in the notation of the previous section). Remember, we know that the reduced forms are:
p = \frac{1}{1 - \alpha_1\beta_1}\left(\alpha_0 + \alpha_1\beta_0 + \alpha_1\beta_2 U + \alpha_1 u_w + u_p\right)
w = \frac{1}{1 - \alpha_1\beta_1}\left(\beta_0 + \beta_1\alpha_0 + \beta_2 U + \beta_1 u_p + u_w\right)
p and w are thus both affected by variation in u_p; the induced change in p is 1/\beta_1 times the change in w, so the affected observations are shifted along a line with slope 1/\beta_1. OLS is a compromise between the slope of the true relationship (\alpha_1) and the slope of the shift line (1/\beta_1).
A sample of 20 observations is generated, 10 times:
      Source |       SS       df       MS          Number of obs =      20
-------------+----------------------------         F(  1,    18) =   23.70
       Model |  16.1992808     1  16.1992808       Prob > F      =  0.0001
    Residual |  12.3040439    18  .683557993       R-squared     =  0.5683
-------------+----------------------------         Adj R-squared =  0.5443
       Total |  28.5033247    19  1.50017498       Root MSE      =  .82678

           p |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+--------------------------------------------------------------
           w |   1.031237   .2118354     4.87   0.000     .5861878    1.476287
       _cons |   .4081077   .4563612     0.89   0.383    -.5506717    1.366887
when we know that the true slope and intercept are 0.5 and 1.5. Replicating this 10 times, we get:

  sample      b0    se(b0)      b1    se(b1)
       1    0.41      0.46    1.03      0.21
       2    0.45      0.38    1.06      0.17
       3    0.65      0.27    0.94      0.12
       4    0.41      0.39    0.98      0.19
       5    0.92      0.46    0.77      0.22
       6    0.26      0.35    1.09      0.16
       7    0.32      0.39    1.00      0.19
       8    1.06      0.38    0.82      0.16
       9   -0.08      0.36    1.16      0.18
      10    1.12      0.43    0.69      0.20

The estimates are heavily biased: every estimate of the slope is above 0.5 and every estimate of the intercept is below 1.5.
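A minimal sketch of how one such replication can be generated in Stata. The notes do not specify the distributions used for U and the disturbances, so standard normal draws are assumed here; the data are built directly from the reduced forms derived above:

clear
set obs 20
set seed 12345
gen U  = invnorm(uniform())       // assumed standard normal; the notes do not specify this
gen up = invnorm(uniform())
gen uw = invnorm(uniform())
* reduced form for w, then the structural equation for p
gen w = (2.5 + 0.5*1.5 + 0.4*U + 0.5*up + uw)/(1 - 0.5*0.5)
gen p = 1.5 + 0.5*w + up
reg p w                           // the OLS slope is biased away from the true value 0.5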
6.2 Instrumental variables estimation
As we saw last week,
b_1^{2SLS} = \frac{cov(Z, Y)}{cov(Z, X)} = \frac{cov(Z, p)}{cov(Z, w)}
The reduced forms show that w is correlated with U, but since U is exogenous, we know that U is not correlated with u_p. Thus, we can use U as an instrument for w.
b_1^{IV} = \frac{cov(U, \alpha_0 + \alpha_1 w + u_p)}{cov(U, w)} = \frac{cov(U, \alpha_0) + \alpha_1 cov(U, w) + cov(U, u_p)}{cov(U, w)} = \alpha_1 + \frac{cov(U, u_p)}{cov(U, w)}
. reg w U

      Source |       SS       df       MS          Number of obs =      20
-------------+----------------------------         F(  1,    18) =   20.91
       Model |  8.18671935     1  8.18671935       Prob > F      =  0.0002
    Residual |  7.04603377    18   .39144632       R-squared     =  0.5374
-------------+----------------------------         Adj R-squared =  0.5117
       Total |  15.2327531    19  .801723848       Root MSE      =  .62566

           w |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+--------------------------------------------------------------
           U |  -.4438172   .0970477    -4.57   0.000    -.6477069   -.2399275
       _cons |   3.911334   .4470388     8.75   0.000     2.972141    4.850528

. predict what if e(sample)
(option xb assumed; fitted values)
(80 missing values generated)
. reg p what

      Source |       SS       df       MS          Number of obs =      20
-------------+----------------------------         F(  1,    18) =    0.20
       Model |   .31919131     1   .31919131       Prob > F      =  0.6570
    Residual |  28.1841334    18  1.56578519       R-squared     =  0.0112
-------------+----------------------------         Adj R-squared = -0.0437
       Total |  28.5033247    19  1.50017498       Root MSE      =  1.2513

           p |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+--------------------------------------------------------------
        what |   .1974561   .4373319     0.45   0.657    -.7213441    1.116256
       _cons |   2.050352   .9056883     2.26   0.036     .1475713    3.953132
In most cases, it can be argued that some variables are endogenous (or measured with error), but is this problem sufficient to warrant the use of IV? If we suspect simultaneity bias, OLS will be inconsistent and IV is to be preferred. However, if there is no endogeneity, both OLS and IV will be consistent, but OLS will be more efficient. We can test whether IV is needed by using a Durbin-Wu-Hausman test. The Hausman test asks whether the estimates obtained with IV are statistically different from those obtained with OLS. The difference in the estimates is distributed as a chi-squared with k degrees of freedom, where k is the number of instrumented variables.
H_0: \beta_{IV} = \beta_{OLS} \quad vs. \quad \beta_{IV} \neq \beta_{OLS}
If \chi^2(k) < \chi^2_c(k), accept the null: the coefficients are not statistically different, so use OLS, which is more efficient. Otherwise, reject the null and use IV.
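An equivalent, regression-based version of this test (a control-function or augmented-regression form, sketched here for the simulated example above) adds the first-stage residual to the structural equation and tests its significance:

reg w U                  // first stage
predict vhat, resid
reg p w vhat             // structural equation augmented with the first-stage residual
* a significant t-statistic on vhat is evidence that w is endogenous and that IV is needed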
. ivreg p (w=U)

Instrumental variables (2SLS) regression

      Source |       SS       df       MS          Number of obs =      20
-------------+----------------------------         F(  1,    18) =    0.25
       Model |  5.60960442     1  5.60960442       Prob > F      =  0.6225
    Residual |  22.8937202    18  1.27187335       R-squared     =  0.1968
-------------+----------------------------         Adj R-squared =  0.1522
       Total |  28.5033247    19  1.50017498       Root MSE      =  1.1278

           p |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+--------------------------------------------------------------
           w |   .1974561   .3941549     0.50   0.622    -.6306327    1.025545
       _cons |   2.050352   .8162714     2.51   0.022     .3354291    3.765274
Instrumented:  w
Instruments:   U

. hausman, save
Note that the ivreg command in Stata computes the two-stage least squares estimates and corrects the standard errors.
. reg p w

      Source |       SS       df       MS          Number of obs =      20
-------------+----------------------------         F(  1,    18) =   23.70
       Model |  16.1992808     1  16.1992808       Prob > F      =  0.0001
    Residual |  12.3040439    18  .683557993       R-squared     =  0.5683
-------------+----------------------------         Adj R-squared =  0.5443
       Total |  28.5033247    19  1.50017498       Root MSE      =  .82678

           p |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+--------------------------------------------------------------
           w |   1.031237   .2118354     4.87   0.000     .5861878    1.476287
       _cons |   .4081077   .4563612     0.89   0.383    -.5506717    1.366887

. hausman, constant sigmamore
                  ---- Coefficients ----
             |      (b)          (B)          (b-B)     sqrt(diag(V_b-V_B))
             |     Prior       Current      Difference         S.E.
-------------+--------------------------------------------------------------
           w |   .1974561     1.031237      -.8337813       .1965241
       _cons |   2.050352     .4081077       1.642244       .3870806
-------------+--------------------------------------------------------------
           b = less efficient estimates obtained previously from ivreg
           B = fully efficient estimates obtained from regress

Test:  Ho:  difference in coefficients not systematic

        chi2(  1) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                  =    18.00
        Prob>chi2 =    0.0000
The chi-squared critical value with 1 degree of freedom is 10.8 at the 0.1% level, so at all conventional levels of statistical significance we can reject the null hypothesis that the OLS and IV estimates are the same; therefore the model should be estimated by IV.
Chapter 7 Experiments and quasi experiments
Experiments are fairly rare in social sciences; however, there are three reasons to study them: 1) they provide a benchmark against which to judge estimates of causal effects in practice, 2) they help us understand the limits of, and threats to, the validity of experiments, and 3) they help us understand quasi-experiments (natural experiments). This is the part of econometrics called programme evaluation, which is concerned with estimating the effect of a programme (policy intervention) on the treated population.
7.1 Idealised experiments and causal effects
A randomised controlled experiment randomly selects subjects from a population of interest then
randomly assigns them either to the treatment group or control group – the treatment is assigned
independently of any of the determinants of the outcome (thus no omitted variable bias). The
causal effect of the treatment is the difference in the mean outcome of the two groups.
Say X_i indicates the level of treatment received by individual i; this could be a binary variable (treated X=1, control X=0) or continuous (indicating the intensity of the treatment). If X_i is randomly assigned, then X_i is distributed independently of the omitted factors u_i. Random assignment of the treatment means that E(u_i | X_i) = 0 holds automatically.
Hence,
E(Y_i \mid X_i) = \beta_0 + \beta_1 X_i \qquad (7.1)
The causal effect of the treatment on Y is then simply the difference in the conditional expectations:
ATE = E(Y \mid X = x) - E(Y \mid X = 0)
Because of random assignment, \beta_1 measures the causal effect of a unit change in X, and is the treatment effect on Y.
The causal effect can be estimated by the difference in the mean outcomes of the two groups or, equivalently, by \hat{\beta}_1. This is known as the difference estimator. By randomly assigning treatment, an ideal randomised controlled experiment eliminates correlation between the treatment X_i and the error u_i, so that the difference estimator is an unbiased and consistent estimate of the causal effect of the treatment on Y.
. gen y=1.5+2*treat+u       /* true model */
Compmean y, group(treat)

Two-sample t test with equal variances
------------------------------------------------------------------------------
         |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |      54    1.550462    .1313592     .965289    1.286988    1.813935
       y |      46    3.423776    .1564858    1.061338    3.108598    3.738955
---------+--------------------------------------------------------------------
combined |     100    2.412186     .137527     1.37527    2.139303     2.68507
---------+--------------------------------------------------------------------
    diff |           -1.873315    .2027553               -2.275676   -1.470953
------------------------------------------------------------------------------
Degrees of freedom: 98

   Ho: mean(x) - mean(y) = diff = 0

     Ha: diff < 0               Ha: diff ~= 0              Ha: diff > 0
       t =  -9.2393               t =  -9.2393               t =  -9.2393
   P < t =   0.0000          P > |t| =   0.0000           P > t =  1.0000
. reg y treat

      Source |       SS       df       MS          Number of obs =     100
-------------+----------------------------         F(  1,    98) =   85.36
       Model |  87.1712191     1  87.1712191       Prob > F      =  0.0000
    Residual |  100.074215    98  1.02116546       R-squared     =  0.4655
-------------+----------------------------         Adj R-squared =  0.4601
       Total |  187.245434    99  1.89136802       Root MSE      =  1.0105

           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+--------------------------------------------------------------
       treat |   1.873315   .2027553     9.24   0.000     1.470953    2.275676
       _cons |   1.550462   .1375154    11.27   0.000     1.277567    1.823356
7.2 Potential problems with experiments in practice
*Threats to internal validity
- Failure to randomise
Failure to randomise the assignment to the control and treated groups means that assignment is correlated with some of the individuals' characteristics and therefore with the error term. This leads to biased estimates of the causal effect, since they capture both the treatment and selection effects.
- Failure to follow treatment protocol
People do not always follow what they are told. For example, some of the individuals assigned to the treatment may refuse to take it; similarly, some of the control individuals may manage to get some treatment. Because there is an element of choice in whether the subject receives the treatment, X_i is going to be correlated with u_i. Alternatively, some individuals may forget to take the treatment, which is equivalent to a measurement error bias. In all cases, partial compliance will lead to biased estimates of the causal effect of the treatment.
- Attrition
This refers to subjects dropping out of the study after being allocated to a group. If attrition occurs for reasons correlated with the assignment to the programme, the treatment is correlated with u_i.
- Experimental effects
The mere fact that subjects are in an experiment can change their behaviour (the Hawthorne effect). For example, individuals in the experiment may increase their effort. To alleviate this problem it is sometimes possible to use a double-blind protocol where the subjects do not know which group they have been allocated to (usually not feasible in social science). Example: teachers and small classes.
- Small samples
Because experiments are costly, sample sizes can be quite small. The estimates may then be imprecise.
*Threats to external validity
These compromise the ability to extrapolate the results of the study to other populations.
- Non-representative sample
The population studied and the population of interest must be sufficiently similar.
- Non-representative programme
The programme studied must be sufficiently similar to the policy implemented; differences in funding and duration can also be important.
- General equilibrium effects
Turning a small-scale, short-duration programme into a widespread permanent programme might change the economic environment sufficiently that the results of the experiment cannot be generalised. Reducing class size will affect the demand for teachers and potentially attract individuals of lower quality into teaching, thus reducing the effect of the programme.
- Treatment vs. eligibility effects
When the programme becomes a policy, participation is no longer random but voluntary. Thus the experiment will not provide an unbiased estimator of the effect of the programme as implemented.
7.3 Regression estimators of causal effects
7.3.1 Difference estimator with additional regressors
Characteristics that are relevant to determining the experimental outcome can be added. For example, in a drug trial, age, gender and pre-existing medical conditions are important determinants of the outcome of the experiment. The model estimated becomes:
Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_i + u_i \qquad (7.2)
It is important that none of the components of W_i are outcomes of the experiment, otherwise W_i is endogenous. So W_i has to consist of pre-treatment characteristics.
Including additional regressors means that:
- b_1 is more efficient (smaller variance), since the variance of the error term is reduced by the inclusion of the additional variables.
- If the treatment is assigned in a way related to W, then the difference estimator (7.1) is inconsistent and differs from (7.2). Thus a large discrepancy between the two OLS estimates suggests that X_i was not randomly allocated.
- The assignment may depend on pre-treatment characteristics; including these characteristics controls for the probability that the participant is assigned to the treatment.
7.3.2 Differences-in-differences estimator
If data are available pre- and post-treatment, then a differences-in-differences estimator can be computed. The estimator of the treatment effect is then the difference in the change in Y between the treated and control groups:
b_1^{dd} = \left[E(Y^1 \mid X = 1) - E(Y^0 \mid X = 1)\right] - \left[E(Y^1 \mid X = 0) - E(Y^0 \mid X = 0)\right]
If the treatment is randomly assigned, then b^{dd} is an unbiased and consistent estimator of the causal effect.
- Diff-in-diff is more efficient than the difference estimator if some unobserved determinants of Y_i are persistent over time for a given individual.
- The diff-in-diff estimator eliminates pre-treatment differences in Y. If the treatment is correlated with the initial level of Y_i, then the difference estimator is biased.
[Figure: outcome Y for the treated (Y0/x=1) and control (Y0/x=0) groups at times 0 and 1, illustrating the difference and difference-in-differences estimators]
\Delta Y = E(Y^1 \mid x = 1) - E(Y^1 \mid x = 0) = 80 - 30 = 50
\Delta\Delta Y = \left[E(Y^1 \mid x = 1) - E(Y^0 \mid x = 1)\right] - \left[E(Y^1 \mid x = 0) - E(Y^0 \mid x = 0)\right] = (80 - 40) - (30 - 20) = 30
The diff-in-diff estimate removes the influence of initial values of Y that vary systematically between the treatment and control groups.
Depending on the way your data are organised, the differences-in-differences estimate can be obtained as follows.
If each line contains the outcome both before and after, then the model takes the following form for the change in outcome:
\Delta y_i = \beta_0 + \beta_1 treat_i + u_i
. reg dy treat

      Source |       SS       df       MS          Number of obs =     100
-------------+----------------------------         F(  1,    98) =   34.84
       Model |   81.078286     1   81.078286       Prob > F      =  0.0000
    Residual |  228.058225    98  2.32712474       R-squared     =  0.2623
-------------+----------------------------         Adj R-squared =  0.2547
       Total |   309.13651    99  3.12259101       Root MSE      =  1.5255

          dy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+--------------------------------------------------------------
       treat |    1.80666   .3060794     5.90   0.000     1.199256    2.414065
       _cons |  -.4559498   .2075931    -2.20   0.030    -.8679116    -.043988
If each observation is instead reported on two different lines (one per period), then you need to create an interaction term between the treatment group and the time period. The coefficient on this interaction is your differences-in-differences estimate:
y = \beta_0 + \beta_1 dt + \beta_2 treat + \beta_3 year + u
where treat = 1 if the individual belongs to the treated group (in both periods), year = 1 in the post-treatment period, and dt = treat*year is therefore 1 only for treated individuals in the post-treatment period.
. reg y dt treat year

      Source |       SS       df       MS          Number of obs =     200
-------------+----------------------------         F(  3,   196) =   31.84
       Model |  94.3171008     3  31.4390336       Prob > F      =  0.0000
    Residual |  193.507926   196  .987285338       R-squared     =  0.3277
-------------+----------------------------         Adj R-squared =  0.3174
       Total |  287.825027   199  1.44635692       Root MSE      =  .99362

           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+--------------------------------------------------------------
          dt |    1.80666   .2819425     6.41   0.000      1.25063    2.362691
       treat |   .0666546   .1993635     0.33   0.738    -.3265183    .4598275
        year |  -.4559498   .1912227    -2.38   0.018     -.833068   -.0788316
       _cons |   2.006411   .1352149    14.84   0.000     1.739749    2.273074
The coefficient on treat measures the effect of the fixed unobserved characteristics of the treated group on the outcome (because of random assignment, this is not significantly different from 0). The coefficient on year measures the effect of time, which is identical for the two groups.
[Figure: mean outcome Y for the treated and control groups in periods 0 and 1, decomposing the change over time into the common time effect and the treatment effect]
If you suspect that the causal effect may vary by subgroups (Z), then you can interact X and Z to
estimate the effect of the treatment for the group for which Z=0, and for which Z=1 (assuming Z
is binary).
. reg dy treat w0 w0t

      Source |       SS       df       MS          Number of obs =     100
-------------+----------------------------         F(  3,    96) =   39.37
       Model |  170.535057     3  56.8450192       Prob > F      =  0.0000
    Residual |  138.601453    96  1.44376514       R-squared     =  0.5516
-------------+----------------------------         Adj R-squared =  0.5376
       Total |   309.13651    99  3.12259101       Root MSE      =  1.2016

          dy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+--------------------------------------------------------------
       treat |   1.961061   .2661259     7.37   0.000     1.432805    2.489316
          w0 |  -2.752882   .4867962    -5.66   0.000    -3.719165   -1.786599
         w0t |   .4011066   .6491931     0.62   0.538    -.8875315    1.689745
       _cons |  -.0990947   .1752667    -0.57   0.573    -.4469963    .2488069
Individuals for whom w0=1 have a significantly lower change in outcome, but treated individuals from this subgroup benefit somewhat more from the treatment (the interaction coefficient of .40 is insignificant).
Testing for randomisation
If the treatment is randomly assigned, then X_i will not be correlated with the observable characteristics of the individuals. This can be tested by regressing X_i (the treatment) on the observables and conducting an F-test.
H_0: \beta = 0 \quad vs. \quad \beta \neq 0
. reg treat w0

      Source |       SS       df       MS          Number of obs =     100
-------------+----------------------------         F(  1,    98) =    1.35
       Model |  .336810773     1  .336810773       Prob > F      =  0.2486
    Residual |  24.5031892    98  .250032543       R-squared     =  0.0136
-------------+----------------------------         Adj R-squared =  0.0035
       Total |       24.84    99  .250909091       Root MSE      =  .50003

       treat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+--------------------------------------------------------------
          w0 |   .1545004   .1331174     1.16   0.249    -.1096667    .4186675
       _cons |   .4337349   .0548857     7.90   0.000     .3248161    .5426538

( 1)  w0 = 0.0
      F(  1,    98) =    1.35
           Prob > F =  0.2486
7.4 Quasi-experiments
In a quasi- (or natural) experiment, randomness is introduced by variations in individual circumstances that make it appear as if the treatment were randomly assigned. These variations in individual circumstances might arise because of legal institutions, location, timing of policies, natural events or other factors that are unrelated to the causal effect under study.
If the quasi-experiment determines treatment, then the analysis can be conducted using the previous econometric methods as if it were a controlled experiment (example of the Mariel boatlift).
If the quasi-experiment influences but does not completely determine the treatment, then the natural experiment can be used as an instrument for X (example of a change in the school leaving age).
Since a quasi-experiment only approximates random assignment of the treatment, it is important to control for pre-treatment characteristics.
Natural experiments are often estimated using pooled cross sections. Say a policy was introduced in region A at period 1, but not in region B. We have data from two cross sections, one before the policy was introduced, the other after the policy was introduced in region A. We define the following variables: survey = 1 if the observation is taken from the second survey and 0 otherwise; region = 1 if the observation is from region A (whatever survey it comes from); and treat = region*survey, which is therefore 1 for individuals in region A after the policy was introduced. b_1 measures the causal effect of the policy on Y:
y = \beta_0 + \beta_1 treat + \beta_2 survey + \beta_3 region + u
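A minimal sketch of this estimator in Stata (region and survey are the dummies defined above; the dataset itself is hypothetical):

gen treat = region*survey
reg y treat survey region
* the coefficient on treat is the differences-in-differences estimate of the policy effect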
* Threats to internal validity
Quasi-experiments rely on the treatment being as good as randomly assigned; this can be tested by regressing Z on W and testing that all the coefficients are insignificant.
When the population is heterogeneous, the estimated effect is the average causal effect. When the experiment is used as an instrument, the estimate is a weighted average of the causal effects, where those most affected by the instrument receive higher weights.
Suppose now that Z has no influence on the treatment decision for a fraction p of the population and a nonzero influence on the treatment decision for the remaining fraction 1-p. The 2SLS estimator is a consistent estimator of the average treatment effect for the part of the population for which the instrument influences the treatment decision (see the discussion of LATE).
Topic IX:
Limited Dependent Variable Models
9.1
Overview
9.1.1 Motivation
Familiarity with microeconometrics techniques has been driven by a large number of factors and
influences. In no particular order…
Microdata availability – Household/individual/ firm level data increasingly available.
Increased computer power.
Avoiding aggregation bias – analysis based on individual data is persuasive.
Explaining behavioural differences across agents.
Distributional issues.
No-one has developed a 'law' in economics that holds with the fit and stability of the laws of, say, physics. Nor is there any agreement on the 'best' way of doing things in economics. For much of the development of econometrics as a discipline since the 1920s, the analysis concentrated on macro analysis using time series data. Analysis of individual agent data in pure micro/tax/poverty/labour studies etc. was far less common, and its growth has led to huge developments in the theory relating to these areas.
The economics journals are full now of applied studies in these fields and the econometric theory
has expanded to deal with the developments and needs of the research community.
It can be
readily seen why – the approach fits more closely with the accepted norms of the theory
underlying the empirical research. Think of aggregate statistics on poverty – per capita income
for example. The first thing we teach people taking courses in elementary statistics is the
unreliable nature of using data on per-capita income to infer anything about the distributional
nature of income in an economy.
There is no way of exploring for example the effects of a
policy change if all we examine is the aggregate poverty measure. In microeconometrics the unit
of analysis is the individual economic agent.
Instead of dealing with the notion of a
‘representative individual’ who conforms to some economic theory in their behaviour we can
look at the actual behaviour of the individual, avoiding the aggregation bias that has befallen
macro econometric analysis for some time.
Hopefully the microeconometrician may have some insight into how the data used was collected.
They may be able to understand and deal with the effects of measurement errors.
Macroeconomic data usually are constructed by government bodies or statistics agencies – they
may not make public the methods used to generate the data.
9.1.2 Problems
Endogenous selection, censoring and individual heterogeneity. Individual behaviour may be very random, and even the most successful method of estimation rarely 'fits' more than 30% of the variability in the sample. This may offset the beneficial effects of the very large samples and the disaggregated approach. The way we incorporate the random disturbances into the model can therefore be crucial, and this topic has formed a large part of the theoretical advances in the literature. We also make much of the claim to be 'explaining individual behaviour', but we cannot deal with altruism, imitation etc. outside of the crude use of dummy variables or some other device.
Robustness to considerations other than the economic ones. As mentioned in the last point, this relates to 'incidentals': for example, the normality assumed for the error terms in a regression is not central to the economic hypothesis being tested, but may be critical to the estimation of the parameters of the model. Assumptions about the stochastic nature of the model are hugely important in this literature because of the large number of non-linearities forced upon the structure by problems such as endogenous selection.
9.1.3 New Developments
Semi- and non-parametric methods.
Difference-in-differences.
Quasi/natural experiments.
Instrumental variables.
Developments in the use of panel data.
9.2 Binary Choice Models
9.2.1 Economic representations under discrete choice
Indivisible goods
Consider where one or more goods in the budget is indivisible and only available in multiples of
some basic unit. This has a big effect where the units are large and costly, in the case of durable
goods.
Recently in the literature on the economics of fertility a similar form of discreteness is
developed. Although children are not a tradable good there is an implicit price in terms of effort
and time – the demand for children is the result of a conscious choice which has an economic
dimension to it.
Discrete qualities of goods
Goods can be heterogeneous – think of these products as comprising bundles of characteristics.
These characteristics enter the utility function. The field that has received the most attention is
mode of transport choice where the competing products are alternative modes of transport for a
particular journey. The characteristics are usually in-vehicle and out-of-vehicle travel time, with
these differing between individuals. By having considered the effect of certain characteristics on
mode choice you can predict the demand of alternatives.
Other models have looked at the
demand for certain electrical products where operating cost is a major factor, choice of housing
location and other models of spatial choice.
Discrete choice as an approximation of continuous choice
The literature does show examples of where some continuous choice issue (like labour supply)
has become over-complicated by the presence of ‘kinks’ in the schedule due to tax, welfare and
legal constraints.
Choice without markets
Applications in the field of public policy formation have looked at voting behaviour in the US Congress, choice of motorway routes and choice of income policy in a macro context.
9.2.2 Discrete choices and econometrics
Many of the decisions we make are discrete, involving a well defined number of choices. Discrete choice models have a long history in applied economics and other social sciences. They have become part of the mainstream toolkit since the 70's and the work of McFadden, following initial work by Quandt and Tobin. Mainstream consumer theory needed to consider issues like the ownership of durables and consumer choices under restrictions in the budget caused by tax and welfare system non-linearities. Firm-level interest was spurred by hiring and firing decisions, plant location and product choice. The textbook presentation of consumer choice shows neat smooth choice paths with substitution and a continuum of choices; the real world presents few choices, with limited substitution.
In traditional regression analysis the mean of the data was the focus of the technique – here this is
of little interest (the mean of a dummy variable is simply the frequency of occurrence). Instead
we are concerned with the probabilities of particular outcomes – as such we need to make some
assumption on the probability distribution of the model to associate a unique probability with any
value taken in the function.
As we know distribution functions are non-linear – the normal
distribution underlies the Probit model and the logistic distribution underlies the Logit model.
A more general class exists whereby the outcome can be multiple – occupational choice or the
choice of transport method would be one example where some multinomial variant is needed,
usually of the logit model. This variant makes strong assumptions on the independence across
possible outcomes, the assumption of independence of irrelevant alternatives (IIA).
When the number of choices extends to more than three or four outcomes, the econometrics becomes intractable. When the choices represent some implicit ordering (Leaving Cert./PLC/Diploma/Degree etc.), an ordered variant is needed. This often reflects the 'grouped' way in which the data are collected.
9.3 Econometric Modelling Framework
9.3.1 Set-up
In these models the dependent variable takes the value of one if a certain option is chosen by individual i or a certain event occurs, and zero otherwise. The probability of the event is usually assumed to be a function of explanatory variables, such as
P_i = \Pr[y_i = 1] = F(x_i'\beta), \qquad (i = 1, \ldots, N) \qquad (1)
where x_i is a k-element vector of exogenous explanatory variables. The conditional expectation of y given x is equal to F(x_i'\beta), which is the probability-weighted sum of the outcomes for y. We will ignore for the moment the possibility of x being determined endogenously. We require F to follow some rules: it must lie in the [0,1] interval and be increasing in x_i'\beta for the connection to a probability statement to make sense. As such, we can specify F as a cumulative distribution function.
Recall for a moment what this means. In the top diagram we observe the
frequency distribution – so for any value of X we observe the frequency with which that value
occurs. The total area under the PDF equals 1. In the bottom diagram we observe the cumulative
frequency – for the value X what we now observe is the frequency that X occurs as well as any
value less than the specific X we choose. Thus the CDF tells us the probability of a value less
than or equal to X occurring.
In the probit model we use the cumulative normal whereas the
logit model uses the logistic distribution. The CDF’s are illustrated below.
[Figure: top panel, the probability density function f(X); bottom panel, the corresponding cumulative distribution function F(X)]
[Figure: Logit and Probit cumulative distributions – the cumulative normal and cumulative logistic plotted against the value of z]

       z     Cumulative Normal     Cumulative Logistic
    -3.0          0.00135                0.0474
    -2.0          0.02275                0.1192
    -1.5          0.066807               0.1824
    -1.0          0.158655               0.2689
    -0.5          0.308538               0.3775
     0.0          0.5                    0.5
     0.5          0.691462               0.6225
     1.0          0.841345               0.7311
     1.5          0.933193               0.8176
     2.0          0.97725                0.8808
     3.0          0.99865                0.9526
9.3.2 Linear Probability Model
We will for the moment examine a model that does not make these distributional assumptions: the linear probability model. We have already used dummy variables as explanatory variables; in this section we apply regression techniques to models where the dependent variable is binary. Suppose, for example, we wish to make predictions about how individuals vote in a referendum; income might influence the vote, so assume that higher-income individuals vote 'Yes'. These models effectively model the likelihood that an individual with a given income will vote 'Yes'. The simplest way to do this is to apply OLS techniques directly and interpret things in terms of probability. This is known as a linear probability model. Consider a simple model where we plan to model the probability of a person voting 'Yes' (the LHS variable Y) in a referendum on divorce as determined by income (X):
Y_i = \alpha + \beta X_i + \varepsilon_i \qquad (2)
In this framework we define Y = 1 if the person votes yes, and zero otherwise; X is a measure of income and \varepsilon is a random variable with mean zero. If we take the expected value of the function (akin to a statistical regression) we obtain
E(Y_i) = \alpha + \beta X_i \qquad (3)
We can model the outcomes in terms of a probability. This is easy, as Y can only take two values (0 or 1), so define the following:
P_i = \Pr(Y_i = 1), \qquad 1 - P_i = \Pr(Y_i = 0)
E(Y_i) = \sum Y_i \Pr(Y_i) = 1(P_i) + 0(1 - P_i) = P_i \qquad (4)
Thus E(Y_i) describes the probability that Y_i = 1, i.e. that the individual votes 'Yes' given information on their income. The slope of the regression line gives the effect on the probability of voting 'Yes' of a unit change in income. We can now say some more about the error terms in the model:
E(\varepsilon_i) = (1 - \alpha - \beta X_i)P_i + (-\alpha - \beta X_i)(1 - P_i) = 0 \qquad (5)
Substituting P_i = \alpha + \beta X_i, the variance of the error term can be expressed as
E(\varepsilon_i^2) = \sigma_i^2 = (1 - \alpha - \beta X_i)^2 P_i + (\alpha + \beta X_i)^2 (1 - P_i) = P_i(1 - P_i) = E(Y_i)\left[1 - E(Y_i)\right] \qquad (6)
This clearly shows that the error term is heteroscedastic. If P_i is close to 0 or close to 1 the variance will be relatively low, but anywhere in between it will be high. The solutions are as in any model with heteroscedasticity:
- Weighted Least Squares, but
- the predicted value of Y is not bounded in the [0,1] interval, so these procedures mean predictions greater than 1 or less than 0 are set to near 1 or near 0, and
- WLS is not efficient in finite samples.
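In practice a common alternative is simply to estimate the linear probability model by OLS with heteroskedasticity-robust standard errors. A minimal sketch in Stata with hypothetical variable names (vote equal to 1 for a 'Yes' vote, income the regressor):

reg vote income, robust      // OLS with heteroskedasticity-robust standard errors
predict phat                 // fitted 'probabilities'
summarize phat               // some fitted values may fall outside the [0,1] interval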
The problem of the predictions not being bounded in the [0,1] interval can be illustrated in the following diagram.
[Figure: fitted linear probability line crossing Y = 0 and Y = 1 as X varies]
One interesting example comes from examining defaults on bond payments. Default can be a problem, and it would be useful to consider a model that 'explains' some of the probability of these defaults occurring. In the US, local government authorities can issue bonds. A cross section of 35 Massachusetts communities was used in a linear probability model. The results were
Y = 1.96 - 0.029 Tax - 4.86 Int + 0.063 AV + 0.007 Dav - 0.48 Welf,
where Y = 0 if the municipality defaulted and 1 otherwise (so the probability of default decreases as we move from 0 to 1). Tax is the average tax rate, Int is the percentage of the budget allocated to interest payments, AV is the % growth in property values, Dav the ratio of debt to assessed valuation, and Welf the % of the budget allocated to charity, pensions etc. The coefficient on the tax rate is negative, so an increase in the tax rate of $1 per $1000 will raise the probability of default by 0.029. Higher budget shares for interest payments also appear to be associated with a higher default probability. Conversely, growth in the assessed valuation of property lowers the probability of default, as the tax base is growing. Finally, a higher ratio of debt to assessed valuation is also associated with less default, which seems counter-intuitive.
9.3.3 The Logit and Probit Models
Using a variable y* as an underlying response variable which is unobservable, we may write the relationship of interest as
y^* = x'\beta + u \qquad (7)
What we actually observe is the following:
y = 1 \text{ if } y^* > 0, \qquad y = 0 \text{ otherwise} \qquad (8)
In this formulation we no longer think of x'\beta as being the conditional expectation of y given the values of x; rather, it is the conditional expectation of y* given the values of x. Thus
\Pr(y = 1) = \Pr(u > -x'\beta) = 1 - F(-x'\beta) \qquad (9)
where F is the CDF of u. The observed values of y are realisations of a binomial process with the probabilities given in (9), varying from trial to trial. It is natural to consider Maximum Likelihood as an estimation method, since this maximises the joint distribution of the outcomes (the y's) conditional on x as a function of the parameters. The likelihood function can be written as
L = \prod_{y=0} F(-x'\beta)\; \prod_{y=1} \left[1 - F(-x'\beta)\right] \qquad (10)
We specify the functional form of F in (10) on the assumption that u in (7) is distributed logistically or normally: taking F to be the standard normal distribution function results in the Probit model, while taking F to be the logistic distribution gives the Logit model.
9.4 Binary Response Models
We believe that the binary decision is influenced by a set of factors gathered in a vector x, so that
\Pr(Y = 1) = F(\beta'x), \qquad \Pr(Y = 0) = 1 - F(\beta'x)
The belief is that the parameters in \beta reflect the impact of changes in x on the probability.
F(x, \beta) = \beta'x \quad \text{-- Linear Probability}
F(x, \beta) = \Phi(\beta'x) = \int_{-\infty}^{\beta'x} \phi(t)\,dt \quad \text{-- Probit}
F(x, \beta) = \Lambda(\beta'x) = \frac{e^{\beta'x}}{1 + e^{\beta'x}} \quad \text{-- Logit}
In principle any continuous distribution over the real line will suffice, as long as the probability tends to one (zero) in the limit as the index tends to infinity (minus infinity). In the probit model we use the normal distribution (whose CDF is denoted by \Phi), while in the logit model we use the logistic (with CDF denoted by \Lambda). Which do we use? In the graph below we observe how the logistic has heavier tails than the normal distribution over the same range. For intermediate values of z (= \beta'x), such as between -1 and +1, the probabilities are similar. The probability of y = 0 is higher in the logistic when \beta'x is small, and lower when \beta'x is very large. In principle we expect few differences in the predictions of the models; if our sample has very few ones (or zeros), or has one variable exerting a very large influence, we might expect differences.
[Figure: Logit and Probit cumulative distributions plotted against the value of z, as above]
The probability model is a regression, i.e. E(Y) = 0[1 - F(\beta'x)] + 1[F(\beta'x)] = F(\beta'x). Whatever distribution is used, the parameters of the model are not the 'slopes' or marginal effects that we are used to considering in regression models. More generally, we see that
\frac{\partial E(Y)}{\partial x} = \left[\frac{dF(\beta'x)}{d(\beta'x)}\right]\beta = f(\beta'x)\,\beta
where f(.) is the PDF that corresponds to the cumulative distribution F(.). In the case of the normal distribution we therefore observe that
\frac{\partial E[y]}{\partial x} = \phi(\beta'x)\,\beta.
In the case of the logit model we obtain
\frac{d\Lambda(\beta'x)}{d(\beta'x)} = \frac{e^{\beta'x}}{(1 + e^{\beta'x})^2} = \Lambda(\beta'x)\left(1 - \Lambda(\beta'x)\right)
so
\frac{\partial E[y]}{\partial x} = \Lambda(\beta'x)\left(1 - \Lambda(\beta'x)\right)\beta
Thus we see that the marginal effects vary with the values of x; standard practice is to evaluate them at the mean values of the regressors. Consider the following regression: the dependent variable is an indicator of whether grades improve after exposure to PSI, a new teaching practice in economics. GPA is the grade point average, TUCE the score from a pre-test of knowledge, and PSI a dummy variable equal to one if the student was exposed to PSI. Estimation takes place by maximum likelihood using the logistic and normal distributions. Note that comparing the coefficients gives very different results – clearly, therefore, the procedures are different. However, look at the computed slopes – they are practically identical.
Variable        Logit coef.   Logit slope   Probit coef.   Probit slope
Constant          -13.021          --           -7.452          --
GPA                 2.826        0.534           1.626         0.533
TUCE                0.095        0.018           0.052         0.017
PSI                 2.379        0.449           1.426         0.468
f(beta'x)                        0.189                         0.328
Using the coefficients, consider the following. We are interested in the impact of PSI on grades. We can compute the following at the sample means of the x variables for the probit specification:
PSI = 0: \Pr(GRADE = 1) = \Phi(-7.45 + 1.62\,GPA + 0.052(21.938))
PSI = 1: \Pr(GRADE = 1) = \Phi(-7.45 + 1.62\,GPA + 0.052(21.938) + 1.426)
[Figure: Pr(Grade = 1) plotted against GPA, with and without PSI]
The probability that a student’s grade increases after exposure to PSI is far greater for students
with high GPA’s than low GPA’s. At the sample mean GPA of 3.117 the effect of PSI is 0.465,
very close to the marginal effect reported in the table.
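A minimal sketch of how such estimates and marginal effects can be obtained in Stata, assuming a dataset with variables named grade, gpa, tuce and psi as in the table above (mfx evaluates the marginal effects at the means of the regressors; newer versions of Stata use the margins command instead):

probit grade gpa tuce psi
mfx compute               // marginal effects at the sample means
logit grade gpa tuce psi
mfx compute               // compare: the slopes should be very similar to the probit ones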
9.5 Multinomial Models
9.5.1 Unordered Choices
The i-th consumer is faced with J choices, with the utility of choice j represented by
U_{ij} = \beta'z_{ij} + \varepsilon_{ij}
Choice j is made if \Pr(U_{ij} > U_{ik}) \; \forall k \neq j.
As we would need to evaluate multiple integrals, the probit is very complicated to use in this context; however, the logit model can be extended. If Y represents the choice made, and if the J disturbances are independent and identically distributed, then
\Pr(Y_i = j) = \frac{\exp(\beta_j'x_i)}{1 + \exp(\beta_1'x_i) + \exp(\beta_2'x_i)}, \quad j = 1, 2
\Pr(Y_i = j) = \frac{\exp(\beta_j'x_i)}{1 + \sum_{k=1}^{J}\exp(\beta_k'x_i)}, \quad \text{generally}
The binomial logit is the special case where J = 1. The model implies that we can compute the log-odds ratio for the J choices as
\log\left[\frac{\Pr(Y_i = j)}{\Pr(Y_i = 0)}\right] = \beta_j'x_i.
By assumption, the odds ratios for each choice are independent of the other alternatives: the independence of irrelevant alternatives (IIA) assumption.
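A minimal sketch in Stata with hypothetical variable names (mode an unordered choice coded 0, 1, ..., J; x1 and x2 characteristics of the individual):

mlogit mode x1 x2
* coefficients are reported relative to a base outcome; each equation gives the
* log-odds of that alternative against the base category (the IIA assumption is maintained)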
9.5.2 Ordered Choices
Consider bond ratings, taste tests, opinion surveys, education choices etc. All of these are inherently ordered. A multinomial logit would not take the implicit ordering into account in any direct manner, while treating the ordering as a linear regression assumes that the move from one position to another is identical in importance (overstating the ordinal property). Ordered probit/logit models are useful in this regard and are built around the latent regression format of the binomial model:
y^* = \beta'x + \varepsilon
As y* is unobserved, we actually observe
y = 0 \text{ if } y^* \leq 0
y = 1 \text{ if } 0 < y^* \leq \mu_1
y = 2 \text{ if } \mu_1 < y^* \leq \mu_2
\ldots
y = J \text{ if } \mu_{J-1} \leq y^*
Thus
\Pr(y = 0) = \Phi(-\beta'x)
\Pr(y = 1) = \Phi(\mu_1 - \beta'x) - \Phi(-\beta'x)
\Pr(y = 2) = \Phi(\mu_2 - \beta'x) - \Phi(\mu_1 - \beta'x)
\ldots
\Pr(y = J) = 1 - \Phi(\mu_{J-1} - \beta'x)
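A minimal sketch in Stata with hypothetical variable names (rating an ordered outcome coded 0, 1, ..., J):

oprobit rating x1 x2     // ordered probit
ologit  rating x1 x2     // ordered logit
* the reported cut points correspond to the thresholds mu_1, ..., mu_(J-1) above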
9.6 Censored Data and the Tobit Model
Censored data occur with great regularity in applied economics. If we have a sample that includes non-workers, then hours of work will be censored: values in a certain range are all transformed to (or reported as) a single value. This is very common – household purchases of durables, extramarital affairs, female hours worked, arrests after prison release, expenditure on certain commodity groups – all have been studied in the context of variants of this literature, where the dependent variable is typically zero for a fraction of the observations, corresponding to non-workers, non-purchasers or some other group. Although censoring is common in each of these examples, the way in which we deal with it is not always the same.
Typically we begin by specifying a latent regression model which describes the behaviour in the absence of censoring. So we may describe the latent dependent variable y* according to
y_i^* = x_i'\beta + u_i, \qquad u_i \sim N(0, \sigma^2)
where x is a k-vector of exogenous explanatory variables and \beta is the parameter vector.
The different types of censoring models can be distinguished by the observability rule on y*. The simplest is the Tobit rule:
y_i = y_i^* \text{ if } y_i^* > 0, \qquad y_i = 0 \text{ if } y_i^* \leq 0
In this case simply y* is observed when x’ + u exceeds zero.
If this was a model of desired
consumption then individual i is observed to buy when desired consumption of the good is
positive.
In micro terms this is a corner solution (although this is not necessarily the case).
Estimation of β must account for the censored nature of the data. OLS regression would be biased towards zero for the slope parameters, and would understate the impact of elements of the x vector: in essence the regression line is being pulled towards the horizontal axis. Eliminating the zero values and estimating on the remaining truncated sample will not get rid of this bias. If we order the data so that the first N1 observations refer to individuals for whom y is positive, the OLS estimator gives

β̂ = β + (Σ_{i=1}^{N1} x_i x_i')^(−1) Σ_{i=1}^{N1} x_i u_i

Thus the bias in OLS (assuming x is exogenous) is given by

E(β̂) = β + (Σ_{i=1}^{N1} x_i x_i')^(−1) Σ_{i=1}^{N1} x_i E(u_i | y_i > 0)

If we evaluate this further we observe that

E(u_i | y_i > 0) = E(u_i | u_i > −x_i'β)
                 = σ E(ν_i | ν_i > −x_i'β/σ)   where ν_i = u_i/σ
                 = σ [ φ(z_i) / (1 − Φ(z_i)) ]

where z_i = −x_i'β/σ and φ and Φ are the standard normal density and distribution functions respectively. This ratio is the mean of the standard normal truncated from below at z. It is often termed the hazard or inverse Mill's ratio, since it measures the conditional expectation for an individual i with characteristics x who remains in the sample after truncation. Typically the IMR is referred to as λ, so E(u_i | y_i > 0) = σλ_i.
The OLS bias therefore depends on the extent to which λ differs from zero. In order to avoid these biases, direct use of OLS has been dropped in favour of a complicated but tractable maximum likelihood procedure.
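In Stata the Tobit MLE is obtained with the tobit command; a minimal sketch with hypothetical variable names, assuming censoring from below at zero:

* hours is 0 for non-workers and positive otherwise; ll(0) declares left-censoring at zero
tobit hours educ age kids, ll(0)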
Chapter 7: Introduction to time series

So far we have been modelling cross sections, where individuals were observed only once. Time series analysis is the analysis of an economic series over time. It allows us to answer questions such as: what is the causal effect on a variable Y of a change in variable X over time, i.e. what is the dynamic causal effect on Y of a change in X? It also allows you to forecast the value of some variable at a future date. Forecasting models do not need to be causal: seeing individuals carrying umbrellas makes you forecast rain, but umbrellas do not cause rain. Regression models can produce reliable forecasts even if their coefficients have no causal interpretation.
An important difference between time series and cross sections is that in a time series the ordering of the observations matters.
Definition: A sequence of random variables indexed by time is called a stochastic process (stochastic means random), or a time series for mere mortals. A data set is one possible outcome (realisation) of the stochastic process; if history had been different, we would observe a different outcome, so we can think of a time series as the outcome of a random variable.
7.1 Introduction to time series and serial correlation

[Figure: quarterly U.S. inflation rate and unemployment rate (LHUR), 1959:1 to 1999. Consumer Price Index and unemployment from 1960-1999.]
Rather than dealing with individual observations, the unit of interest is time: the value of Y at date t is Yt. The unit of time can be anything from 1 second to 1 year.
The value of Y in the previous period is called the first lag: Yt−1. The jth lag is denoted Yt−j. Similarly, Yt+1 is the value of Y in the next period.
The change in the value of Y between period t−1 and t is called the first difference:

ΔYt = Yt − Yt−1

Changes in the value of Y rather than Y itself are often used because economic series tend to have a trend, say an increase over time. If 2 series are trending, we could falsely conclude that one is causing the other (spurious regression).
A series has a linear trend if:

Yt = β0 + β1 t + et

Trends can also be exponential; this is modelled as a linear trend of the log of the series:

ln(Yt) = β0 + β1 t + et

Numerous time series are analysed after computing their logarithms or the change in logarithms, Δln(Yt) = ln(Yt) − ln(Yt−1), since they tend to exhibit growth that is approximately exponential: over the long run the series grow by a certain percentage every year, so the log of the series grows in a linear way. Also, and more technically, the standard errors of such series are proportional to their level, so the standard error of a log series is approximately constant.
To deal with trended series, it is possible to difference the series or to include a trend in the regression. We will have to be careful not to get carried away when including trends in the regression, as a polynomial trend of high order will track any series pretty well, but offer little help in finding explanatory variables affecting Yt.
Seasonality introduces patterns in the data: for example, new car models are introduced in the same month every year, so for that month we observe higher sales every year. Another example is the retail sector, where sales are expected to be higher in the run-up to Christmas. Including a set of dummies for quarters (or months) will account for the seasonality of the dependent or independent variables (remember to leave out one quarter/month, to avoid perfect multicollinearity). A minimal Stata sketch of this setup follows.
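A sketch, assuming a Stata quarterly date variable tq and a series y (both hypothetical):

tsset tq                         // tq assumed to be a quarterly (%tq) date variable
gen dy  = D.y                    // first difference
gen y_1 = L.y                    // first lag
gen lny = ln(y)                  // logs, so D.lny is the (approximate) growth rate
gen q = quarter(dofq(tq))        // quarter of the year, 1 to 4
tab q, gen(qd)                   // quarterly dummies qd1-qd4
reg y qd2 qd3 qd4                // leave one quarter out to avoid perfect multicollinearity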
If we want to understand the relationship between 2 or more variables over time, we need to assume some sort of stability over time. It is best to deal with processes that are stationary.
A stationary stochastic process has the same joint distribution for (Xt, ..., Xt+k) and (Xt+j, ..., Xt+k+j).
A trending series is then obviously nonstationary, since its mean varies over time.
Quarter    CPI        Annual rate of inflation   First lag   Change in inflation
1999:01    164.8667   1.6
1999:02    166.0333   2.8                        1.6         1.2
1999:03    167.2      2.8                        2.8         0
1999:04    168.5333   3.2                        2.8         0.4
2000:01    170.2667   4.1                        3.2         0.9

From the first to the second quarter of 1999, the CPI increased from 164.87 to 166.03, a percentage increase of 0.704%, or an annual rate of 0.704*4 = 2.8%.
This percentage change can also be computed using the difference of the logs: ln(166.03) − ln(164.87) = 0.007.
In time series, the value of Y in one period is typically correlated with its value in the next period; this is called serial correlation or autocorrelation.
The 1st autocorrelation is the correlation between Yt and Yt−1.
The 2nd autocorrelation is the correlation between Yt and Yt−2.
The jth autocorrelation is the correlation between Yt and Yt−j.
Similarly, the jth autocovariance is the covariance between Yt and Yt−j:

cov(Yt, Yt−j) = [1/(T − j − 1)] Σ_{t=j+1}^{T} (Yt − Ȳ_{j+1,T})(Yt−j − Ȳ_{1,T−j})      (7.1)

and the jth autocorrelation is:

ρ̂_j = cov(Yt, Yt−j) / var(Yt)      (7.2)

where Ȳ_{j+1,T} denotes the sample average of Yt computed over the observations t = j+1, ..., T.
Assuming that the series is stationary, the variance of Yt is equal to the variance of Yt−1.
Say we are interested in the 1st order autocorrelation: j = 1.
Worked example for the annualised inflation rate over 1999:1 to 2000:1, with Ȳ(2,5) = 3.236242, Ȳ(1,4) = 2.61377 and full-sample mean Ȳ(1,5) = 2.913801:

Quarter   Yt          Yt − Ȳ(2,5)   Yt − Ȳ(1,4)   (Yt − Ȳ(2,5))(Yt−1 − Ȳ(1,4))   Yt − Ȳ(1,5)   (Yt − Ȳ(1,5))²
1999:1    1.624036    −1.61221      −0.98973                                     −1.28976      1.663494
1999:2    2.83057     −0.40567       0.2168        0.401507                      −0.08323      0.006927
1999:3    2.810681    −0.42556       0.196911     −0.09226                       −0.10312      0.010634
1999:4    3.189793    −0.04645       0.576023     −0.00915                        0.275992     0.076172
2000:1    4.113924     0.877682      1.500154      0.505565                       1.200123     1.440296

cov = 0.268555 and var y = 0.639504, so the estimated 1st autocorrelation is corr = 0.268555/0.639504 = 0.419942.
Over the full period, the autocorrelations are:
Lag   Inflation rate   Change of inflation rate
1     0.85             −0.24
2     0.77              0.27
3     0.77              0.32
4     0.68             −0.06

Inflation is strongly positively autocorrelated, and the autocorrelation decreases as the lag increases; this reflects the long-term trend of inflation. Changes in inflation are negatively autocorrelated.
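In Stata, once the data are tsset, sample autocorrelations like those in the table can be obtained with corrgram (a sketch, assuming the inflation series is called inf):

corrgram inf, lags(4)            // autocorrelations of the inflation rate
corrgram D.inf, lags(4)          // autocorrelations of the change in inflation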
7.3 Autoregression
This is a model that relates a variable to its past values.
7.3.1 1st order autoregressive model, AR(1)
If you want to predict the future of a time series, a good place to start is the immediate past. An AR(1) process takes the following form:

Yt = β0 + β1 Yt−1 + ut

For an AR(1) to be stationary, |β1| < 1.
The covariance between yt and yt+h can easily be calculated:

y_{t+h} = β1 y_{t+h−1} + e_{t+h}
        = β1 (β1 y_{t+h−2} + e_{t+h−1}) + e_{t+h}
        = β1² y_{t+h−2} + β1 e_{t+h−1} + e_{t+h}
        = ...
        = β1^h y_t + β1^(h−1) e_{t+1} + ... + β1 e_{t+h−1} + e_{t+h}

Premultiplying by yt and taking expectations, we have:

cov(y_t, y_{t+h}) = E(y_t y_{t+h}) = β1^h E(y_t²) + β1^(h−1) E(y_t e_{t+1}) + ... + E(y_t e_{t+h}) = β1^h E(y_t²) = β1^h σ_y²

since e_{t+j} is uncorrelated with y_t. The correlation between yt and yt+h is then corr(y_t, y_{t+h}) = β1^h.
While all observations of yt are correlated, this correlation gets really small as h increases.
A systematic way to forecast the change in inflation, ΔInf_t, is to use the previous quarter's change, ΔInf_{t−1}. Using OLS and data from 1962-99, we get:
. reg dinf dinf_1 if daten>=19621 & daten<=19994, robust
Regression with robust standard errors      Number of obs = 152, F(1, 150) = 3.89, Prob > F = 0.0504, R-squared = 0.0442, Root MSE = 1.6866

        dinf |      Coef.   Robust Std. Err.       t     P>|t|     [95% Conf. Interval]
      dinf_1 |  -.2100019          .1064831    -1.97     0.050     -.4204024    .0003986
       _cons |   .0188995          .1370652     0.14     0.891     -.2519284    .2897274
Changes in inflation are negatively related to the change in the previous quarter.
More generally, an AR(1) model has the following form:

Yt = β0 + β1 Yt−1 + ut

Ŷ_{t|t−1} = β̂0 + β̂1 Yt−1 is the forecast for period t made at period t−1. The forecast error is then simply:

fe = Yt − Ŷ_{t|t−1}
 Forecast and predicted value
- The forecast is not an OLS predicted value: OLS predictions are calculated for the observations in the sample. In contrast, a forecast is made for dates beyond the data set used.
- The OLS residual is the difference between the actual and predicted value, whilst the forecast error is the difference between the future value of Y and its forecast.
Root mean squared forecast error
The root mean squared forecast error (RMSFE) is a measure of the size of the forecast error:

RMSFE = √E[(Yt − Ŷ_{t|t−1})²]

The RMSFE has 2 sources of error:
- error arising because the future value of the error term (ut) is unknown;
- error in estimating β0 and β1.
If the first source is larger than the second, then RMSFE ≈ √var(ut), which can be estimated by the standard error of the regression (SER):

SER = s_û,   where   s_û² = SSR/(n − k − 1) = [1/(n − k − 1)] Σ_{t=1}^{T} û_t²   and   SSR = Σ_{t=1}^{T} û_t²
Using our data on inflation for the period 1962-99, what would be our forecast of inflation in 2000:I? The inflation change between 1999:III and 1999:IV was 3.2 − 2.8 = 0.4. Using our previously estimated values:

Δinf_{2000:I} = 0.02 − 0.211 × Δinf_{1999:IV} = 0.02 − 0.211 × 0.4 ≈ −0.06

Our prediction of inflation for 2000:I is thus 3.2 − 0.06 ≈ 3.1.
In fact inflation for that quarter was 4.1%, so we made a large error. This is not surprising since the R² of this model is 0.04, so the lagged change in inflation explains very little of the current change in inflation. The standard error of the regression is 1.67, so ignoring the uncertainty arising from the estimation of the coefficients, RMSFE = 1.67 percentage points.
* The concept of stationarity
Time series use values of the past to predict the future. If the future is like the past, then these historical relationships can be used to forecast the future. But if the future differs fundamentally from the past, those historical relationships might not be reliable guides to the future. To make reliable forecasts, time series need to be stationary.
A time series Yt is stationary if its probability distribution does not change over time: the joint distribution of (Y_{s+1}, ..., Y_{s+T}) does not depend on s.
Two time series are said to be jointly stationary if the joint distribution of (Y_{s+1}, ..., Y_{s+T}, X_{s+1}, ..., X_{s+T}) does not depend on s.
7.3.2 pth order autoregressive model, AR(p)
The more distant past may independently affect the current value of a variable. One way to incorporate this information is to include additional lags in the AR(1) model.
An AR(p) model represents Yt as a linear function of its p lagged values:

Yt = β0 + β1 Yt−1 + β2 Yt−2 + ... + βp Yt−p + ut      (7.3)

Under the assumption that E(ut | Yt−1, ..., Yt−p) = 0, the best forecast of Yt is obtained using the p lagged values of Yt: any additional lag contains no additional information.
Regression with robust standard errors      Number of obs = 152, F(4, 147) = 6.93, Prob > F = 0.0000, R-squared = 0.2086, Root MSE = 1.5502

        dinf |      Coef.   Robust Std. Err.       t     P>|t|     [95% Conf. Interval]
      dinf_1 |   -.205024          .0992921    -2.06     0.041     -.4012484   -.0087997
      dinf_2 |  -.3159607          .0870167    -3.63     0.000     -.4879261   -.1439954
      dinf_3 |   .1977129          .0844289     2.34     0.021      .0308616    .3645642
      dinf_4 |  -.0358524          .0999216    -0.36     0.720     -.2333207     .161616
       _cons |   .0238132          .1256191     0.19     0.850     -.2244394    .2720657

. testparm dinf_2-dinf_4
 ( 1) dinf_2 = 0.0   ( 2) dinf_3 = 0.0   ( 3) dinf_4 = 0.0
       F(3, 147) = 6.54,  Prob > F = 0.0004
The F-test of the joint significance of the last 3 lags rejects the null that they are all zero. Note the improvement in the R² and SER. The forecast for inflation in 2000:I is now 3.4%.
7.4 Time series with additional predictors
Other variables and their lags can be added to improve the prediction.
A high value of unemployment tends to be associated with a future decline in the rate of inflation; this is the short-run Phillips curve (corr = −.40).

[Figure: scatter of dinf against lagged unemployment (unemp_1), with fitted values.]
Including the first lag of unemployment in our previous model of inflation, we get:
Regression with robust standard errors      Number of obs = 152, F(5, 146) = 6.26, Prob > F = 0.0000, R-squared = 0.2421, Root MSE = 1.5223

        dinf |      Coef.   Robust Std. Err.       t     P>|t|     [95% Conf. Interval]
      dinf_1 |  -.2614728          .0933617    -2.80     0.006     -.4459877   -.0769578
      dinf_2 |  -.3950265          .0976681    -4.04     0.000     -.5880523   -.2020006
      dinf_3 |   .1174997           .084659     1.39     0.167     -.0498158    .2848152
      dinf_4 |  -.0922981          .0965601    -0.96     0.341     -.2831343    .0985382
     unemp_1 |   -.233865          .1004808    -2.33     0.021     -.4324498   -.0352803
       _cons |   1.433162          .5549381     2.58     0.011      .3364128    2.529912
Despite the significance of unemployment, the R² only marginally improves. Our forecast of inflation for 2000:I is now 3.7.
We can add more lags of unemployment.
The autoregressive distributed lag model, ADL(p,q), has the following form:

Yt = β0 + β1 Yt−1 + ... + βp Yt−p + δ1 Xt−1 + δ2 Xt−2 + ... + δq Xt−q + ut      (7.4)

The assumption that E(ut | Yt−1, Yt−2, ..., Xt−1, Xt−2, ...) = 0 implies that no additional lags of either Y or X belong in the ADL model (they would not be significant); p and q are the true lag orders of the model. Additional variables and their lags can also be added.
The F-statistic on the 4 lagged values of unemployment rejects that all the coefficients are nil, so they are jointly significant, as is the F-test on the significance of the 2nd to 4th lags. The R² and SER improve. Our forecast of inflation is now 3.7%.
Regression with robust standard errors      Number of obs = 152, F(8, 143) = 8.03, Prob > F = 0.0000, R-squared = 0.3811, Root MSE = 1.3899

        dinf |      Coef.   Robust Std. Err.       t     P>|t|     [95% Conf. Interval]
      dinf_1 |  -.3607111          .0921791    -3.91     0.000     -.5429208   -.1785014
      dinf_2 |  -.3412813          .1007598    -3.39     0.001     -.5404524   -.1421102
      dinf_3 |   .0771398           .085122     0.91     0.366     -.0911203    .2453998
      dinf_4 |  -.0334765          .0873061    -0.38     0.702     -.2060537    .1391007
     unemp_1 |   -2.72218          .4812139    -5.66     0.000     -3.673391   -1.770968
     unemp_2 |   3.478651          .9014268     3.86     0.000      1.696808    5.260495
     unemp_3 |  -1.044318           .902434    -1.16     0.249     -2.828153    .7395158
     unemp_4 |   .0667446           .448793     0.15     0.882      -.820381    .9538702
       _cons |   1.331172          .4768418     2.79     0.006      .3886021    2.273741

. testparm unemp_1-unemp_4
 ( 1) unemp_1 = 0.0   ( 2) unemp_2 = 0.0   ( 3) unemp_3 = 0.0   ( 4) unemp_4 = 0.0
       F(4, 143) = 8.41,  Prob > F = 0.0000

. testparm unemp_2-unemp_4
 ( 1) unemp_2 = 0.0   ( 2) unemp_3 = 0.0   ( 3) unemp_4 = 0.0
       F(3, 143) = 10.12,  Prob > F = 0.0000
It is also possible to add more covariates, but the notation gets a bit complicated. To simplify it, let's introduce the lag notation.
The lag operator transforms a variable into its lag:

L Yt = Yt−1

By applying the lag operator twice, we obtain the second lag: L(L Yt) = L(Yt−1) = Yt−2 = L² Yt. This can be generalised to the jth lag, L^j Yt = Yt−j.
The lag polynomial is:

a(L) = a0 + a1 L + a2 L² + ... + ap L^p = Σ_{j=0}^{p} aj L^j

Multiplying Yt by a(L) yields:

a(L) Yt = (Σ_{j=0}^{p} aj L^j) Yt = Σ_{j=0}^{p} aj L^j Yt = Σ_{j=0}^{p} aj Yt−j = a0 Yt + a1 Yt−1 + ... + ap Yt−p

Thus, an AR(p) model can be written as a(L) Yt = β0 + ut, and similarly an ADL(p,q) model is a(L) Yt = β0 + c(L) Xt−1 + ut.
If there is more than one additional covariate:

a(L) Yt = β0 + c1(L) X1,t−1 + c2(L) X2,t−1 + ut      (7.5)

The number of lags can be different for each regressor, say q1 for X1 and q2 for X2.
To estimate (7.5) we need the following assumptions:
- E(ut | Yt−1, ..., X1,t−1, ..., Xk,t−1, ...) = 0
- (Yt, X1t, ..., Xkt) have a stationary distribution
- (Yt, X1t, ..., Xkt) and (Yt−j, X1,t−j, ..., Xk,t−j) become independent as j gets large. This is also referred to as weak dependence; it ensures that in large samples there is sufficient randomness in the data for the law of large numbers and the central limit theorem to hold
- (Yt, X1t, ..., Xkt) have nonzero, finite 4th moments
- There is no perfect multicollinearity
7.5 Statistical inference, Granger causality test
* Granger causality
The Granger causality test is the F-test of the hypothesis that the coefficients on all the lagged values of one variable in (7.5) are zero. This null implies that these regressors have no predictive power for Yt (this variable does not "cause" Yt); Granger (1969). Causality here means only that this variable allows us to make a better forecast of Yt; it carries no connotation of true causation. Thus, it is usually referred to as Granger causality.
Looking back at our last regression, the F-test that all the coefficients on unemployment are null yields F = 8.41, so at the 1% level we reject that null: unemployment Granger-causes the change in inflation. Past values of unemployment contain information that is useful for forecasting changes in the inflation rate, beyond that contained in past values of the inflation rate.
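The Granger causality test above is simply an F-test on the unemployment lags; a sketch of the commands, assuming the lag variables have been generated as in the output shown earlier:

reg dinf dinf_1-dinf_4 unemp_1-unemp_4, robust
testparm unemp_1-unemp_4          // H0: unemployment does not Granger-cause dinf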
* Forecast uncertainty
The forecast error consists of 2 components: uncertainty about the regression coefficients and uncertainty due to ut. In the case of an ADL(1,1) process, the forecast error is:

Y_{T+1} − Ŷ_{T+1} = u_{T+1} − [ (β̂0 − β0) + (β̂1 − β1) Y_T + (δ̂1 − δ1) X_T ]      (7.6)

Because u_{T+1} has conditional mean zero and is homoskedastic, var(u_{T+1}) = σ_u², and u_{T+1} is uncorrelated with the expression in brackets. The mean squared forecast error is thus:

MSFE = E[(Y_{T+1} − Ŷ_{T+1})²] = σ_u² + var[ (β̂0 − β0) + (β̂1 − β1) Y_T + (δ̂1 − δ1) X_T ]

To compute a forecast interval, it is convenient to assume that u_{T+1} is normally distributed. Then (7.6) and the CLT imply that the forecast error is the sum of 2 independent normally distributed terms, so the forecast error is itself normally distributed with variance equal to the MSFE.
A 95% forecast interval is then given by:

Ŷ_{T+1|T} ± 1.96 SE(Y_{T+1} − Ŷ_{T+1|T})
7.6 Determining the order of an autoregression
More lags mean more information is used, but at the cost of additional estimation uncertainty (estimating too many coefficients).
- F-statistics
Start with a model with a large number of lags and test whether the coefficient on the last lag is significant; if not, reduce the number of lags and start the process again. When the true order of the model is p, this test will still estimate the order to be greater than p, 5% of the time.
- Information criteria
Information criteria trade off the improvement in the fit of the model against the number of estimated coefficients. The most popular are the Bayes Information Criterion (BIC), also called the Schwarz information criterion, and the Akaike Information Criterion (AIC):

BIC(p) = ln[SSR(p)/T] + (p + 1) ln(T)/T ,      AIC(p) = ln[SSR(p)/T] + (p + 1) 2/T

You choose the model minimising the information criterion. The difference between the AIC and the BIC is that the ln T term in the BIC is replaced by 2 in the AIC, so the second term (the penalty for the number of lags) is not as large: a smaller decrease in the SSR is needed with the AIC to justify including an additional regressor. Even in large samples, the AIC will overestimate p with non-zero probability.
Similarly, the optimal number of lags of the additional regressors needs to be estimated. The same methods can be used; if the regression has K coefficients including the intercept, then the BIC is:

BIC(K) = ln[SSR(K)/T] + K ln(T)/T

Important: all models should be estimated using the same sample, so make sure to start with the model with the most lags, and keep this as your working sample for the comparison. In practice a convenient shortcut is to impose that all the regressors have the same number of lags, to reduce the number of models that need comparing. A minimal sketch of this comparison in Stata is given below.
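A sketch of the BIC comparison in Stata, assuming the lag variables dinf_1-dinf_4 already exist; the if condition fixes the estimation sample at the observations where all four lags are available:

forvalues p = 1/4 {
    local rhs
    forvalues j = 1/`p' {
        local rhs `rhs' dinf_`j'
    }
    quietly reg dinf `rhs' if dinf_4 < .
    display "p = `p'  BIC = " ln(e(rss)/e(N)) + (`p' + 1)*ln(e(N))/e(N)
}

The model with the smallest BIC is retained; replacing ln(e(N)) by 2 in the penalty gives the AIC.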
7.7 Nonstationarity I: trends
If the dependent variable and/or regressors are nonstationary, then hypothesis testing and forecasts will be unreliable. There are two common types of nonstationarity, each with its own solutions.
7.7.1 What is a trend?
A trend is a persistent long-term movement of a variable over time.
- A deterministic trend is a non-random function of time (linear in time, for example).
- A stochastic trend is random and varies over time. For example, a stochastic trend in inflation may exhibit a prolonged period of increase followed by a period of decrease. Since economic series are the consequences of complex economic forces, trends are usefully thought of as having a large unpredictable, random component.
The random walk model of a trend
The simplest model of a variable with a stochastic trend is the random walk. A time series Yt is said to follow a random walk if the change in Yt is iid:

Yt = Yt−1 + ut

where ut has conditional mean zero: E(ut | Yt−1, Yt−2, ...) = 0.
The basic idea of a random walk is that the value of the series tomorrow is its value today plus an unpredictable component.
When a series has a tendency to increase or decrease, the random walk can include a drift component:

Yt = β0 + Yt−1 + ut = tβ0 + Σ_{j=0}^{t} u_j      (taking Y0 = 0)

If Yt follows a random walk then it is not stationary: the variance of the random walk increases over time, so the distribution of Yt changes over time.

var(Yt) = var(Yt−1) + var(ut)

For Yt to be stationary, we must have var(Yt) = var(Yt−1), which imposes var(ut) = 0.
Alternatively, say Y0 = 0; then Y1 = u1, Y2 = u1 + u2 and, more generally, Yt = u1 + u2 + ... + ut. Because the ut are uncorrelated, var(Yt) = tσu².
The variance of Yt depends on t and increases as t increases. Because the variance of a random walk increases without bound, its population autocorrelations are not defined.
The random walk can be seen as the special case of the AR(1) model in which β1 = 1; Yt then contains a stochastic trend (and, with drift, a deterministic trend as well). If |β1| < 1, then Yt is stationary as long as ut is stationary.
For an AR(p) to be stationary, the roots of the following polynomial must all be greater than 1 in absolute value. The roots are found by solving:

1 − β1 z − β2 z² − ... − βp z^p = 0

In the special case of an AR(1), the polynomial is simply 1 − β1 z = 0, so z = 1/β1, and the condition that the root is greater than unity is equivalent to |β1| < 1.
If an AR(p) has a root equal to one, the series is said to have a unit (autoregressive) root. If Yt has a unit root, it contains a stochastic trend and is not stationary (the two terms can be used interchangeably).
If a series has a unit root, the estimator of the autoregressive coefficient in an AR(p) is biased towards 0, t-stats have a non-normal distribution, and two independent series may appear related.
1) Bias towards 0
Suppose that the true model is a random walk (Yt = Yt−1 + ut) but the econometrician estimates an AR(1) (Yt = β1 Yt−1 + ut). Since the series is nonstationary, the OLS assumptions are not satisfied and it can be shown that E(β̂1) ≈ 1 − 5.3/T. So with 20 years of quarterly data, you would expect β̂1 ≈ 0.934.
A Monte Carlo with 100 replications gives:

    Variable |   Obs       Mean    Std. Dev.        Min        Max
        RES1 |   100   .9270481    .0570009   .7792342   1.010915
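A single replication of that Monte Carlo can be sketched in Stata as follows (illustrative only; in recent versions rnormal()/runiform() replace the older random-number functions):

clear
set obs 80                        // 20 years of quarterly data
gen t = _n
tsset t
gen u = invnorm(uniform())        // standard normal shocks
gen y = sum(u)                    // random walk: y_t = u_1 + ... + u_t
reg y L.y                         // the estimated lag coefficient is biased below 1

Repeating this many times and averaging the estimated coefficients reproduces a mean close to 1 − 5.3/T ≈ 0.93.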
2) non normal distribution
If a regressor has a stochastic trend, then OLS t-statistics have a nonnormal distribution under the
null hypothesis. One important case in which it is possible to tabulate this distribution is in the
context of an AR with unit root; we will go back to this.
3) spurious regression
US inflation was rising from the mid-60s through the early 80’s, so was the Japanese GDP over
the same period.
. reg inflation gdp_jp if daten>=19651 & daten<=19814, robust
Regression with robust standard errors      Number of obs = 68, F(1, 66) = 113.38, Prob > F = 0.0000, R-squared = 0.5605, Root MSE = 2.2989

   inflation |      Coef.   Robust Std. Err.       t     P>|t|     [95% Conf. Interval]
      gdp_jp |   .1871328          .0175741    10.65     0.000       .152045    .2222207
       _cons |  -2.938637          .7660354    -3.84     0.000     -4.468076   -1.409198

. reg inflation gdp_jp if daten>=19821 & daten<=19994, robust
Regression with robust standard errors      Number of obs = 70, F(1, 68) = 5.49, Prob > F = 0.0221, R-squared = 0.0797, Root MSE = 1.5262

   inflation |      Coef.   Robust Std. Err.       t     P>|t|     [95% Conf. Interval]
      gdp_jp |  -.0304821          .0130145    -2.34     0.022     -.0564521   -.0045121
       _cons |    6.30274          1.378873     4.57     0.000      3.551242    9.054239
These conflicting results arise because both series have stochastic trends.

[Figure: U.S. inflation and Japanese GDP, quarterly, 1960 to 2000; both series trend upwards over the first subperiod.]

There is no reason for the 2 series to be related: the strong relationship found over the first period is spurious and only due to the stochastic trends. One special case where estimates are reliable despite the presence of trends is when the trend component is the same for the two series; the series are then said to be cointegrated (see section 7.9).
7.7.2 Testing for unit root
The most commonly used test in practice is the Dickey-Fuller test.
* Dickey-Fuller test in the AR(1) model
In the AR(1) case, we want to test whether β1 = 1; if we cannot reject this null hypothesis then Yt contains a unit root and is not stationary (contains a stochastic trend). However, the test is best implemented by subtracting Yt−1 from both sides, so that it becomes a test of H0: δ = 0 vs H1: δ < 0 in

ΔYt = β0 + δYt−1 + ut      where δ = β1 − 1

The OLS t-stat testing δ = 0 is called the Dickey-Fuller statistic.
Note: the test is one-sided because the relevant alternative is that the series is stationary.
Regression with robust standard errors      Number of obs = 68, F(1, 66) = 4.47, Prob > F = 0.0383, R-squared = 0.0852, Root MSE = 1.7954

        dinf |      Coef.   Robust Std. Err.       t     P>|t|     [95% Conf. Interval]
       inf_1 |  -.1559304          .0737577    -2.11     0.038     -.3031924   -.0086683
       _cons |    1.07776          .4075892     2.64     0.010      .2639821    1.891538
The DF statistic does not have a normal distribution, so the critical values are specific to the test.

Table 7.1 Critical values for the (Augmented) Dickey-Fuller test
                              10%      5%      1%
Intercept only               -2.57    -2.86   -3.43
Intercept and time trend     -3.12    -3.41   -3.96

So in the previous regression we cannot reject, at any conventional significance level, that δ = 0; the series has a unit root and is not stationary.
* Dickey-Fuller test in the AR(p) model
For an AR(p), the Dickey-Fuller test is based on the following regression:

ΔYt = β0 + δYt−1 + γ1 ΔYt−1 + γ2 ΔYt−2 + ... + γp ΔYt−p + ut      (7.7)

H0: δ = 0 vs H1: δ < 0
The ADF statistic is the OLS t-statistic testing δ = 0. If H0 is rejected, Yt is stationary.
The number of lags p needed is unknown. Studies suggest that for the ADF it is better to have too many lags rather than too few, so it is recommended to use the AIC to determine the number of lags for the ADF.
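In Stata the ADF test is available through dfuller, or can be run as the regression in (7.7) directly; a sketch for the inflation series, assuming it has been tsset:

dfuller inf, lags(4)              // intercept only, 4 lagged differences
dfuller inf, lags(4) trend        // adds a deterministic time trend
reg D.inf L.inf L(1/4)D.inf       // regression form: the ADF statistic is the t-stat on L.inf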
* Dickey-Fuller test allowing for a linear trend
Some series have an obvious linear trend (Japanese GDP), so it would be uninformative to test their stationarity without accounting for the trend. If Yt is stationary around a deterministic linear trend, the trend must be added to (7.7), which becomes:

ΔYt = β0 + αt + δYt−1 + γ1 ΔYt−1 + γ2 ΔYt−2 + ... + γp ΔYt−p + ut

If H0 is rejected, Yt is stationary around a deterministic time trend.
If the series is found to have a unit root, then the first difference of the series does not have a trend: for example, if Yt = β0 + Yt−1 + ut then ΔYt = β0 + ut is stationary.
Remark: the power of a test is the probability of rejecting a false null hypothesis (1 minus the probability of a Type II error). Monte Carlo studies have shown that unit root tests have low power: they cannot distinguish between a unit root and a stationary near-unit-root process. Thus the tests will often indicate that a series contains a unit root.
yt = 1.1 yt−1 − 0.1 yt−2 + εt
zt = 1.1 zt−1 − 0.15 zt−2 + εt

Checking for a unit root in the first process, we solve 1 − 1.1y + 0.1y² = 0, i.e. (y − 1)(0.1y − 1) = 0, giving roots y = 1 and y = 10.
For the second process the characteristic roots are 0.9405 and 0.1595; equivalently, the roots of 1 − 1.1z + 0.15z² = 0 are both greater than one in absolute value.
So the first process has a unit root and the second one is stationary.

[Figure: simulated paths of y and z over 400 periods.]
Similarly, it can be difficult to distinguish between a trend-stationary process and a unit root process with drift:

wt = 1 + 0.02t + εt
xt = 0.02 + xt−1 + εt/3

[Figure: simulated paths of w (trend stationary) and x (random walk with drift) over 400 periods.]
In the short run, the forecasts from stationary and non-stationary models will be close; however, the long-term forecasts will be quite different.
Also, the power of the unit root test is drastically affected by the data generating process. If we inappropriately omit the intercept or the time trend, the power of the UR test can go to 0. For example, omitting the trend leads to an upward bias in the estimated value of δ in:

ΔYt = β0 + αt + δYt−1 + γ1 ΔYt−1 + γ2 ΔYt−2 + ... + γp ΔYt−p + ut      (7.8)

Thus a procedure for UR testing can take the following form:
1- Use the least restrictive model (7.8) to test for a UR. UR tests have low power to reject H0, so if H0 is rejected there is no need to proceed further. If not, go to step 2.
2- Test α = 0. If the trend is significant, keep using (7.8) to test for the UR (step 1). If the trend is not significant, use (7.7) to test for the UR; if H0 is rejected conclude there is no unit root, if not, go to step 3.
3- Test β0 = 0. If the intercept is significant, go back to step 2. If not, use ΔYt = δYt−1 + Σ_{j=1}^{p} γj ΔYt−j to test for the UR.
7.8 Nonstationarity II: breaks
A second type of nonstationarity arises when the population regression function changes over the course of the sample.
A break can arise either from a discrete change in the population regression coefficients at a distinct date (a policy change) or from a gradual evolution of the coefficients over a longer period of time (a change in the structure of the economy).
If the break is not noticed, estimates will be based on the average behaviour of the series over the period and not on the true relationship at the end of the period, so forecasts will be poor.
7.8.1 Testing for breaks at a known date
To keep it simple, let's consider the ADL(1,1) model, and denote by τ the period at which the break is supposed to have happened.
Create a dummy variable Dt taking the value 0 before τ and 1 after τ; D is also interacted with Yt−1 and Xt−1:

Yt = β0 + β1 Yt−1 + δ1 Xt−1 + γ0 Dt + γ1 (Dt × Yt−1) + γ2 (Dt × Xt−1) + ut

Under the hypothesis of no break, γ0 = γ1 = γ2 = 0, which can be tested using an F-test. Under the alternative of a break, at least one of these coefficients will be different from 0. This is usually referred to as a Chow test.
This approach can be modified to check for a break in a subset of the coefficients by including only the binary variable interactions for the subset of regressors of interest.
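A Chow test of this kind can be sketched in Stata as follows; the break date 1980q1 and the variables y, x and tq are purely illustrative:

gen D    = (tq >= tq(1980q1))     // 0 before the candidate break date, 1 after
gen Dy_1 = D*L.y
gen Dx_1 = D*L.x
reg y L.y L.x D Dy_1 Dx_1, robust
testparm D Dy_1 Dx_1              // F-test of no break: gamma0 = gamma1 = gamma2 = 0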
7.8.2 Testing for a break at an unknown date
Often the date of a possible break is unknown, but you may suspect the range during which the break took place, say between τ0 and τ1. The Chow test is then computed for every date between τ0 and τ1, and the largest of the resulting F-statistics is used to test for a break at an unknown date. This is often referred to as the Quandt Likelihood Ratio (QLR) statistic. Since the QLR is the largest of a series of F-statistics, its distribution is special and depends on the number of restrictions tested, q (the number of coefficients, including the intercept, allowed to break), and on τ0 and τ1 expressed as fractions of the total sample size. For the large-sample approximation to the distribution of the QLR to be a good one, τ0 and τ1 cannot be too close to the ends of the sample. For this reason, the QLR is computed over a trimmed range, so that τ0 ≥ 0.15T and τ1 ≤ 0.85T.
The QLR test can detect a single discrete break, multiple discrete breaks and/or a slow evolution of the regression function. If there is a distinct break in the regression function, the date at which the largest Chow statistic occurs is an estimator of the break date.
Say we want to check the stability of our estimates of the determinants of inflation in the US over the 1962:I to 1999:IV period. More specifically, we are concerned that the intercept and the unemployment coefficients may have changed over time. The first period at which we can check for a structural break (0.15T into the sample) is 1967:4. So we create a dummy variable for observations after 1967:4 and interact it with the unemployment variables:
Source: Model SS = 184.330595 (df 13, MS 14.1792765), Residual SS = 283.045198 (df 138, MS 1.91246756), Total SS = 467.375793 (df 151, MS 2.90295524)
Number of obs = 152, F(13, 138) = 7.41, Prob > F = 0.0000, R-squared = 0.3944, Adj R-squared = 0.3412, Root MSE = 1.3829

        dinf |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
      dinf_1 |  -.4009554    .0824812    -4.86     0.000     -.5639484   -.2379623
      dinf_2 |  -.3433158    .0892349    -3.85     0.000     -.5196549   -.1669767
      dinf_3 |   .0545284    .0850863     0.64     0.523     -.1136126    .2226693
      dinf_4 |   -.038809    .0754606    -0.51     0.608     -.1879284    .1103105
     unemp_1 |  -1.719641    1.254766    -1.37     0.173     -4.199214    .7599307
     unemp_2 |    3.46834    2.364168     1.47     0.144     -1.203546    8.140225
     unemp_3 |  -3.370699    2.164944    -1.56     0.122     -7.648893    .9074963
     unemp_4 |   1.666702    1.155521     1.44     0.151     -.6167486    3.950152
           D |   1.775541    1.839904     0.97     0.336     -1.860335    5.411417
   D_unemp_1 |  -1.225527    1.351754    -0.91     0.366     -3.896758    1.445703
   D_unemp_2 |   .2032217    2.560099     0.08     0.937     -4.855847     5.26229
   D_unemp_3 |   2.394236    2.370403     1.01     0.314      -2.28997    7.078442
   D_unemp_4 |  -1.668078    1.255425    -1.33     0.186     -4.148952    .8127955
       _cons |  -.2276938    1.757672    -0.13     0.897     -3.701068    3.245681

. testparm D-D_unemp_4
 ( 1) D = 0.0   ( 2) D_unemp_1 = 0.0   ( 3) D_unemp_2 = 0.0   ( 4) D_unemp_3 = 0.0   ( 5) D_unemp_4 = 0.0
       F(5, 148) = 0.85,  Prob > F = 0.5135
F = 0.85, so there is no evidence of a break at 1967:4. We now re-estimate this model for each candidate break date, setting D = 1 if t ≥ τ, for τ running from 1968:1 until 1993:I.
For example, a break at 1981:4 leads to
Regression with robust standard errors      Number of obs = 152, F(13, 138) = 8.42, Prob > F = 0.0000, R-squared = 0.4223, Root MSE = 1.367

        dinf |      Coef.   Robust Std. Err.       t     P>|t|     [95% Conf. Interval]
      dinf_1 |  -.4075559          .0932063    -4.37     0.000      -.591853   -.2232587
      dinf_2 |  -.3777853          .0977229    -3.87     0.000     -.5710131   -.1845574
      dinf_3 |   .0515292          .0798247     0.65     0.520     -.1063085    .2093669
      dinf_4 |  -.0260024          .0826179    -0.31     0.753     -.1893631    .1373584
     unemp_1 |  -2.705181          .6911244    -3.91     0.000     -4.071744   -1.338618
     unemp_2 |    3.54704          1.300035     2.73     0.007      .9764752    6.117605
     unemp_3 |  -2.025859          1.188034    -1.71     0.090     -4.374964    .3232453
     unemp_4 |   .9846463          .5641419     1.75     0.083     -.1308334    2.100126
           D |  -.0729984          .9544203    -0.08     0.939     -1.960177     1.81418
   D_unemp_1 |  -.5718067          .8773241    -0.65     0.516     -2.306543    1.162929
   D_unemp_2 |   .1754026          1.576346     0.11     0.912     -2.941512    3.292317
   D_unemp_3 |    2.79729          1.599601     1.75     0.083     -.3656069    5.960186
   D_unemp_4 |  -2.432152          .8388761    -2.90     0.004     -4.090865   -.7734395
       _cons |   1.350888           .733964     1.84     0.068      -.100382    2.802157

. testparm D-D_unemp_4
 ( 1) D = 0.0   ( 2) D_unemp_1 = 0.0   ( 3) D_unemp_2 = 0.0   ( 4) D_unemp_3 = 0.0   ( 5) D_unemp_4 = 0.0
       F(5, 138) = 3.31,  Prob > F = 0.0074
With 5 restrictions, the critical values of the QLR statistic are 3.26, 3.66 and 4.53 at the 10%, 5% and 1% significance levels respectively.
So for 1981:4 we reject, at the 10% level, the null hypothesis that the coefficients on the dummy and interacted terms are all zero, and we conclude that there is a break in the series at that point for at least one of the 5 coefficients.
7.8.3 Pseudo out-of-sample forecasts
1) Choose the number of observations P for which you will generate pseudo out-of-sample forecasts, say P = 10% of the sample, and define s = T − P.
2) Estimate the regression on the shortened sample t = 1, ..., s.
3) Compute the forecast for the first period beyond the shortened sample: Ỹ_{s+1|s}.
4) Compute the forecast error: ũ_{s+1} = Y_{s+1} − Ỹ_{s+1|s}.
5) Repeat steps 2-4 for each date from T − P + 1 to T − 1 (re-estimating the regression each time).
6) The pseudo forecast errors can be examined to see if they are consistent with a stationary relationship.
For example, going back to our prediction of inflation: using data up to 1993:4 we can predict inflation for 1994:1; doing so up until 1999:4, we have 24 pseudo forecasts.
Regression with robust standard errors      Number of obs = 128, F(13, 114) = 7.37, Prob > F = 0.0000, R-squared = 0.4210, Root MSE = 1.4729

        dinf |      Coef.   Robust Std. Err.       t     P>|t|     [95% Conf. Interval]
      dinf_1 |  -.4190169          .0998416    -4.20     0.000     -.6168024   -.2212315
      dinf_2 |  -.3961329          .1031673    -3.84     0.000     -.6005065   -.1917593
      dinf_3 |    .039491          .0844715     0.47     0.641     -.1278463    .2068283
      dinf_4 |  -.0449508          .0860523    -0.52     0.602     -.2154198    .1255181
     unemp_1 |  -2.679112          .6980463    -3.84     0.000     -4.061936   -1.296288
     unemp_2 |   3.465039          1.325757     2.61     0.010      .8387247    6.091353
     unemp_3 |  -1.987951           1.22184    -1.63     0.106     -4.408407    .4325056
     unemp_4 |   .9924426          .5769953     1.72     0.088     -.1505805    2.135466
           D |   .4808356          1.389741     0.35     0.730      -2.27223    3.233901
   D_unemp_1 |  -.9707623          .9465191    -1.03     0.307     -2.845809    .9042847
   D_unemp_2 |   .6794326          1.700203     0.40     0.690     -2.688656    4.047521
   D_unemp_3 |   2.716406          1.821819     1.49     0.139     -.8926028    6.325415
   D_unemp_4 |  -2.525234          .9671997    -2.61     0.010     -4.441249   -.6092183
       _cons |   1.414308          .7407146     1.91     0.059     -.0530417    2.881658
The inflation rate is predicted to rise by 1.9 percentage points. But the true value is 0.9, so our forecast error is −1 percentage point.
Doing this 24 times, we find that the average forecast error is −0.37, which is significantly different from 0 (t = −2.71). This suggests that the forecasts were biased over the period, systematically forecasting higher inflation than occurred, which in turn suggests that the model has been unstable (a break).
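A minimal sketch of the pseudo out-of-sample loop in Stata, here for a simple AR(1) in dinf with an observation counter t (the window 140 to 151 is illustrative):

gen fe = .
forvalues s = 140/151 {
    quietly reg dinf dinf_1 if t <= `s'
    replace fe = dinf - (_b[_cons] + _b[dinf_1]*dinf_1) if t == `s' + 1
}
summarize fe                      // mean and spread of the pseudo forecast errors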
7.9 Cointegration
Sometimes, 2 or more series have the same stochastic trend in common. In this special case,
regression analysis can reveal long-run relationships among time series variables.
7.9.1 Cointegration and error correction
Series can move together so closely over the long run that they appear to have the same trend component; for example, the 3-month and 12-month US interest rates.

[Figure: the two interest rate series (FYFF and FYGM3), quarterly, 1959:1 to 2000:4.]

Moreover, the spread between the two series does not appear to have a trend.

[Figure: the spread between the two interest rates, 1960 to 2000.]

The two series have a common stochastic trend; they are said to be cointegrated.
Suppose Xt and Yt are integrated of order 1. If there exists a coefficient θ such that Yt − θXt is integrated of order 0 (stationary), then the 2 series are said to be cointegrated with cointegrating coefficient θ. If the 2 series are not integrated of the same order to start with, they cannot be cointegrated.
Unit root testing can be extended to test for cointegration. If Xt and Yt are cointegrated, then Yt − θXt is I(0) (the null hypothesis of a unit root is rejected); otherwise Yt − θXt is I(1).
* Testing for cointegration when θ is known
In some cases, economic theory suggests a value of θ. In this case a DF test on the series zt = Yt − θXt is conducted.
In our example, let's assume that theory suggests θ = 1. There is no trend in the spread, so we simply estimate:
. reg dspread spread_1 dspread_1 dspread_2 dspread_3 dspread_4
Source: Model SS = 20.0646226 (df 5, MS 4.01292452), Residual SS = 53.706531 (df 157, MS .342079815), Total SS = 73.7711536 (df 162, MS .455377491)
Number of obs = 163, F(5, 157) = 11.73, Prob > F = 0.0000, R-squared = 0.2720, Adj R-squared = 0.2488, Root MSE = .58488

     dspread |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
    spread_1 |  -.2506278    .0719562    -3.48     0.001     -.3927548   -.1085007
   dspread_1 |   -.283247     .091436    -3.10     0.002     -.4638504   -.1026437
   dspread_2 |   .0230289    .0910197     0.25     0.801     -.1567521      .20281
   dspread_3 |  -.0599991    .0895151    -0.67     0.504     -.2368085    .1168102
   dspread_4 |    .048277    .0791148     0.61     0.543     -.1079897    .2045436
       _cons |   .1548892     .063015     2.46     0.015      .0304227    .2793557
Lags    AIC
4      -1.049
3      -1.059
2      -1.063
1      -1.072

So our preferred model is the one with 4 lagged values of dspread. The t-stat on spread_1 is −3.48, which exceeds (in absolute value) the 1% critical value of the ADF, so we reject the null hypothesis that δ = 0: the spread does not have a unit root and is therefore I(0). The 2 interest rate series are cointegrated.
* Testing for cointegration when θ is unknown
In general θ is unknown, and the cointegrating coefficient must be estimated prior to testing for a unit root. This preliminary step makes it necessary to use different critical values for the subsequent unit root test.
Step 1: estimate Yt = α + θXt + νt      (7.12)
Step 2: a Dickey-Fuller t-test is used to test for a unit root in the residuals ν̂t from (7.12).
This procedure is called the Engle-Granger Augmented Dickey-Fuller (EGADF) test. Critical values for the EGADF are:
Number of X's in (7.12)     10%      5%      1%
1                          -3.12    -3.41   -3.96
2                          -3.52    -3.80   -4.36
3                          -3.84    -4.16   -4.73
4                          -4.20    -4.49   -5.07
. reg dnu nu_1 dnu_1 dnu_2 dnu_3 dnu_4
Source: Model SS = 31.2052888 (df 5, MS 6.24105775), Residual SS = 45.5212134 (df 157, MS .289944035), Total SS = 76.7265022 (df 162, MS .473620384)
Number of obs = 163, F(5, 157) = 21.53, Prob > F = 0.0000, R-squared = 0.4067, Adj R-squared = 0.3878, Root MSE = .53846

         dnu |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
        nu_1 |  -.5739985    .1150186    -4.99     0.000     -.8011821   -.3468149
       dnu_1 |  -.1574595    .1139771    -1.38     0.169     -.3825858    .0676667
       dnu_2 |   .0752181    .1052652     0.71     0.476     -.1327006    .2831369
       dnu_3 |   .0053021    .0974368     0.05     0.957     -.1871541    .1977583
       dnu_4 |   .1237554    .0782992     1.58     0.116     -.0309003     .278411
       _cons |   .0016953    .0421806     0.04     0.968     -.0816193    .0850099
We reject the null hypothesis of a unit root in the residual series (t = −4.99 on nu_1, beyond the EGADF critical values), so the two series are cointegrated.
* Error correction model
If 2 series are cointegrated, their time paths are influenced by any deviation from the long-run equilibrium: if the system is to return to the long-run equilibrium, some of the variables must respond to the magnitude of the disequilibrium. For example, if the gap between short- and long-term interest rates is large, arbitrageurs will intervene in the market, so that the disequilibrium gap disappears. If 2 series are cointegrated, then the forecasts of ΔYt and ΔXt can be improved by including an error correction term.
If Xt and Yt are cointegrated, one way to eliminate the stochastic trend is to compute the series Yt − θXt, which is stationary and can be used for analysis. The term Yt − θXt is called the error correction term:

ΔYt = β0 + β1 ΔYt−1 + ... + βp ΔYt−p + γ1 ΔXt−1 + ... + γq ΔXt−q + α1 (Yt−1 − θXt−1) + ut

Similarly, we also have:

ΔXt = β0' + β1' ΔYt−1 + ... + βp' ΔYt−p + γ1' ΔXt−1 + ... + γq' ΔXt−q + α2 (Yt−1 − θXt−1) + vt

If θ is unknown, then the error correction models can be estimated using the estimated residual ν̂t−1.
Interest rates change according to stochastic shocks and to the previous period's deviation from the long-term equilibrium Yt − θXt = 0. The alphas can be interpreted as speeds of adjustment.
The absence of Granger causality for cointegrated variables requires that the speed of adjustment is 0 as well as all the gammas (respectively, all the betas); of course at least one of the alphas has to be non-zero for the 2 series to be cointegrated.
For ΔYt to be I(0), Yt−1 − θXt−1 needs to be I(0), since the error term and all the first-difference terms are I(0); hence the 2 series are cointegrated CI(1,1).
The Engle-Granger cointegration procedure (a minimal Stata sketch follows this list):
1- Test the integration order of both series using DF tests; both series must be integrated of the same order to stand a chance of being cointegrated.
2- Estimate the long-run relationship Yt = α + θXt + νt and test ν̂ for stationarity; if it is I(0), then Y and X are cointegrated.
3- Estimate the error correction model; since all terms are stationary, the usual test statistics apply.
4- Assess model adequacy.
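A sketch with hypothetical I(1) series y and x (the residual-based unit root test must use the EGADF critical values above, not the standard DF ones):

reg y x                                     // step 2: long-run relationship
predict ehat, resid
reg D.ehat L.ehat L(1/4)D.ehat              // ADF-type regression on the residuals
reg D.y L(1/4)D.y L(1/4)D.x L.ehat          // step 3: error correction model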
Back to our example: we have shown that, assuming θ = 1, the 2 series are cointegrated.
. reg dfyff dfyff_1-dfyff_4 dfy3m_1-dfy3m_4 spread_1
Source: Model SS = 69.5220297 (df 9, MS 7.72466996), Residual SS = 249.54597 (df 153, MS 1.63101941), Total SS = 319.067999 (df 162, MS 1.96955555)
Number of obs = 163, F(9, 153) = 4.74, Prob > F = 0.0000, R-squared = 0.2179, Adj R-squared = 0.1719, Root MSE = 1.2771

       dfyff |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
     dfyff_1 |  -.0014132    .2136881    -0.01     0.995     -.4235733    .4207468
     dfyff_2 |  -.0264828    .2208415    -0.12     0.905     -.4627751    .4098095
     dfyff_3 |   .1002626    .2129522     0.47     0.638     -.3204438    .5209689
     dfyff_4 |   .1444413    .1802188     0.80     0.424     -.2115972    .5004798
     dfy3m_1 |   .0068489    .2541142     0.03     0.979     -.4951767    .5088745
     dfy3m_2 |  -.1758844     .275382    -0.64     0.524     -.7199263    .3681576
     dfy3m_3 |   .2220654    .2653096     0.84     0.404     -.3020777    .7462086
     dfy3m_4 |  -.3159166    .2272404    -1.39     0.166     -.7648506    .1330174
    spread_1 |  -.4598352    .1585354    -2.90     0.004     -.7730361   -.1466342
       _cons |   .2955998    .1381308     2.14     0.034      .0227098    .5684897
The lagged spread does help to predict the change in the interest rate (the coefficient on spread_1 is significant, t = −2.90).
Chapter 9: Limited dependent variables
Limited dependent variables are variables that:
- only take 2 values (working / not working),
- have no specific order (black / white; membership / no membership),
- take a limited number of discrete values (number of children),
- are not numerical (mode of transport: walk, bus, cycle, car).
9.1 The linear probability model
We are interested in the determinants of staying in post-compulsory education among children aged 16-17. We observe whether they are currently receiving some schooling, but not the total amount of schooling they will receive. This outcome is clearly binary, and we believe it to be a function of paternal pay.
. reg stilled lndadpay, robust
Regression with robust standard errors      Number of obs = 15688, F(1, 15686) = 398.77, Prob > F = 0.0000, R-squared = 0.0267, Root MSE = .45686

     stilled |      Coef.   Robust Std. Err.       t     P>|t|     [95% Conf. Interval]
    lndadpay |   .1250107          .0062602    19.97     0.000        .11274    .1372814
       _cons |   -.070091          .0388235    -1.81     0.071     -.1461895    .0060075
So when dad's log pay = 5, the predicted value of stilled is 0.55. How can we interpret this coefficient? If the father has pay of 5 log points, the probability of the child still being in education is estimated to be 55%.
[Figure: still in education (0/1) and fitted values plotted against lndadpay.]
The population regression function is the expected value of Y given the regressors X:

E(Y | X1, X2, ..., Xk) = Pr(Y = 1 | X1, X2, ..., Xk)

For a binary variable, the predicted value from the population regression is the probability that Y = 1 given X. In the context of a binary dependent variable, this model is called the linear probability model.
β1 is the change in the probability that Y = 1 associated with a unit change in X1, and Ŷ is the predicted probability that the dependent variable equals 1.
So if father's log income increases from 5 to 6, the probability of staying in post-compulsory education increases by 12.5 percentage points (statistically significant).
This model has all the characteristics of OLS models; since the errors are always heteroskedastic, it is essential to use heteroskedasticity-robust standard errors in the regression.
[Figure: OLS residuals plotted against lndadpay; the residuals range from −1.03 to 0.91.]
Note: for the linear probability model, the R² is meaningless. In the continuous OLS model, it is possible to imagine a situation where all the points are exactly on the regression line and therefore R² equals one. In the linear probability model this is not possible, since Y can only take the values 0 or 1.
What is the effect of gender on staying on, holding father's income constant?
Regression with robust standard errors      Number of obs = 15688, F(2, 15685) = 251.92, Prob > F = 0.0000, R-squared = 0.0329, Root MSE = .45542

     stilled |      Coef.   Robust Std. Err.       t     P>|t|     [95% Conf. Interval]
    lndadpay |   .1245059          .0062399    19.95     0.000      .1122751    .1367368
         sex |  -.0729526          .0072599   -10.05     0.000     -.0871829   -.0587224
       _cons |  -.0288025          .0388944    -0.74     0.459     -.1050401     .047435
Males are 7 percentage points less likely to remain past the compulsory schooling age (statistically significant).
Shortcomings of the linear probability model:
Looking back at our first figure, the linear model predicts values of the probability lower than 0 and greater than 1, which is nonsensical. Thus specific models have been developed to deal with limited dependent variables.
9.2 Probit and logit regression
Probit and logit are non-linear regressions. Because the object we are modelling, a probability, must lie between 0 and 1, it makes sense to adopt a non-linear formulation forcing the predicted values to be between 0 and 1. Cumulative probability distribution functions (cdfs) have this property:
- probit is based on the normal cdf;
- logit is based on the logistic cdf.
Say that we have an underlying specification

Y' = β0 + β1 X + ε

where Y' is not observed; all we observe is the dichotomous variable Y such that

Y = 0 if Y' ≤ Y'c
Y = 1 otherwise

9.2.1 Probit regression
The probit model has the following form:

Pr(Y = 1 | X) = Φ(β0 + β1 X)      (9.1)

where Φ is the cumulative normal distribution function.
[Figure: the probit (normal) and logit (logistic) cdfs plotted against the index.]

This function is clearly non-linear, which makes the interpretation of the estimated coefficients quite difficult.
Coefficients of probit and logit models are obtained by maximum likelihood (see below); for the moment just assume that we have estimated the following output.
. probit stilled lndadpay
Iteration 0:  log likelihood = -9729.7596
Iteration 1:  log likelihood = -9517.2802
Iteration 2:  log likelihood = -9517.12

Probit estimates      Number of obs = 15688, LR chi2(1) = 425.28, Prob > chi2 = 0.0000, Pseudo R2 = 0.0219, Log likelihood = -9517.12

     stilled |      Coef.   Std. Err.       z     P>|z|     [95% Conf. Interval]
    lndadpay |   .3601607    .0175923    20.47    0.000      .3256805    .3946409
       _cons |  -1.681938     .106433   -15.80    0.000     -1.890543   -1.473333
All we can safely say here is that lndadpay has a significant positive effect on the probability of staying on in education.
To interpret the coefficients, it is usually advocated to calculate the estimated (change in) probabilities:
1) calculate z = β̂0 + β̂1 X;
2) look up the value of Φ(z) in the normal distribution table;
3) redo steps 1 and 2 with X + 1.
In this example we find that, when log dad wage = 5: z = −1.68 + .361*5 = 0.125, so P(Y=1|X) = Φ(0.125) = 0.54.
If dad's pay is now 6, we find z = 0.486, so P = 0.68.
As in the OLS case, probit estimates will be biased if determinants of Y that are correlated with X are not included in the regression, so in general you will estimate a multivariate probit. The difficulty again comes with the interpretation of the coefficients:
1) calculate z = β̂0 + β̂1 X1 + β̂2 X2. Say you are interested in the effect of X1 on Y; you then fix X2 at a specific value (usually its sample mean);
2) look up the value of Φ(z) in the normal distribution table;
3) change X1 to X1 + 1, but keep X2 at the fixed value used in step 1;
4) look up the new value of Φ(z).
Example:
Iteration 0:  log likelihood = -9729.7596
Iteration 1:  log likelihood = -9467.9643
Iteration 2:  log likelihood = -9467.6864

Probit estimates      Number of obs = 15688, LR chi2(2) = 524.15, Prob > chi2 = 0.0000, Pseudo R2 = 0.0269, Log likelihood = -9467.6864

     stilled |      Coef.   Std. Err.       z     P>|z|     [95% Conf. Interval]
    lndadpay |   .3599531     .017623    20.43    0.000      .3254127    .3944935
         sex |  -.2109057    .0212435    -9.93    0.000     -.2525422   -.1692692
       _cons |  -1.567503    .1071921   -14.62    0.000     -1.777596   -1.357411
We are interested in the effect of gender on the probability of staying on. All we can say for the moment is that boys are significantly less likely to stay than girls. But by how much?
Traditionally, we estimate this effect for an individual with the mean characteristics on the other variables:
su lndadpay if e(sample)

    Variable |     Obs       Mean    Std. Dev.        Min        Max
    lndadpay |   15688   6.069131    .6056018   1.278584   9.497148
so for girls we have:
z=-1.567+.3599*6.069-.211*0 = 0.617
P(Y=1/girl)=73%
whilst for boys:
z=-1.567+.3599*6.069-.211*1 = 0.406
P(Y=1/boy)=66%
At the mean of dad's log pay, the difference in the probability of staying on between boys and girls is thus 7 percentage points.
If we had estimated these probabilities for poorer families (say log dad pay = 3), we would have found:
z = -1.567 + .3599×3 - .211×0 = -0.487   P(Y=1/girl) = 31%
z = -1.567 + .3599×3 - .211×1 = -0.698   P(Y=1/boy) = 24%
The difference is again 7 percentage points, but from a lower base: the gender dummy shifts the probability of staying on in post-compulsory education.
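These hand calculations can be reproduced directly after the probit. A minimal sketch, again assuming Stata's normal() function and the coefficient estimates reported above:

* gender gap in the staying-on probability, evaluated at the mean of lndadpay
display normal(-1.567503 + .3599531*6.069131)               // girls: ≈ .73
display normal(-1.567503 + .3599531*6.069131 - .2109057)    // boys:  ≈ .66

* the same gap for poorer families (log dad pay = 3)
display normal(-1.567503 + .3599531*3)                      // girls: ≈ .31
display normal(-1.567503 + .3599531*3 - .2109057)           // boys:  ≈ .24

Stata commands such as dprobit or mfx automate this kind of marginal-effect calculation, but doing it by hand makes clear exactly what is being held fixed.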
9.2.2 Logit regression
The logit model is similar to the probit model, except that the cumulative normal distribution is replaced by the cumulative logistic distribution:
Pr(Y  1 / X 1 , X 2 ,.., X k )  F (  0   1 X 1  ...   k X k )

1
1  exp[ (  0   1 X 1  ...   k X k )
As for the probit, the coefficients of the logit model are estimated by maximum likelihood. The
ML estimator is consistent and normally distributed in large samples, so that the t-statistics and
confidence intervals can be constructed in the usual way. Once again, the coefficients have no easy interpretation, and changes in the probabilities must be calculated to be informative.
Probit and logit models give almost identical predicted probabilities, except in the tails of the distribution. It can be shown that the coefficients of probit and logit models are related approximately as follows:
 l  1.66  p
. logit stilled lndadpay sex

Logit estimates                              Number of obs   =      15688
                                             LR chi2(2)      =     516.86
                                             Prob > chi2     =     0.0000
Log likelihood = -9471.3283                  Pseudo R2       =     0.0266

------------------------------------------------------------------------------
     stilled |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    lndadpay |   .5911763   .0300632    19.66   0.000     .5322536     .650099
         sex |  -.3515223    .035303    -9.96   0.000     -.420715   -.2823297
       _cons |  -2.584355   .1822928   -14.18   0.000    -2.941642   -2.227068
------------------------------------------------------------------------------
so the estimated staying-on probability for a girl whose dad is at the average log pay is:
Pr(Y=1/girl) = F(-2.584 + .591×6.07) = F(1.003)
Pr(Y=1/girl) = 1/(1 + exp(-1.003)) ≈ 73%
as was found using the probit model.
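The logit arithmetic can be checked the same way. A minimal sketch, assuming the estimates above and Stata's invlogit() function, which returns exp(z)/(1+exp(z)):

* probability of staying on for a girl at the mean of lndadpay, from the logit
display invlogit(-2.584355 + .5911763*6.069131)    // ≈ .73, as with the probit

* rough check of the rule of thumb that logit coefficients ≈ 1.66 × probit coefficients
display .3599531*1.66     // ≈ .60, against the logit coefficient of .59 on lndadpay
display -.2109057*1.66    // ≈ -.35, against the logit coefficient of -.35 on sex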
While probits are easier to understand since they rely on the normal distribution, logit models were historically quicker to compute, which explains their popularity.
9.3 Estimation of logit and probit models
Probit and logit models are non-linear and therefore cannot be estimated by OLS. Instead they are estimated by maximum likelihood.
The likelihood function is the joint probability distribution of the data, treated as a function of the unknown coefficients. The maximum likelihood estimator (MLE) consists of the values of the coefficients that maximize the likelihood function: the MLE chooses the parameter values that maximize the probability of drawing the data that are actually observed, i.e. the parameter values most likely to have generated the data.
9.3.1 MLE for n iid Bernoulli random variables
Say that we have n observations on a Bernoulli random variable. Because the observations are independently distributed, the joint distribution is the product of the individual distributions.
Thus: Pr(Y1 = y1, Y2 = y2, ..., Yn = yn) = Pr(Y1 = y1) × ... × Pr(Yn = yn)
The Bernoulli distribution is such that: Pr(Y = y) = p^y (1-p)^(1-y)
Hence:
Pr(Y1 = y1, ..., Yn = yn) = [p^y1 (1-p)^(1-y1)] × [p^y2 (1-p)^(1-y2)] × ... × [p^yn (1-p)^(1-yn)]
                          = p^(y1+...+yn) (1-p)^(n-(y1+...+yn))          (9.3)
The maximum likelihood estimator of p is the value of p that maximizes the likelihood function in (9.3).
Let S = Σi Yi (the number of 1s in the sample); the likelihood function is then:
f_Bernoulli(p; Y1, .., Yn) = p^S (1-p)^(n-S)
It is usually easier to maximize the log likelihood, which has the same maximum since log is a monotonic function. The log likelihood is:
log f(p; Y1, .., Yn) = S ln(p) + (n-S) ln(1-p)
And its derivative with respect to p is:
d ln(f)/dp = S/p - (n-S)/(1-p)
A maximum is obtained when this derivative equals 0, which gives the MLE p̂ = S/n.
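As a quick numerical illustration (the numbers here are purely hypothetical, not from the dataset used above), the Bernoulli log likelihood can be evaluated at a few candidate values of p to see that it peaks at S/n:

* hypothetical example: S = 7 successes out of n = 10 observations,
* so the MLE should be p = 7/10 = .7
local S = 7
local n = 10
foreach p in .5 .6 .7 .8 .9 {
    display "p = `p'   log likelihood = " `S'*ln(`p') + (`n'-`S')*ln(1-`p')
}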
9.3.2 MLE probit and logit
Assuming the n observations are independent, the probit log likelihood function is given by:
ln(f_probit) = Σi Yi ln[Φ(β0 + β1 X1i + ... + βk Xki)] + Σi (1 - Yi) ln[1 - Φ(β0 + β1 X1i + ... + βk Xki)]
where the sums run over i = 1, ..., n.
There is no easy way to solve this equation, so the probit likelihood function must be maximized
using a numerical algorithm.
Similarly, the logit likelihood function must be maximized numerically.
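The iteration log in the Stata output above is exactly this numerical maximization at work. As a rough check (a sketch, assuming the estimation data are still in memory and that the second probit above was run as probit stilled lndadpay sex), the reported log likelihood can be recomputed from the fitted probabilities:

* recompute the probit log likelihood from the fitted probabilities;
* it should match the value reported by Stata (-9467.6864 in the output above)
quietly probit stilled lndadpay sex
predict double phat                  // predicted Pr(Y=1/X), the default after probit
generate double lnL_i = stilled*ln(phat) + (1 - stilled)*ln(1 - phat)
quietly summarize lnL_i
display "log likelihood = " r(sum)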
The goodness of fit of an MLE model is measured by the pseudo R²:
pseudo R² = 1 - ln(f_probit^max) / ln(f_Bernoulli^max)
where f_probit^max is the value of the maximized probit likelihood and f_Bernoulli^max is the value of the maximized Bernoulli likelihood (a probit model excluding all the Xs).
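As a check against the first probit output above, Iteration 0 corresponds to the constant-only (Bernoulli) fit, with log likelihood -9729.7596, while the final log likelihood is -9517.12. So pseudo R² = 1 - (-9517.12)/(-9729.7596) ≈ 0.0219, which is exactly the value Stata reports; similarly, LR chi2(1) = 2 × (9729.7596 - 9517.12) = 425.28.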