ECONOMETRIC OUTLINE: RANDOM VOCAB
● Polynomial models are a great tool for determining which input factors drive responses and in what direction. A quadratic (second-order) polynomial model for two explanatory variables has the form
y = β0 + β1x1 + β2x2 + β3x1^2 + β4x2^2 + β5x1x2 + u
The single x-terms (x1 and x2) are called the main effects. (A Stata sketch of this specification, and of the overall F-test described next, appears at the end of this vocabulary list.)
● The F-test for overall significance has the following two hypotheses:
○ The null hypothesis states that the model with no independent variables (the intercept-only model) fits the data as well as your model.
○ The alternative hypothesis says that your model fits the data better than the intercept-only model.
○ Compare the p-value for the F-test to your significance level. If the p-value is less than the significance level, your sample data provide sufficient evidence to conclude that your regression model fits the data better than the model with no independent variables.
○ This finding is good news because it means that the independent variables in your model improve the fit!
● Latent variables ("lie hidden," as opposed to observable variables) are variables that are not directly observed but are instead inferred (through a mathematical model) from other variables that are observed (directly measured).
● Endogenous variables are used in econometrics and sometimes in linear regression. They are similar to (but not exactly the same as) dependent variables. Endogenous variables have values that are determined by other variables in the system (these "other" variables are called exogenous variables).
○ Example: Suppose a plant produces a certain amount of white sugar. The amount of product (white sugar) is the endogenous variable and depends on any number of other variables, which may include weather, pests, the price of fuel, etc. As the amount of sugar is entirely dependent on the other factors in the system, it is said to be purely endogenous. However, in real life purely endogenous variables are a rarity; it is more likely that endogenous variables are only partially determined by exogenous factors. For example, sugar production is affected by pests, and pests are affected by weather. Therefore, pests in this particular system are partially endogenous and partially exogenous.
○ Seasonality is a characteristic of a time series in which the data experience regular and predictable changes that recur every calendar year. Any predictable fluctuation or pattern that recurs or repeats over a one-year period is said to be seasonal.
○ A spurious relationship or spurious correlation is a mathematical relationship in which two or more events or variables are associated but not causally related, due either to coincidence or to the presence of a certain third, unseen factor (referred to as a "common response variable," "confounding factor," or "lurking variable").
○ An instrumental variable (sometimes called an "instrument") is a third variable, Z, used in regression analysis when you have endogenous variables (variables that are influenced by other variables in the model). In other words, you use it to account for unexpected behavior between variables. Using an instrumental variable to identify the hidden (unobserved) correlation allows you to see the true correlation between the explanatory variable and the response variable, Y.
■ Example: Let's say you had two correlated variables that you wanted to regress: X and Y. Their correlation might be described by a third variable Z, which is associated with X in some way. Z is also associated with Y, but only through Y's direct association with X. For example, suppose you wanted to investigate the link between depression (X) and smoking (Y). Lack of job opportunities (Z) could lead to depression, but it is only associated with smoking through its association with depression (i.e., there isn't a direct correlation between lack of job opportunities and smoking). This third variable Z (lack of job opportunities) can generally be used as an instrumental variable if it can be measured and its behavior can be accounted for.
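To make the instrumental-variables idea concrete, here is a minimal Stata sketch based on the depression/smoking example above; the variable names (smoking, depression, jobopps) are hypothetical stand-ins for Y, X, and Z.

    * Two-stage least squares: jobopps (Z) instruments for the
    * endogenous regressor depression (X); smoking is the response (Y)
    ivregress 2sls smoking (depression = jobopps)
    * Durbin/Wu-Hausman test of whether depression is in fact endogenous
    estat endogenous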
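And here is the minimal sketch of the quadratic (second-order) polynomial model and the overall F-test from the top of this list, again with hypothetical variables y, x1, and x2. The header of Stata's regress output reports the overall F statistic and its p-value (Prob > F).

    * Construct the squared and interaction terms for the second-order model
    gen x1sq = x1^2
    gen x2sq = x2^2
    gen x1x2 = x1*x2
    * Fit the full quadratic; x1 and x2 are the main effects, and the
    * output header reports the F-test of this model against the
    * intercept-only model
    regress y x1 x2 x1sq x2sq x1x2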
The ordinary least squares (OLS) technique
● OLS is the most popular method of performing regression analysis and estimating econometric models because, in standard situations (meaning the model satisfies a series of statistical assumptions), it produces optimal (the best possible) results.
● The classical linear regression model (CLRM) assumptions are the following:
○ The model parameters are linear, meaning the regression coefficients don't enter the function being estimated as exponents (although the variables can have exponents).
○ The values for the independent variables are derived from a random sample of the population, and they contain variability.
○ The explanatory variables don't have perfect collinearity (that is, no independent variable can be expressed as a linear function of any other independent variables).
○ The error term has zero conditional mean, meaning that the average error is zero at any specific value of the independent variables.
○ The model has no heteroskedasticity (meaning the variance of the error is the same regardless of the independent variables' values).
○ The model has no autocorrelation (the error term doesn't exhibit a systematic relationship over time).
● In Chapter 8, we dealt with one failure of the Gauss-Markov assumptions. While heteroskedasticity in the errors can be viewed as a problem with a model, it is a relatively minor one. The presence of heteroskedasticity does not cause bias or inconsistency in the OLS estimators. Also, it is fairly easy to adjust confidence intervals and t and F statistics to obtain valid inference after OLS estimation, or even to get more efficient estimators by using weighted least squares.
CHAPTER 9
● In this chapter, we return to the much more serious problem of correlation between the error, u, and one or more of the explanatory variables. Remember from Chapter 3 that if u is, for whatever reason, correlated with the explanatory variable xj, then we say that xj is an endogenous explanatory variable. We also provide a more detailed discussion of three reasons why an explanatory variable can be endogenous; in some cases, we discuss possible remedies.
● We have already seen in Chapters 3 and 5 that omitting a key variable can cause correlation between the error and some of the explanatory variables, which generally leads to bias and inconsistency in all of the OLS estimators. In the special case that the omitted variable is a function of an explanatory variable in the model, the model suffers from functional form misspecification.
● In Section 9-3, we derive and explain the bias in OLS that can arise under certain forms of measurement error.
● Additional data problems are discussed in Section 9-4.
● All of the procedures in this chapter are based on OLS estimation. As we will see, certain problems that cause correlation between the error and some explanatory variables cannot be solved by using OLS on a single cross section. We postpone a treatment of alternative estimation methods until Part 3.
○ FUNCTIONAL FORM MISSPECIFICATION
■ We may have a model that is correctly specified in terms of including the appropriate explanatory variables, yet commit functional form misspecification, in which the model does not properly account for the relationship between the dependent and observed explanatory variables.
■ We may, of course, use the tools already developed to deal with these problems, in the sense that if we first estimate a general model that allows for powers, interaction terms, etc., and then "test down" with joint F tests, we can be confident that the more specific model we develop will not have imposed inappropriate restrictions along the way. But how can we consider the possibility that there are missing elements even in the context of our general model?
■ One quite useful approach to a general test for functional form misspecification is Ramsey's RESET (regression specification error test). (A minimal Stata sketch appears after the proxy-variable discussion below.)
● The idea behind RESET is quite simple: if we have properly specified the model, no nonlinear functions of the independent variables should be significant when added to our estimated equation.
● The RESET formulation re-estimates the original equation, augmented by powers of ŷ (usually squares, cubes, and fourth powers are sufficient), and conducts an F-test for the joint null hypothesis that those variables have no significant explanatory power.
● RESET should not be considered a general test for omission of relevant variables; it is a test for misspecification of the relationship between y and the x values in the model, and nothing more.
○ Proxy Variables
■ What is a proxy variable? What are the conditions for a proxy variable to be valid in regression analysis? A proxy variable is one that is used to represent the influence of an unobserved (and important) explanatory variable. There are two conditions for the validity of a proxy.
● The first is that the zero conditional mean assumption holds for all explanatory variables (including the unobserved variable and the proxy).
● The second is that the conditional mean of the unobserved variable, given the other explanatory variables and the proxy, depends only on the proxy.
■ For instance, admissions officers use SAT scores and high school GPAs as proxies for applicants' ability and intelligence. No one argues that standardized tests or grade point averages actually measure aptitude or intelligence, but there are reasons to believe that the observable variable is well correlated with the unobservable, or latent, variable. To what extent will a model estimated using such proxies for the variables in the underlying relationship be successful, in terms of delivering consistent estimates of its parameters? First, of course, it must be established that there is a correlation between the observable variable and the latent variable.
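A minimal Stata sketch of the RESET procedure described above, assuming hypothetical variables y, x1, and x2; estat ovtest implements Ramsey's RESET using powers of the fitted values.

    * Estimate the original specification
    regress y x1 x2
    * Ramsey RESET: joint F-test on powers of the fitted values
    * (H0: the functional form has no omitted nonlinearities)
    estat ovtest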
CHAPTER 10 (TIME SERIES DATA)
● Chapter 10 covers basic regression analysis and gives attention to problems unique to time series data. We provide a set of Gauss-Markov and classical linear model assumptions for time series applications. The problems of functional form, dummy variables, trends, and seasonality are also discussed.
○ What is a time series? Merely a sequence of observations on some phenomenon observed at regular intervals. Those intervals may correspond to the passage of calendar time (e.g., annual, quarterly, or monthly data) or they may reflect an economic process that is irregular in calendar time (such as business-daily data). In either case, our observations may not be available for every point in time (for instance, there are days when a given stock does not trade on the exchange).
○ We often speak of a time series as a stochastic process, or time series process, focusing on the concept that there is some mechanism generating that process, with a random component.
○ TYPES OF TIME SERIES REGRESSION MODELS
■ Static model (simplest)
● Each observation is modeled as depending only on contemporaneous (same-period) values of the explanatory variables.
● In many contexts, we find a static model inadequate to reflect what we consider to be the relationship between explanatory variables and those variables we wish to explain.
● If we were to model capital investment with a static model, we would be omitting relevant explanatory variables: the prior values of the causal factors. These omissions would cause our estimates of the static model to be biased and inconsistent. Thus, we must use some form of distributed lag model to express the relationship between current and past values of the explanatory variables and the outcome.
● CONCERNS
○ However, the analysis of individual lag coefficients is often hampered, especially at higher frequencies such as quarterly and monthly data, by high autocorrelation in the series. That is, the values of the series are closely related to each other over time. If this is the case, then many of the individual coefficients in a finite distributed lag (FDL) regression model may not be distinguishable from zero. This does not imply, though, that the sum of those coefficients (i.e., the long-run multiplier) will be imprecisely estimated. We may get a very precise value for that effect, even if its components are highly intercorrelated.
○ In Stata, we use the tsset command to identify the date variable (which must contain the calendar dates over which the data are measured) and construct lags and first differences taking these constraints into account (for instance, a lagged value of a variable will be set to a missing value where it is not available). (See the sketch at the end of this section.)
● A time trend is a variable equal to the time index in a given year (if your sample includes the years 2000-2010, then the time trend variable equals 1 for 2000, 2 for 2001, etc.). It allows us to control for an exogenous increase in the dependent variable that is not explained by the other variables.
● Spurious correlation: see the definition in the vocabulary list above; an association between two or more events or variables that is not causal, arising from coincidence or from a third, unseen factor.
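A minimal Stata sketch tying together tsset, the lag operators, a time trend, and a finite distributed lag regression; year, y, and x are hypothetical variable names, and the trend construction assumes the 2000-2010 sample from the time-trend example above.

    * Declare the time variable so lag and difference operators work;
    * lags are set to missing where unavailable
    tsset year
    * Linear time trend: 1 for 2000, 2 for 2001, and so on
    gen t = year - 1999
    * Finite distributed lag model with two lags and a trend
    regress y x L.x L2.x t
    * Long-run multiplier: the sum of the current and lagged coefficients,
    * which can be precisely estimated even when each lag is imprecise
    lincom x + L.x + L2.x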
CHAPTER 11/12: Further Issues in Using OLS with Time Series Data; Serial Correlation and Heteroskedasticity in Time Series Regressions
● Covariates are characteristics (excluding the actual treatment) of the participants in an experiment. If you collect data on characteristics before you run an experiment, you can use those data to see how your treatment affects different groups or populations, or you can use them to control for the influence of any covariate. A covariate can be an independent variable (i.e., of direct interest) or it can be an unwanted, confounding variable. Adding a covariate to a model can increase the accuracy of your results.
● SERIAL CORRELATION and the Durbin-Watson test
○ The Durbin-Watson statistic tests for first-order serial correlation; it is valid only for models that do not include a lagged version of the dependent variable as a regressor.
○ A statistic near 2 is consistent with no serial correlation; a value well below 2 leads to rejecting the null of no serial correlation in favor of positive first-order serial correlation. (See the sketch below.)
● Tests for HETEROSKEDASTICITY
○ Breusch-Pagan test: designed to detect any linear form of heteroskedasticity.
○ Breusch-Pagan / Cook-Weisberg tests the null hypothesis that the error variances are all equal versus the alternative that the error variances are a multiplicative function of one or more variables. For example, in the default form of Stata's hettest command, the alternative hypothesis states that the error variances increase (or decrease) as the predicted values of Y increase; e.g., the bigger the predicted value of Y, the bigger the error variance. A large chi-square would indicate that heteroskedasticity is present. In the example these notes draw on, the chi-square value was small, indicating heteroskedasticity was probably not a problem (or at least that, if it was a problem, it wasn't a multiplicative function of the predicted values).
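A minimal Stata sketch of both tests on a tsset time series, with hypothetical variables y, x1, and x2.

    regress y x1 x2
    * Durbin-Watson d statistic (requires tsset data); values well below 2
    * point to positive first-order serial correlation
    estat dwatson
    * Breusch-Pagan / Cook-Weisberg test; the default alternative is that
    * the error variance is a function of the fitted values of y, and a
    * large chi-square rejects the null of equal error variances
    estat hettest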
Endogenous variable: A factor in a causal model or causal system whose value is determined by the states of other variables in the system; contrasted with an exogenous variable. Related but non-equivalent distinctions are those between dependent and independent variables and between explanandum and explanans. A factor can be classified as endogenous or exogenous only relative to a specification of a model representing the causal relationships producing the outcome y among a set of causal factors X = (x1, x2, …, xk), with y = M(X). A variable xj is said to be endogenous within the causal model M if its value is determined or influenced by one or more of the independent variables X (excluding itself). A purely endogenous variable is a factor that is entirely determined by the states of other variables in the system. (If a factor is purely endogenous, then in theory we could replace the occurrence of this factor with the functional form representing the composition of xj as a function of X.) In real causal systems, however, there can be a range of endogeneity. Some factors are causally influenced by factors within the system but also by factors not included in the model. So a given factor may be partially endogenous and partially exogenous: partially but not wholly determined by the values of other variables in the model.
Consider a simple causal system: farming. The outcome we are interested in explaining (the dependent variable or the explanandum) is crop output. Many factors (independent variables, explanans) influence crop output: labor, farmer skill, availability of seed varieties, availability of credit, climate, weather, soil quality and type, irrigation, pests, temperature, pesticides and fertilizers, animal practices, and availability of traction. These variables are all causally relevant to crop yield, in a specifiable sense: if we alter the levels of these variables over a series of tests, the level of crop yield will vary as well (up or down). These factors have real causal influence on crop yield, and it is a reasonable scientific problem to attempt to assess the nature and weight of the various factors. We can also notice, however, that there are causal relations among some but not all of these factors. For example, the level of pest infestation is influenced by rainfall and fertilizer (positively) and by pesticide, labor, and skill (negatively). So pest infestation is partially endogenous within this system, and partially exogenous, in that it is also influenced by factors that are external to this system (average temperature, presence of pest vectors, decline of predators, etc.).
The concept of endogeneity is particularly relevant in the context of time series analysis of causal processes. It is common for some factors within a causal system to be dependent for their value in period n on the values of other factors in the causal system in period n-1. Suppose that the level of pest infestation is independent of all other factors within a given period but is influenced by the level of rainfall and fertilizer in the preceding period. In this instance it would be correct to say that infestation is exogenous within the period, but endogenous over time.
Exogenous variable (see also endogenous variable): A factor in a causal model or causal system whose value is independent of the states of the other variables in the system; a factor whose value is determined by factors or variables outside the causal system under study. For example, rainfall is exogenous to the causal system constituting the process of farming and crop output. There are causal factors that determine the level of rainfall (so rainfall is endogenous to a weather model), but these factors are not themselves part of the causal model we use to explain the level of crop output. As with endogenous variables, the status of the variable is relative to the specification of a particular model and the causal relations among the independent variables. An exogenous variable is by definition one whose value is wholly causally independent of the other variables in the system. So the category of "exogenous" variable is contrasted with those of "purely endogenous" and "partially endogenous" variables. A variable can be made endogenous by incorporating additional factors and causal relations into the model.
There are causal and statistical interpretations of exogeneity. The causal interpretation is primary, and defines exogeneity in terms of the factor's causal independence from the other variables included in the model. The statistical or econometric concept emphasizes non-correlation between the exogenous variable and the other independent variables included in the model. If xj is exogenous to a matrix of independent variables X (excluding xj), then a regression of xj against the variables in X should yield coefficients of 0 for each of them (see the sketch below). Normal regression models assume that all the independent variables are exogenous.
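A minimal Stata sketch of the statistical exogeneity idea just described (an illustration of the definition, not a standard named test): regress the candidate variable on the other regressors and jointly test that their coefficients are zero. The variable names xj, x1, x2, and x3 are hypothetical.

    * If xj is statistically exogenous to x1-x3, these coefficients
    * should be indistinguishable from zero
    regress xj x1 x2 x3
    * Joint F-test that all listed coefficients equal zero
    testparm x1 x2 x3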