
Econometrics

ECONOMETRICS OUTLINE
RANDOM VOCAB
● Polynomial models​ - are a great tool for determining which input factors drive
responses and in what direction. A quadratic (second-order) polynomial model
for two explanatory variables has the form y = b0 + b1x1 + b2x2 + b3x1^2 +
b4x2^2 + b5x1x2 + u. The single x-terms are called the main effects.
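The quadratic form above can be sketched by fitting it with OLS on simulated data (all variable names and coefficient values below are illustrative, not from the text):

```python
import numpy as np

# Simulate data from a quadratic model with two explanatory variables:
# main effects, squared terms, and an interaction term.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = (1.0 + 2.0 * x1 - 1.5 * x2 + 0.5 * x1**2 + 0.25 * x2**2
     + 0.75 * x1 * x2 + rng.normal(scale=0.1, size=n))

# Design matrix: intercept, main effects, squares, interaction.
X = np.column_stack([np.ones(n), x1, x2, x1**2, x2**2, x1 * x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # estimates should be close to [1.0, 2.0, -1.5, 0.5, 0.25, 0.75]
```

Note that the model stays linear in the parameters even though the variables enter with powers, which is why plain OLS applies.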
● The​ F-test for overall significance​ has the following two hypotheses:
○ The null hypothesis states that the model with no independent variables fits
the data as well as your model.
○ The alternative hypothesis says that your model fits the data better than the
intercept-only model.
○ Compare the p-value for the F-test to your significance level. If the p-value
is less than the significance level, your sample data provide sufficient
evidence to conclude that your regression model fits the data better than the
model with no independent variables.
○ This finding is good news because it means that the independent variables
in your model improve the fit!
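The comparison described above can be made concrete by computing the overall F-statistic by hand on simulated data (a hypothetical example; the numbers are illustrative):

```python
import numpy as np
from scipy import stats

# Simulated data: y genuinely depends on two regressors, so the F-test
# should reject the intercept-only model.
rng = np.random.default_rng(1)
n, k = 200, 2
x = rng.normal(size=(n, k))
y = 3.0 + x @ np.array([1.0, -2.0]) + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])        # full model with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
ssr = np.sum((y - X @ beta) ** 2)           # residual sum of squares
sst = np.sum((y - y.mean()) ** 2)           # total SS (intercept-only model)

# F compares the fit of the full model against the intercept-only model.
F = ((sst - ssr) / k) / (ssr / (n - k - 1))
p_value = stats.f.sf(F, k, n - k - 1)
print(F, p_value)  # a tiny p-value: reject the intercept-only null
```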
● Latent variables ​- (from “lie hidden,” as opposed to observable variables) are
variables that are not directly observed but are rather inferred (through a
mathematical model) from other variables that are observed (directly measured).
● Endogenous variables​ are used in ​econometrics​ and sometimes in ​linear
regression​. They are similar to (but not exactly the same as) ​dependent variables​.
Endogenous variables have values that are determined by other variables in the
system (these “other” variables are called exogenous variables).
○ Example: ​Let’s suppose a manufacturing plant produces a certain amount
of white sugar. The amount of product (white sugar) is the endogenous
variable and is dependent on any number of other variables which may
include weather, pests, price of fuel etc. As the amount of sugar is entirely
dependent on the other factors in the system, it’s said to be purely
endogenous. However, in real life purely endogenous variables are a
rarity; it’s more likely that endogenous variables are only partially
determined by exogenous factors. For example, sugar production is
affected by pests, and pests are affected by weather. Therefore, pests in this
particular system are partially endogenous and partially exogenous.
○ Seasonality​ is a characteristic of a time series in which the data experiences
regular and predictable changes that recur every calendar year. Any
predictable fluctuation or pattern that recurs or repeats over a one-year
period is said to be seasonal.
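One standard way to handle seasonality in a regression is to add seasonal dummy variables; a small pandas sketch (the dates and column names are hypothetical):

```python
import pandas as pd

# Quarterly index for a hypothetical series.
idx = pd.period_range("2015Q1", "2019Q4", freq="Q")
df = pd.DataFrame({"sales": range(len(idx))}, index=idx)

# One dummy per quarter; drop_first avoids the dummy-variable trap
# (perfect collinearity with the intercept).
dummies = pd.get_dummies(df.index.quarter, prefix="q", drop_first=True)
print(dummies.head())
```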
○ A spurious relationship or spurious correlation ​is a ​mathematical
relationship​ in which two or more events or variables are ​associated​ but ​not
causally related​, due to either coincidence or the presence of a certain third,
unseen factor (referred to as a "common response variable", "confounding
factor", or "​lurking variable​").
○ An​ instrumental variable ​(sometimes called an “instrument” variable) is a
third ​variable​, Z, used in regression analysis when you have ​endogenous
variables​—variables that are influenced by other variables in the model. In
other words, you use it to account for unexpected behavior between
variables. Using an instrumental variable to identify the hidden
(unobserved) correlation allows you to see the true correlation between the
explanatory variable and ​response variable​, Y.
■ Example​: Let’s say you had two ​correlated ​variables that you
wanted to regress: X and Y. Their correlation might be described by
a third variable Z, which is associated with X in some way. Z is also
associated with Y but only through Y’s direct association with X. For
example, let’s say you wanted to investigate the link between
depression (X) and smoking (Y). Lack of job opportunities (Z) could
lead to depression, but it is only associated with smoking through
its association with depression (i.e. there isn’t a direct correlation
between lack of job opportunities and smoking). This third variable,
Z (lack of job opportunities), can generally be used as an
instrumental variable if it can be measured and its behavior can be
accounted for.
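The logic of instrumental variables can be illustrated with a hand-rolled two-stage least squares (2SLS) sketch on simulated data (a simplified illustration under assumed parameter values, not a production implementation):

```python
import numpy as np

# z is the instrument: it moves x but affects y only through x.
rng = np.random.default_rng(2)
n = 2000
z = rng.normal(size=n)
u = rng.normal(size=n)                       # unobserved confounder
x = 0.8 * z + u + rng.normal(size=n)         # x is endogenous (correlated with u)
y = 1.0 + 2.0 * x + u + rng.normal(size=n)   # true effect of x on y is 2.0

# Naive OLS of y on x is biased upward because of u.
X = np.column_stack([np.ones(n), x])
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Stage 1: regress x on the instrument z, keep fitted values.
Z = np.column_stack([np.ones(n), z])
x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]

# Stage 2: regress y on the fitted values from stage 1.
X2 = np.column_stack([np.ones(n), x_hat])
b_iv = np.linalg.lstsq(X2, y, rcond=None)[0]
print(b_ols[1], b_iv[1])   # OLS slope is biased; IV slope is near 2.0
```

In practice one would use a dedicated estimator with proper standard errors; the two manual stages above only show where the identification comes from.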
● The ordinary least squares (OLS) technique is the most popular method of
performing regression analysis and estimating econometric models, because in
standard situations (meaning the model satisfies a series of statistical
assumptions) it produces optimal (the best possible) results.
● Classical linear regression model (CLRM) assumptions are the following:
○ The model parameters are linear, meaning the regression coefficients don’t
enter the function being estimated as exponents (although the variables can
have an exponent).
○ The values for the independent variables are derived from a random sample
of the population, and they contain variability
○ The explanatory variables don’t have perfect collinearity (that is, no
independent variable can be expressed as a linear function of any other
independent variables)
○ The error term has a zero conditional mean, meaning that the average error
is zero at any specific value of the independent variables.
○ The model has no heteroskedasticity (meaning the variance of the error is
the same regardless of the independent variable’s value).
○ The model has no autocorrelation (the error term doesn’t exhibit a
systematic relationship over time).
● We have already seen in Chapters 3 and 5 that omitting a key variable can cause
correlation between the error and some of the explanatory variables, which
generally leads to bias and inconsistency in all of the OLS estimators. In the
special case that the omitted variable is a function of an explanatory variable in the
model, the model suffers from functional form misspecification
● In Chapter 8, we dealt with one failure of the Gauss-Markov assumptions. While
heteroskedasticity in the errors can be viewed as a problem with a model, it is a
relatively minor one. The presence of heteroskedasticity does not cause bias or
inconsistency in the OLS estimators. Also, it is fairly easy to adjust confidence
intervals and t and F statistics to obtain valid inference after OLS estimation, or
even to get more efficient estimators by using weighted least squares.
CHAPTER 9
● In this chapter, we return to the much more serious problem of correlation between
the error, u, and one or more of the explanatory variables.​ Remember from
Chapter 3 that if u is, for whatever reason, correlated with the explanatory
variable xj, then we say that xj is an​ endogenous explanatory variable​.​ We also
provide a more detailed discussion on three reasons why an explanatory variable
can be endogenous; in some cases, we discuss possible remedies.
● In Section 9-3, we derive and explain the bias in OLS that can arise under certain
forms of measurement error. Additional data problems are discussed in Section
9-4.
● All of the procedures in this chapter are based on OLS estimation. As we will see,
certain problems that cause correlation between the error and some explanatory
variables cannot be solved by using OLS on a single cross section. We postpone a
treatment of alternative estimation methods until Part 3.
○ FUNCTIONAL FORM MISSPECIFICATION
■ We may have a model that is correctly specified, in terms of
including the appropriate explanatory variables, yet commit
functional form misspecification, in which the model does not
properly account for the relationship between dependent and
observed explanatory variables.
■ We may, of course, use the tools already developed to deal with
these problems, in the sense that if we first estimate a general model
that allows for powers, interaction terms, etc. and then “test down”
with joint F tests, we can be confident that the more specific model
we develop will not have imposed inappropriate restrictions along
the way. But how can we consider the possibility that there are
missing elements even in the context of our general model?
■ One quite useful approach to a general test for functional form
misspecification is Ramsey’s RESET (regression specification
error test).
● The idea behind RESET is quite simple; if we have properly
specified the model, no nonlinear functions of the
independent variables should be significant when added to
our estimated equation.
● The RESET formulation re-estimates the original equation,
augmented by powers of ŷ (usually squares, cubes, and
fourth powers are sufficient) and conducts an F-test for the
joint null hypothesis that those variables have no significant
explanatory power.
● RESET should not be considered a general test for omission
of relevant variables; it is a test for misspecification of the
relationship between y and the x values in the model, and
nothing more.
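The RESET procedure described above can be sketched by hand: fit the base model, augment it with powers of the fitted values, and F-test the added terms. Data here are simulated, with a deliberately quadratic truth so the linear fit should fail the test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 300
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(size=n)   # nonlinear truth

X = np.column_stack([np.ones(n), x])                  # misspecified linear model
beta = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta
ssr_r = np.sum((y - y_hat) ** 2)                      # restricted SSR

# Augment with yhat^2 and yhat^3, refit, and F-test the added terms.
X_aug = np.column_stack([X, y_hat**2, y_hat**3])
beta_a = np.linalg.lstsq(X_aug, y, rcond=None)[0]
ssr_u = np.sum((y - X_aug @ beta_a) ** 2)             # unrestricted SSR

q, df_u = 2, n - X_aug.shape[1]
F = ((ssr_r - ssr_u) / q) / (ssr_u / df_u)
p_value = stats.f.sf(F, q, df_u)
print(F, p_value)   # tiny p-value: the linear functional form is rejected
```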
○ Proxy Variables
■ What is a proxy variable? What are the conditions for a proxy
variable to be valid in regression analysis? A​ proxy variable is one
that is used to represent the influence of an unobserved (and
important) explanatory variable. There are two conditions for
the validity of a proxy.
● The first is that the zero conditional mean assumption holds
for all explanatory variables (including the unobserved and
the proxy).
● The second is that the conditional mean of the unobserved,
given other explanatory variables and the proxy, only depends
on the proxy.
■ For instance, admissions officers use SAT scores and high school
GPAs as proxies for applicants’ ability and intelligence. No one
argues that standardized tests or grade point averages are actually
measuring aptitude, or intelligence; but there are reasons to believe
that the observable variable is well correlated with the unobservable,
or latent, variable. To what extent will a model estimated using such
proxies for the variables in the underlying relationship be successful,
in terms of delivering consistent estimates of its parameters? First, of
course, it must be established that there is a correlation between the
observable variable and the latent variable…..
CHAPTER 10 (TIME SERIES DATA)
● Chapter 10 covers basic regression analysis and gives attention to problems unique
to ​time series data. ​We provide a set of Gauss-Markov and classical linear model
assumptions for time series applications. The problems of functional form, dummy
variables, trends, and seasonality are also discussed​.
○ What is a ​time series​? Merely a sequence of observations on some
phenomenon observed at regular intervals. Those intervals may
correspond to the passage of calendar time (e.g. annual, quarterly,
monthly data) or they may reflect an economic process that is
irregular in calendar time (such as business-daily data). In either
case, our observations may not be available for every point in time
(for instance, there are days when a given stock does not trade on
the exchange).
○ We often speak of a time series as a ​stochastic process​, or time series
process, focusing on the concept that there is some mechanism generating
that process, with a random component.
○ TYPES OF TIME SERIES REGRESSION MODELS
■ Static Model (simplest)
● Each observation is modeled as depending only on
contemporaneous​ (same-period) values of the explanatory
variables.
● In many contexts, we find a static model inadequate to
reflect what we consider to be the relationship between
explanatory variables and those variables we wish to explain.
● If we were to model capital investment with a static model,
we would be omitting relevant explanatory variables: the
prior values of the causal factors. These omissions would
cause our estimates of the static model to be biased and
inconsistent. Thus, we must use some form of distributed ​lag
model​ to ​express the relationship between current and past
values of the explanatory variables and the outcome.
● CONCERNS
○ However, the analysis of individual lag coefficients is
often hampered–especially at higher frequencies such
as quarterly and monthly data–by high autocorrelation
in the series. That is, the values of the series are
closely related to each other over time. If this is the
case, then many of the individual coefficients in a FDL
regression model may not be distinguishable from
zero. This does not imply, though, that the sum of
those coefficients (i.e. the long run multiplier) will be
imprecisely estimated. We may get a very precise
value for that effect, even if its components are highly
intercorrelated.
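The point about the long-run multiplier can be checked with a small simulation: a finite distributed lag model with a highly persistent regressor, where the individual lag coefficients are noisy but their sum is recovered accurately (all numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
T = 400
x = np.empty(T)
x[0] = rng.normal()
for t in range(1, T):
    x[t] = 0.95 * x[t - 1] + rng.normal(scale=0.3)   # highly autocorrelated x

# True FDL model: y_t = 1 + 0.5*x_t + 0.3*x_{t-1} + 0.2*x_{t-2} + u_t,
# so the true long-run multiplier is 0.5 + 0.3 + 0.2 = 1.0.
y = np.full(T, 1.0)
y[2:] += 0.5 * x[2:] + 0.3 * x[1:-1] + 0.2 * x[:-2]
y += rng.normal(scale=0.5, size=T)

# OLS on the sample with the first two observations dropped (lost to lags).
X = np.column_stack([np.ones(T - 2), x[2:], x[1:-1], x[:-2]])
beta = np.linalg.lstsq(X, y[2:], rcond=None)[0]
lrm = beta[1:].sum()          # estimated long-run multiplier
print(beta[1:], lrm)          # lrm should be close to 1.0
```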
○ In Stata, we use the tsset command to identify the date
variable (which must contain the calendar dates over
which the data are measured), and construct lags and
first differences taking these constraints into account
(for instance, a lagged value of a variable will be set
to a missing value where it is not available).
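A rough pandas analogue of that Stata workflow (tsset plus the lag and difference operators), in which lags are set to missing where unavailable, might look like this (the series is hypothetical):

```python
import pandas as pd

# Declare a quarterly time index, then build lags and first differences.
idx = pd.period_range("2020Q1", "2021Q4", freq="Q")
df = pd.DataFrame({"gdp": [100.0, 102.0, 101.0, 104.0,
                           106.0, 105.0, 108.0, 110.0]}, index=idx)

df["gdp_lag1"] = df["gdp"].shift(1)   # like L.gdp: NaN in the first quarter
df["gdp_d1"] = df["gdp"].diff(1)      # like D.gdp: first difference
print(df)
```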
● A time trend is a variable which is equal to the time index in a
given year (if your sample includes years 2000-2010, then the
time trend variable equals 1 for 2000, 2 for 2001, etc.). It
allows us to control for an exogenous increase in the dependent
variable which is not explained by the other variables.
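Coding the trend exactly as described (1 for the first sample year, 2 for the second, and so on, over 2000-2010) is a one-liner:

```python
import numpy as np

# Linear time trend for an annual 2000-2010 sample.
years = np.arange(2000, 2011)
trend = years - years[0] + 1        # 1 for 2000, 2 for 2001, ..., 11 for 2010
print(dict(zip(years.tolist(), trend.tolist())))
```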
● Spurious ​- spurious relationship or spurious correlation is a
mathematical relationship in which two or more events or
variables are associated but not causally related, due to either
coincidence or the presence of a certain third, unseen factor
CHAPTER 11/12: Further Issues in Using OLS with Time Series Data; Serial Correlation
and Heteroskedasticity in Time Series regressions
● Covariate​ - characteristics (excluding the actual treatment)​ of the participants
in an experiment. If you collect data on characteristics before you run an
experiment, you could use that data to see how your treatment affects different
groups or ​populations​. Or, you could use that data to control for the influence of
any covariate. It can be an ​independent variable​ (i.e. of direct interest) or it can
be an unwanted, confounding variable.​ Adding a covariate to a model can
increase the ​accuracy ​of your results.
● SERIAL CORRELATION and Durbin Watson test
○ A test for first-order serial correlation; it is valid only when the model
does not include a lagged version of the dependent variable among the
independent variables.
○ The statistic is near 2 when there is no serial correlation; a value well
below 2 suggests positive serial correlation, so we reject the null and
conclude that serial correlation is present.
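The Durbin-Watson statistic can be computed directly from the OLS residuals; a sketch on simulated AR(1) errors, where the statistic should fall well below 2 (illustrative numbers only):

```python
import numpy as np

rng = np.random.default_rng(5)
T = 300
x = rng.normal(size=T)
u = np.empty(T)
u[0] = rng.normal()
for t in range(1, T):
    u[t] = 0.7 * u[t - 1] + rng.normal()   # AR(1) errors with rho = 0.7
y = 1.0 + 2.0 * x + u

# OLS residuals, then DW = sum of squared residual changes / SSR.
X = np.column_stack([np.ones(T), x])
e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)   # DW is roughly 2*(1 - rho_hat)
print(dw)   # well below 2 here, signaling positive serial correlation
```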
● Test for HETEROSKEDASTICITY
○ Breusch-Pagan Test​ - ​The Breusch-Pagan test is designed to detect any linear
form of heteroskedasticity.
○ Breusch-Pagan / Cook-Weisberg tests the null hypothesis that the error variances
are all equal versus the alternative that the error variances are a multiplicative
function of one or more variables. For example, in the default form of the hettest
command shown above, ​the alternative hypothesis states that the error variances
increase (or decrease) as the predicted values of Y increase, e.g. the bigger the
predicted value of Y, the bigger the error variance is​. ​A large chi-square would
indicate that heteroskedasticity was present. In this example, the chi-square
value was small, indicating heteroskedasticity was probably not a problem
(or at least that if it was a problem, it wasn’t a multiplicative function of the
predicted values).
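A manual sketch of the Breusch-Pagan idea (regress the squared OLS residuals on the regressors and use LM = n·R², which is chi-square under the null of homoskedasticity); the data are simulated so that the error variance grows with x:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 500
x = rng.uniform(1, 5, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=x, size=n)   # heteroskedastic errors

# OLS residuals from the main regression.
X = np.column_stack([np.ones(n), x])
e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

# Auxiliary regression of squared residuals on the regressors.
e2 = e ** 2
g = np.linalg.lstsq(X, e2, rcond=None)[0]
r2 = 1 - np.sum((e2 - X @ g) ** 2) / np.sum((e2 - e2.mean()) ** 2)
lm = n * r2
p_value = stats.chi2.sf(lm, df=1)                 # df = number of regressors
print(lm, p_value)   # large LM / tiny p-value: heteroskedasticity detected
```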
Endogenous variable: A factor in a causal model or causal system whose value is determined
by the states of other variables in the system; contrasted with an ​exogenous variable​. Related
but non-equivalent distinctions are those between dependent and independent variables and
between explanandum and explanans. A factor can be classified as endogenous or exogenous
only relative to a specification of a model representing the causal relationships producing the
outcome y among a set of causal factors X (x1, x2, …, xk), where y = M(X). A variable xj is said to be
endogenous within the causal model M if its value is determined or influenced by one or more of
the independent variables ​X ​(excluding itself). A purely endogenous variable is a factor that is
entirely determined by the states of other variables in the system. (If a factor is purely
endogenous, then in theory we could replace the occurrence of this factor with the functional
form representing the composition of x​j as a function of X.) In real causal systems, however,
there can be a range of endogeneity. Some factors are causally influenced by factors within the
system but also by factors not included in the model. So a given factor may be partially
endogenous and partially exogenous—partially but not wholly determined by the values of other
variables in the model.
Consider a simple causal system—farming. The outcome we are interested in explaining (the
dependent variable or the explanandum) is crop output. Many factors (independent variables,
explanans) influence crop output: labor, farmer skill, availability of seed varieties, availability of
credit, climate, weather, soil quality and type, irrigation, pests, temperature, pesticides and
fertilizers, animal practices, and availability of traction. These variables are all causally relevant
to crop yield, in a specifiable sense: if we alter the levels of these variables over a series of tests,
the level of crop yield will vary as well (up or down). These factors have real causal influence
on crop yield, and it is a reasonable scientific problem to attempt to assess the nature and weight
of the various factors. We can also notice, however, that there are causal relations among some
but not all of these factors. For example, the level of pest infestation is influenced by rainfall and
fertilizer (positively) and pesticide, labor, and skill (negatively). So pest infestation is partially
endogenous within this system—and partially exogenous, in that it is also influenced by factors
that are external to this system (average temperature, presence of pest vectors, decline of
predators, etc.).
The concept of endogeneity is particularly relevant in the context of time series analysis of
causal processes. It is common for some factors within a causal system to be dependent for their
value in period ​n on the values of other factors in the causal system in period ​n-1​. Suppose that
the level of pest infestation is independent of all other factors within a given period, but is
influenced by the level of rainfall and fertilizer in the preceding period. In this instance it would
be correct to say that infestation is exogenous within the period, but endogenous over time.
Exogenous variable​ (see also ​endogenous variable​): A factor in a causal model or causal
system whose value is independent from the states of other variables in the system; a factor
whose value is determined by factors or variables outside the causal system under study. For
example, rainfall is exogenous to the causal system constituting the process of farming and crop
output. There are causal factors that determine the level of rainfall—so rainfall is endogenous to
a weather model—but these factors are not themselves part of the causal model we use to explain
the level of crop output. As with endogenous variables, the status of the variable is relative to
the specification of a particular model and causal relations among the independent variables. An
exogenous variable is by definition one whose value is wholly causally independent from other
variables in the system. So the category of “exogenous” variable is contrasted to those of
“purely endogenous” and “partially endogenous” variables. A variable can be made
endogenous by incorporating additional factors and causal relations into the model. There are
causal and statistical interpretations of exogeneity. The causal interpretation is primary, and
defines exogeneity in terms of the factor’s causal independence from the other variables
included in the model. The statistical or econometric concept emphasizes non-correlation
between the exogenous variable and the other independent variables included in the model. If x​j
is exogenous to a matrix of independent variables X (excluding x​j​), then if we perform a
regression of x​j​ against X (excluding x​j​), we should expect coefficients of 0 for each variable in
X (excluding x​j​). Normal regression models assume that all the independent variables are
exogenous.
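The statistical notion of exogeneity in the last paragraph can be illustrated with a quick simulation: regressing an independently generated variable on the other regressors should yield slope coefficients near zero (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000
X = rng.normal(size=(n, 2))     # the other independent variables
z = rng.normal(size=n)          # candidate exogenous variable, generated
                                # independently of X

# Regress z on X (with intercept); exogeneity implies near-zero slopes.
Xc = np.column_stack([np.ones(n), X])
coef = np.linalg.lstsq(Xc, z, rcond=None)[0]
print(coef[1:])   # slope coefficients close to 0
```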