UNC-Wilmington
Department of Economics and Finance
ECN 377
Dr. Chris Dumas
Regression Analysis --Functional Form
The basic OLS Regression method requires that the regression equation be linear in the parameters. However,
the equation may be either linear or nonlinear in the variables. There are many different types of equations that
meet the requirement of being linear in the parameters. These different equations can be used to model various
types of linear and nonlinear relationships among the variables. The study of Functional Form is the study of
which equation type best represents the relationships among the variables under investigation.
The Basic OLS Regression Equation – Linear in the Variables
The basic OLS Regression Equation is not only linear in the parameters, it is also linear in the variables. This
means that the equation should be used when we think that there are linear relationships between the X and Y
variables. The basic (that is, linear) OLS Regression Equation is:
𝑌𝑖 = β0 + β1 ∙ 𝑋1𝑖 + β2 ∙ 𝑋2𝑖 + β3 ∙ 𝑋3𝑖 + ⋯ + 𝑒𝑖
[Figure: two graphs. Left: the relationship between Y and X1, all else held constant, a straight line with Y-intercept β0 and slope β1. Right: the relationship between Y and X2, all else held constant, a straight line with Y-intercept β0 and slope β2.]
In the basic OLS Regression Equation, β0 is the Y-intercept, and β1, β2, etc., are slopes. For example, β1 is the
slope of the graph of Y against X1, holding all else constant. Similarly, β2 is the slope of the graph of Y against
X2, holding all else constant. In the basic OLS Regression Equation, Y changes by amount β1 when X1 increases
by one unit. The other β’s in the equation are interpreted in a similar way.
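To make the slope interpretation concrete, here is a minimal Python sketch of OLS for the one-variable case, using the standard closed-form least-squares formulas (the data are made up purely for illustration; this course's examples use SAS):

```python
# A minimal sketch of OLS for the one-variable case, Y = b0 + b1*X + e,
# using the standard closed-form least-squares formulas.

def ols_simple(xs, ys):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx             # slope: change in Y when X rises one unit
    b0 = mean_y - b1 * mean_x  # Y-intercept
    return b0, b1

# Data generated from the exact line Y = 2 + 3*X,
# so OLS recovers b0 = 2 and b1 = 3:
xs = [0, 1, 2, 3, 4]
ys = [2, 5, 8, 11, 14]
b0, b1 = ols_simple(xs, ys)
```

Here β1 is recovered as the change in Y per one-unit change in X, exactly as described above.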
Regression through the Origin (avoid it)
Sometimes, based on knowledge of the situation under study, it might make sense to set the intercept, β0, equal to
zero; that is, drop it from the regression equation. If one thought that Y would be zero when all X’s were zero,
then this might seem to make sense. For example, if Y were output of cars per month and the X’s were amounts
of various production inputs used to produce cars, it might make sense that you would produce zero output if you
used zero inputs. If one leaves the intercept out of the equation, this is called “Regression through the Origin,”
because the regression line would have an intercept at Y = 0, that is, at the origin. Usually, one should allow the
intercept β0 to remain in the equation and avoid Regression through the Origin. Regression through the Origin is
basically assuming that β0 equals zero. Instead, allow the regression analysis to determine whether or not the
intercept β0 is equal to zero. If the intercept truly is zero, then the regression analysis will estimate a value for β0
that is close to zero, and a t-test of β0 would indicate that it (β0) is not significantly different from zero.
Furthermore, if the intercept is dropped from the equation, then the formula used to calculate R2 is not valid, so R2
cannot be used to assess goodness of fit of the regression equation.
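A small numerical sketch (hypothetical data) of why dropping the intercept is risky: when the true intercept is not zero, the through-origin slope estimate is badly biased.

```python
# Why forcing the intercept to zero is risky: if the true intercept is not
# zero, the through-origin slope estimate sum(x*y)/sum(x^2) is biased.
# Illustrative data from the exact line Y = 10 + 1*X (true slope = 1).

def slope_through_origin(xs, ys):
    """OLS slope when b0 is forced to zero."""
    return sum(x * yv for x, yv in zip(xs, ys)) / sum(x * x for x in xs)

xs = [1, 2, 3, 4]
ys = [10 + x for x in xs]
b1_origin = slope_through_origin(xs, ys)
# b1_origin is about 4.33, far above the true slope of 1, because the
# omitted intercept forces the fitted line to tilt up through the origin.
```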
Polynomial Terms in Regression Equations
In mathematics, a “term” is a part of an equation that is separated from the other parts of the equation by plus and minus signs. For example, the terms in the basic OLS regression equation below are β0, β1 ∙ 𝑋1𝑖, β2 ∙ 𝑋2𝑖, and so on:
𝑌𝑖 = β0 + β1 ∙ 𝑋1𝑖 + β2 ∙ 𝑋2𝑖 + β3 ∙ 𝑋3𝑖 + ⋯ + 𝑒𝑖
As discussed earlier in this handout, the terms in the basic OLS Regression Equation above allow for linear
relationships between the X variables and the Y variable. However, what should be done if we suspect nonlinear
relationships between the X variables and the Y variable? For example, what if we suspect a nonlinear
relationship between X1 and Y?
Nonlinear relationships between variables can be modeled in an OLS Regression Equation by including additional
polynomial terms in the equation. Recall from basic algebra that a polynomial relationship between Y and X1,
for example, is written as:
𝑌𝑖 = β0 + β1 ∙ 𝑋1 + β2 ∙ 𝑋1² + β3 ∙ 𝑋1³ + ⋯
The terms in the polynomial equation above allow for “bends” (that is, nonlinearities) in the relationship between
Y and X1. If the relationship between Y and X1 has zero bends, that is, if the relationship is a straight line, only
the linear term from the polynomial is included in the regression equation:
only the linear term of the polynomial is included in regression equation
𝑌𝑖 = β0 + β1 ∙ 𝑋1𝑖 + β2 ∙ 𝑋2𝑖 + β3 ∙ 𝑋3𝑖 + ⋯ + 𝑒𝑖
If the relationship has one bend (that is, if the relationship is a quadratic equation / parabola, or “horseshoe”), then
we add the next, X-squared, term from the polynomial to the regression equation:
the linear and the quadratic terms of the polynomial are included in regression equation
𝑌𝑖 = β0 + β1 ∙ 𝑋1𝑖 + β2 ∙ 𝑋1𝑖² + β3 ∙ 𝑋2𝑖 + β4 ∙ 𝑋3𝑖 + ⋯ + 𝑒𝑖
[Figure: two example graphs of one-bend (quadratic) relationships between Y and X1]
If the relationship has two bends (that is, if the relationship is “S-shaped”), then we add the next, X-cubed, term
from the polynomial to the regression equation:
𝑌𝑖 = β0 + β1 ∙ 𝑋1𝑖 + β2 ∙ 𝑋1𝑖² + β3 ∙ 𝑋1𝑖³ + β4 ∙ 𝑋2𝑖 + β5 ∙ 𝑋3𝑖 + ⋯ + 𝑒𝑖    polynomial terms for S-shaped relationship
[Figure: two example graphs of two-bend (S-shaped) relationships between Y and X1]
If you are unsure whether the relationship between Y and an X variable is nonlinear, go ahead and include the X-squared and possibly the X-cubed terms in the regression equation. If the relationship is, in fact, nonlinear, then the β’s on the X-squared and/or X-cubed terms will be different from zero. If the relationship is linear, then the β’s on the X-squared and X-cubed terms will be zero, causing those terms to drop out of the equation.
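In practice, the polynomial terms are built as extra regressor columns before estimation. A short Python sketch with hypothetical data:

```python
# Constructing polynomial terms by hand before estimation: each extra
# power of X1 becomes one more regressor column. Hypothetical data.

x1 = [1.0, 2.0, 3.0, 4.0]

x1_sq = [x ** 2 for x in x1]  # quadratic term: allows one bend
x1_cu = [x ** 3 for x in x1]  # cubic term: allows two bends (S-shape)

# One row of regressors per observation for
# Y = b0 + b1*X1 + b2*X1^2 + b3*X1^3 + e  (the leading 1.0 is for b0):
rows = [(1.0, x, x2, x3) for x, x2, x3 in zip(x1, x1_sq, x1_cu)]
```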
Reciprocal Terms in Regression Equations
Suppose you have a variable X that affects Y, but its effect on Y reaches a “ceiling” or “floor,” beyond which further changes in X have no further effect on Y. Such “ceiling” or “floor” relationships between an X variable and a Y variable can be captured in a regression equation by using a “reciprocal term.” The reciprocal term for variable X is simply 1/X. So, if we suspected, for example, that the effect of variable X2 on Y would reach a ceiling or a floor, we would include 1/X2 in the regression equation in place of X2, as shown below:
𝑌𝑖 = β0 + β1 ∙ 𝑋1𝑖 + β2 ∙ (1/𝑋2𝑖) + β3 ∙ 𝑋3𝑖 + ⋯ + 𝑒𝑖
The coefficient attached to the reciprocal term (for example, β2 in the regression equation above) determines
whether Y approaches a ceiling or a floor as X grows larger. If the coefficient is positive, then Y will approach a
floor as X increases. If the coefficient is negative, then Y will approach a ceiling as X increases.
[Figure: left, β2 < 0: Y approaches a ceiling as X2 increases; right, β2 > 0: Y approaches a floor as X2 increases]
The height of the floor or ceiling (as measured along the Y axis) depends on the other (non-reciprocal) terms in
the regression equation. For the example above, the level of the floor or ceiling is: β0 + β1 ∙ 𝑋1𝑖 + β3 ∙ 𝑋3𝑖 .
One warning regarding reciprocal terms: If the value of X in a reciprocal term is zero, then 1/X is undefined. So,
any row of data where X is zero would be dropped from the sample used to estimate a regression equation that
included 1/X.
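A quick Python sketch (hypothetical data) of preparing a reciprocal term, including the required dropping of zero-valued rows:

```python
# Preparing a reciprocal term 1/X2: any row where X2 equals zero must be
# dropped, because 1/0 is undefined. The data here are hypothetical.

x2 = [1.0, 2.0, 0.0, 4.0, 5.0]
y  = [9.0, 7.5, 6.0, 6.2, 6.1]

# Keep (Y, 1/X2) pairs only for rows with nonzero X2:
kept = [(yi, 1.0 / xi) for yi, xi in zip(y, x2) if xi != 0]
# The observation with X2 = 0 is lost from the sample.
```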
Interaction Effect Terms in Regression Equations
Suppose you believe that one of the X variables in a regression equation either increases or decreases the effect of
another X variable on Y. For example, an increase in variable X1 might strengthen variable X2’s effect on Y.
Such “interaction effects” between two X variables can be captured by adding an interaction effect term to the
regression equation. If we believe that two X variables are interacting, the interaction effect term is simply the
two X variables multiplied together. For example, if we believe that X1 and X2 are interacting, we would add the
term X1·X2 to the regression equation. We would also give this term a “β” coefficient, as shown below:
regression equation with interaction term for X1 and X2
𝑌𝑖 = β0 + β1 ∙ 𝑋1𝑖 + β2 ∙ 𝑋2𝑖 + β3 ∙ 𝑋3𝑖 + β4 ∙ 𝑋1𝑖 ∙ 𝑋2𝑖 + ⋯ + 𝑒𝑖
When an X variable is included in an interaction term, it has two effects on Y--a direct effect, and an indirect
effect. For example, in the equation above, the direct effect of variable X1 on Y is “β1 ∙ 𝑋1𝑖 ” and the indirect
effect is “β4 ∙ 𝑋1𝑖 ∙ 𝑋2𝑖 ”. Through the direct effect, variable X1 directly affects Y. Through the indirect effect,
variable X1 changes variable X2’s effect on Y. If β4 is positive, then an increase in X1 increases the effect of X2
on Y, but if β4 is negative, then an increase in X1 decreases the effect of X2 on Y. Similarly, because variable X2
is also included in the interaction term, variable X2 also has both direct and indirect effects on Y. The direct
effect of variable X2 on Y is “β2 ∙ 𝑋2𝑖 ” and the indirect effect is “β4 ∙ 𝑋1𝑖 ∙ 𝑋2𝑖 ”. Of course, the size of β4
determines the strength of the indirect, interaction effect; if β4 is small, the indirect/interaction effect is weak, and
if β4 is large, the indirect/interaction effect is strong.
Just as we included an interaction term for variables X1 and X2 in the regression equation above, we could include
additional interaction terms for X1 and X3, say, or X2 and X3, if we suspected interactions between these pairs of
variables.
If you are not certain whether or not two X variables are interacting, go ahead and include an interaction term in
the regression equation. If there is not enough evidence in the data to support the existence of an interaction
effect, then the regression results will tell you that the β coefficient on the interaction term is zero, causing it to
“drop out” of the equation.
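The "X1 changes X2's effect" idea can be written out directly: with an interaction term, the marginal effect of X2 on Y is β2 + β4·X1. A small Python sketch with hypothetical coefficient values:

```python
# With an interaction term b4*X1*X2 in the equation, the effect on Y of a
# one-unit increase in X2 is b2 + b4*X1, so it depends on the level of X1.
# The coefficient values are hypothetical, chosen only for illustration.

b2 = 2.0   # direct effect of X2 on Y
b4 = 0.5   # interaction coefficient on X1*X2

def effect_of_x2(x1):
    """Marginal effect of X2 on Y at a given level of X1."""
    return b2 + b4 * x1

# Because b4 > 0 here, a higher X1 strengthens X2's effect on Y.
```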
“Double-Log” or “Log-Log” or “Cobb-Douglas” Regression Equations
Recall that one of the assumptions of the O.L.S. regression method is that the regression equation is linear in the
parameters. There are many equations that are not linear in the parameters, but, if they are transformed
mathematically, then they become linear in the parameters. After the transformation, these equations can be used
in regression analysis. This is important, because sometimes we need these equations to represent types of
nonlinear relationships that our “regular,” linear regression equation can’t represent.
In Economics, one of the most commonly-used equations that is nonlinear in the parameters in its original form, but that can be transformed into an equation that is linear in the parameters, is the “Double-Log,” or “Log-Log,”
or “Cobb-Douglas” equation. (https://en.wikipedia.org/wiki/Cobb%E2%80%93Douglas_production_function)
The economist Paul Douglas developed the Cobb–Douglas production function in 1927 while trying to understand the relationship between national economic output, the number of workers in the economy, and the amount of capital in the economy. He spoke with mathematician and colleague Charles Cobb (https://en.wikipedia.org/wiki/Charles_Cobb_(economist)), who suggested a function of the form Y = a∙Lᵇ∙Kᶜ. "The Cobb–Douglas production function is especially notable for being the first time an aggregate, or economy-wide, production function had been developed, estimated, and then presented to the profession for analysis; it marked a landmark change in how economists approached macroeconomics."
[Photo: Paul Douglas, https://en.wikipedia.org/wiki/Paul_Douglas]
The Cobb-Douglas equation is shown below:
The Cobb-Douglas Equation
𝑌𝑖 = β0 ∙ 𝑋1𝑖^β1 ∙ 𝑋2𝑖^β2 ∙ 𝑋3𝑖^β3 ∙ … ∙ 𝑒^𝑒𝑖
(note: the base e in 𝑒^𝑒𝑖 is the math constant e = 2.718282, and the exponent 𝑒𝑖 is the error term)
Notice that the Cobb-Douglas equation is different from the standard, linear regression equation in that in the
Cobb-Douglas equation: (1) the X variables are multiplied together rather than added together and (2) the β
coefficient for each X variable appears as an exponent rather than simply multiplying the X variable.
Of course, the Cobb-Douglas equation is useful when you suspect that the effects of the X’s multiply each other (rather than add to each other) before affecting Y. Even more importantly, the Cobb-Douglas equation is useful for modeling many types of nonlinear relationships among the variables in the equation (it can also be used for linear relationships, but only those linear relationships with Y-intercepts at the origin; see the graphs below).
Depending on the values of the β’s, the relationship between Y and one of the X’s can take many different shapes,
illustrated below for variable X1:
[Figure: example shapes of the relationship between Y and X1 for different values of β1: β1 > 1 (increasing at an increasing rate); 0 < β1 < 1 (increasing at a decreasing rate, with no ceiling); β1 = 1 (a straight line, with Y-intercept at the origin); β1 = 0 (a horizontal line); β1 < 0 (decreasing, with the X and Y axes as asymptotes)]
Because the Cobb-Douglas equation can be used to model so many types of relationship between the Y and X
variables, it is known as a Flexible Form (of equation). In more advanced Econometrics classes, you will study
equations that are even more flexible than the Cobb-Douglas.
Unfortunately, because the Cobb-Douglas equation is not linear in the parameters, we cannot use it directly in O.L.S. regression analysis. However, we can transform the Cobb-Douglas equation into a different form that is linear in the parameters so that we can use it in regression analysis. We transform the Cobb-Douglas equation by taking the logarithm of each side of the equation (this is why the equation is also called the “Double-Log” or “Log-Log” equation):
Begin with the Cobb-Douglas equation . . .
𝑌𝑖 = β0 ∙ 𝑋1𝑖^β1 ∙ 𝑋2𝑖^β2 ∙ 𝑋3𝑖^β3 ∙ … ∙ 𝑒^𝑒𝑖
take logs of both sides . . .
log(𝑌𝑖) = log(β0 ∙ 𝑋1𝑖^β1 ∙ 𝑋2𝑖^β2 ∙ 𝑋3𝑖^β3 ∙ … ∙ 𝑒^𝑒𝑖)
use log rule: “the log of a product is the sum of logs” . . .
log(𝑌𝑖) = log(β0) + log(𝑋1𝑖^β1) + log(𝑋2𝑖^β2) + log(𝑋3𝑖^β3) + ⋯ + log(𝑒^𝑒𝑖)
use log rule: “the log of a variable raised to an exponent is the exponent times the log of the variable” . . .
log(𝑌𝑖) = log(β0) + β1 ∙ log(𝑋1𝑖) + β2 ∙ log(𝑋2𝑖) + β3 ∙ log(𝑋3𝑖) + ⋯ + 𝑒𝑖 ∙ log(𝑒)
finally, on the very last term, use the log rule: “the log of constant ‘e’ is equal to one” . . .
The Cobb-Douglas Equation in Double-Log Form
log(𝑌𝑖) = log(β0) + β1 ∙ log(𝑋1𝑖) + β2 ∙ log(𝑋2𝑖) + β3 ∙ log(𝑋3𝑖) + ⋯ + 𝑒𝑖
The boxed equation above is the Cobb-Douglas equation in Double-Log form. This form of the equation is linear
in the parameters, so it can be used in regression analysis (Although parameter β0 appears inside a log, notice that
since β0 is a constant, log(β0) is also a constant, so we can think of “log(β0)” as one big parameter.). Several
notes about using the Double-Log form in regression analysis:
• notice that you must take logs of the Y and X variables and use these logged Y and X’s in the regression analysis,
• for each individual in the sample, if one or more of the variables in the regression equation (either the Y or the X’s) is equal to zero, then that individual will be left out of the regression analysis, because the log of zero is undefined,
• the regression analysis will give you estimates of β1, β2, β3, etc. directly, but you must do a little side-calculation to find the estimate of β0. The regression analysis will give you the value of log(β0) as the intercept; you must use the EXP button on your calculator to remove the log and find the unlogged value of β0; for example:
  o the regression gives you: intercept = 2.45
  o for the Double-Log equation, this means that: log(β0) = 2.45
  o you must take the 2.45 and exponentiate it to remove the log: EXP(2.45) = 11.588
  o so, the value of β0 in the original Cobb-Douglas equation is 11.588
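The same back-transformation can be checked in a line of Python (2.45 is the illustrative intercept value from the example above):

```python
import math

# Back-transforming the double-log intercept: the regression reports
# log(b0), so exponentiate (the EXP button) to recover b0 itself.
# Natural logs are assumed, matching the e^e_i error term in the equation.

intercept_from_regression = 2.45
b0 = math.exp(intercept_from_regression)  # undo the log
# b0 is approximately 11.588, matching the hand calculation above
```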
In Economics, we are often interested in the Elasticity of the relationship between two variables. Recall that the
definition of the Elasticity of the relationship between variables Y and X1, say, is the percentage change in Y that
results from a percentage change in X1. In math, this would be symbolized as:
elasticity of Y with respect to X1 = %∆𝑌 / %∆𝑋1
It turns out that for the Cobb-Douglas equation, the β’s are the elasticities. So, for example,
elasticity of Y with respect to X1 = β1
elasticity of Y with respect to X2 = β2
elasticity of Y with respect to X3 = β3
etc. ...
This fact makes the Cobb-Douglas equation popular when one is trying to estimate elasticities of demand,
elasticities of supply, etc.
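The "β's are elasticities" claim can be verified numerically: for Y = β0 ∙ X^β1, a small percentage change in X changes Y by roughly β1 times that percentage. A sketch with hypothetical values β0 = 3 and β1 = 0.8:

```python
# Numerical check that a Cobb-Douglas exponent is an elasticity: for
# Y = b0 * X^b1, a small percentage change in X changes Y by roughly b1
# times that percentage. Values b0 = 3, b1 = 0.8 are hypothetical.

b0, b1 = 3.0, 0.8

def y(x):
    return b0 * x ** b1

x = 2.0
dx = 0.0001 * x  # a 0.01% increase in X
pct_change_y = (y(x + dx) - y(x)) / y(x)
pct_change_x = dx / x
elasticity = pct_change_y / pct_change_x
# elasticity is approximately b1 = 0.8, regardless of the starting x
```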
How Do We Decide Which Terms to Include in the Regression Equation?
Each X variable should be included in a regression equation in a way that reflects its suspected relationship with
Y. The form of the suspected relationship can be based on Economic theory, previous studies, past experience, the opinions of experts familiar with the situation under investigation, and so on. If you suspect that X might have a nonlinear effect on Y, then include an X-squared term in the regression. If you suspect that the effect of X on Y might hit a ceiling or floor, then include a term with the reciprocal of X in the regression. If you suspect that X1 might affect X2’s effect on Y, then include an interaction term between X1 and X2. If you don’t know what else to do, then just include X as a linear term in the regression equation.
If you have no idea about how Y and the various X’s in the regression equation might be related, you can look for
ideas in the sample data by examining plots of each X variable against Y, as well as plots of each X variable
against every other X variable. One way to quickly generate these plots using SAS is to use PROC CORR to
generate a Scatterplot Matrix (see the handout on Correlation Analysis). If the plot of X1 against Y in the
scatterplot matrix shows a U-shaped pattern, then you might want to include an X1-squared term in the regression
equation. If the plot of points appears to reach a ceiling or floor, then you might want to include the reciprocal of
X1 in the regression equation instead of X1.
Example: Determining Functional Form for a Regression Equation
The Price Paid by a Consumer for a Pickup Truck. (We will do this example in lecture.)