UNC-Wilmington Department of Economics and Finance
ECN 377, Dr. Chris Dumas

Regression Analysis -- Functional Form

The basic OLS regression method requires that the regression equation be linear in the parameters. However, the equation may be either linear or nonlinear in the variables. Many different types of equations meet the requirement of being linear in the parameters, and these different equations can be used to model various types of linear and nonlinear relationships among the variables. The study of Functional Form is the study of which equation type best represents the relationships among the variables under investigation.

The Basic OLS Regression Equation -- Linear in the Variables

The basic OLS regression equation is not only linear in the parameters, it is also linear in the variables. This means that the equation should be used when we think there are linear relationships between the X variables and the Y variable. The basic (that is, linear) OLS regression equation is:

Yi = β0 + β1·X1i + β2·X2i + β3·X3i + ... + ei

[Figure: two panels. Left panel, "Relationship between Y and X1, all else held constant": a straight line with slope β1 and Y-intercept β0. Right panel, "Relationship between Y and X2, all else held constant": a straight line with slope β2 and Y-intercept β0.]

In the basic OLS regression equation, β0 is the Y-intercept, and β1, β2, etc., are slopes. For example, β1 is the slope of the graph of Y against X1, holding all else constant. Similarly, β2 is the slope of the graph of Y against X2, holding all else constant. In the basic OLS regression equation, Y changes by amount β1 when X1 increases by one unit. The other β's in the equation are interpreted in a similar way.

Regression through the Origin (avoid it)

Sometimes, based on knowledge of the situation under study, it might make sense to set the intercept, β0, equal to zero, that is, to drop it from the regression equation. If one thought that Y would be zero when all the X's were zero, then this might seem to make sense. For example, if Y were the output of cars per month and the X's were the amounts of various production inputs used to produce cars, it might make sense that you would produce zero output if you used zero inputs. If one leaves the intercept out of the equation, this is called "Regression through the Origin," because the regression line would have its intercept at Y = 0, that is, at the origin.

Usually, one should keep the intercept β0 in the equation and avoid Regression through the Origin. Regression through the Origin amounts to assuming that β0 equals zero. Instead, allow the regression analysis to determine whether or not the intercept β0 is equal to zero. If the intercept truly is zero, then the regression analysis will estimate a value for β0 that is close to zero, and a t-test of β0 will indicate that it is not significantly different from zero. Furthermore, if the intercept is dropped from the equation, then the formula used to calculate R² is not valid, so R² cannot be used to assess the goodness of fit of the regression equation. (A short SAS illustration follows.)
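In SAS, Regression through the Origin is requested with the NOINT option on the MODEL statement of PROC REG. The sketch below is a minimal illustration, using a hypothetical data set "mydata" and hypothetical variables y, x1, and x2; the second run shows the usually preferred approach of keeping the intercept.

    proc reg data=mydata;
       model y = x1 x2 / noint;   * NOINT drops the intercept (forces beta0 to be zero);
    run;

    proc reg data=mydata;
       model y = x1 x2;           * preferred: keep the intercept and check its t-test;
    run;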
Polynomial Terms in Regression Equations

In mathematics, a "term" is a part of an equation that is separated from the other parts of the equation by plus and minus signs. For example, the terms in the basic OLS regression equation below are β0, β1·X1i, β2·X2i, β3·X3i, and so on, plus the error term ei:

Yi = β0 + β1·X1i + β2·X2i + β3·X3i + ... + ei

As discussed earlier in this handout, the terms in the basic OLS regression equation above allow for linear relationships between the X variables and the Y variable. However, what should be done if we suspect nonlinear relationships between the X variables and the Y variable? For example, what if we suspect a nonlinear relationship between X1 and Y?

Nonlinear relationships between variables can be modeled in an OLS regression equation by including additional polynomial terms in the equation. Recall from basic algebra that a polynomial relationship between Y and X1, for example, is written as:

Yi = β0 + β1·X1 + β2·X1^2 + β3·X1^3 + ...

The terms in the polynomial equation above allow for "bends" (that is, nonlinearities) in the relationship between Y and X1. If the relationship between Y and X1 has zero bends, that is, if the relationship is a straight line, only the linear term from the polynomial is included in the regression equation:

Yi = β0 + β1·X1i + β2·X2i + β3·X3i + ... + ei
(only the linear term of the polynomial is included in the regression equation)

If the relationship has one bend (that is, if the relationship is a quadratic equation / parabola, or "horseshoe"), then we add the next, X-squared, term from the polynomial to the regression equation:

Yi = β0 + β1·X1i + β2·X1i^2 + β3·X2i + β4·X3i + ... + ei
(the linear and quadratic terms of the polynomial are included in the regression equation)

[Figure: two panels showing one-bend (quadratic) relationships between Y and X1.]

If the relationship has two bends (that is, if the relationship is "S-shaped"), then we add the next, X-cubed, term from the polynomial to the regression equation:

Yi = β0 + β1·X1i + β2·X1i^2 + β3·X1i^3 + β4·X2i + β5·X3i + ... + ei
(polynomial terms for an S-shaped relationship)

[Figure: two panels showing two-bend (S-shaped) relationships between Y and X1.]

If you are unsure whether the relationship between Y and an X variable is nonlinear, go ahead and include the X-squared and possibly the X-cubed terms in the regression equation. If the relationship is, in fact, nonlinear, then the β's on the X-squared and/or the X-cubed terms will be different from zero. If the relationship is linear, then the β's on the X-squared and X-cubed terms will be zero, causing the X-squared and X-cubed terms to drop out of the equation. (A short SAS illustration follows.)
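Because the squared and cubed terms are just new variables built from X1, one way to estimate such a model in SAS is to create them in a DATA step and then include them in PROC REG. A minimal sketch, with hypothetical data set and variable names:

    data mydata2;        * build the polynomial terms first;
       set mydata;
       x1sq  = x1**2;    * quadratic term (allows one bend);
       x1cub = x1**3;    * cubic term (allows two bends, the S-shape);
    run;

    proc reg data=mydata2;
       model y = x1 x1sq x1cub x2 x3;
    run;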
Reciprocal Terms in Regression Equations

Suppose you have a variable X that affects Y, but its effect on Y reaches a "ceiling" or "floor," beyond which further changes in the X variable have no further effect on Y. Such "ceiling" or "floor" relationships between an X variable and a Y variable can be captured in a regression equation by using a "reciprocal term." The reciprocal term for variable X is simply 1/X. So, if we suspected, for example, that the effect of variable X2 on Y would reach a ceiling or a floor, we would include 1/X2 in the regression equation in place of X2, as shown below:

Yi = β0 + β1·X1i + β2·(1/X2i) + β3·X3i + ... + ei

The coefficient attached to the reciprocal term (for example, β2 in the regression equation above) determines whether Y approaches a ceiling or a floor as X grows larger. If the coefficient is positive, then Y will approach a floor as X increases. If the coefficient is negative, then Y will approach a ceiling as X increases.

[Figure: two panels. Left panel (β2 < 0): Y rises toward a ceiling as X2 increases. Right panel (β2 > 0): Y falls toward a floor as X2 increases.]

The height of the floor or ceiling (as measured along the Y axis) depends on the other (non-reciprocal) terms in the regression equation. For the example above, the level of the floor or ceiling is β0 + β1·X1i + β3·X3i.

One warning regarding reciprocal terms: if the value of X in a reciprocal term is zero, then 1/X is undefined. So, any row of data where X is zero would be dropped from the sample used to estimate a regression equation that includes 1/X. (The SAS sketch below handles this case explicitly.)
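A minimal SAS sketch for a reciprocal term, again with hypothetical data set and variable names; the IF condition leaves the reciprocal missing when x2 is zero, so PROC REG drops those rows automatically:

    data mydata2;
       set mydata;
       if x2 ne 0 then recip_x2 = 1/x2;   * rows where x2 = 0 get a missing value;
    run;

    proc reg data=mydata2;
       model y = x1 recip_x2 x3;   * recip_x2 enters the model in place of x2;
    run;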
Interaction Effect Terms in Regression Equations

Suppose you believe that one of the X variables in a regression equation either increases or decreases the effect of another X variable on Y. For example, an increase in variable X1 might strengthen variable X2's effect on Y. Such "interaction effects" between two X variables can be captured by adding an interaction effect term to the regression equation. If we believe that two X variables are interacting, the interaction effect term is simply the two X variables multiplied together. For example, if we believe that X1 and X2 are interacting, we would add the term X1·X2 to the regression equation. We would also give this term a "β" coefficient, as shown below:

Yi = β0 + β1·X1i + β2·X2i + β3·X3i + β4·X1i·X2i + ... + ei
(regression equation with an interaction term for X1 and X2)

When an X variable is included in an interaction term, it has two effects on Y: a direct effect and an indirect effect. For example, in the equation above, the direct effect of variable X1 on Y is β1·X1i, and the indirect effect is β4·X1i·X2i. Through the direct effect, variable X1 directly affects Y. Through the indirect effect, variable X1 changes variable X2's effect on Y. If β4 is positive, then an increase in X1 increases the effect of X2 on Y, but if β4 is negative, then an increase in X1 decreases the effect of X2 on Y. Similarly, because variable X2 is also included in the interaction term, variable X2 also has both direct and indirect effects on Y. The direct effect of variable X2 on Y is β2·X2i, and the indirect effect is β4·X1i·X2i. Of course, the size of β4 determines the strength of the indirect, interaction effect; if β4 is small, the indirect/interaction effect is weak, and if β4 is large, the indirect/interaction effect is strong.

Just as we included an interaction term for variables X1 and X2 in the regression equation above, we could include additional interaction terms for X1 and X3, say, or X2 and X3, if we suspected interactions between these pairs of variables. If you are not certain whether or not two X variables are interacting, go ahead and include an interaction term in the regression equation. If there is not enough evidence in the data to support the existence of an interaction effect, then the regression results will tell you that the β coefficient on the interaction term is zero, causing it to "drop out" of the equation. (A short SAS illustration follows.)
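As with polynomial and reciprocal terms, an interaction term can be built in a DATA step and then included in the model. A minimal sketch with hypothetical names:

    data mydata2;
       set mydata;
       x1x2 = x1*x2;   * interaction term: the two X variables multiplied together;
    run;

    proc reg data=mydata2;
       model y = x1 x2 x3 x1x2;   * the beta on x1x2 measures the interaction effect;
    run;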
"Double-Log" or "Log-Log" or "Cobb-Douglas" Regression Equations

Recall that one of the assumptions of the OLS regression method is that the regression equation is linear in the parameters. There are many equations that are not linear in the parameters but that become linear in the parameters if they are transformed mathematically. After the transformation, these equations can be used in regression analysis. This is important, because sometimes we need these equations to represent types of nonlinear relationships that our "regular," linear regression equation can't represent.

In Economics, one of the most commonly used equations that is nonlinear in the parameters in its original form, but that can be transformed into an equation that is linear in the parameters, is the "Double-Log," or "Log-Log," or "Cobb-Douglas" equation (https://en.wikipedia.org/wiki/Cobb%E2%80%93Douglas_production_function).

The economist Paul Douglas developed the Cobb-Douglas production function in 1927 while trying to understand the relationship between national economic output, the number of workers in the economy, and the amount of capital in the economy. He spoke with mathematician and colleague Charles Cobb (https://en.wikipedia.org/wiki/Charles_Cobb_(economist)), who suggested a function of the form Y = a·L^b·K^c. . . . "The Cobb–Douglas production function is especially notable for being the first time an aggregate, or economy-wide, production function had been developed, estimated, and then presented to the profession for analysis; it marked a landmark change in how economists approached macroeconomics."

[Photo: Paul Douglas, https://en.wikipedia.org/wiki/Paul_Douglas]

The Cobb-Douglas equation is shown below:

The Cobb-Douglas Equation
Yi = β0 · X1i^β1 · X2i^β2 · X3i^β3 · ... · e^ei

(note: the base e in e^ei is the math constant e = 2.718282..., and the exponent ei is the error term)

Notice that the Cobb-Douglas equation differs from the standard, linear regression equation in that, in the Cobb-Douglas equation: (1) the X variables are multiplied together rather than added together, and (2) the β coefficient for each X variable appears as an exponent rather than simply multiplying the X variable. Of course, the Cobb-Douglas equation is useful when you suspect that the effects of the X's multiply each other (rather than add to each other) before affecting Y. Even more importantly, the Cobb-Douglas equation is useful for modeling many types of nonlinear relationships among the variables in the equation (it can also be used for linear relationships, but only those linear relationships with Y-intercepts at the origin; see the figure below). Depending on the values of the β's, the relationship between Y and one of the X's can take many different shapes, illustrated below for variable X1:

[Figure: five panels showing the shape of Y against X1 for different values of β1. β1 > 1: Y increases at an increasing rate. β1 = 1: a straight line (the Y-intercept must be at the origin). 0 < β1 < 1: Y increases at a decreasing rate, with no ceiling. β1 = 0: a horizontal line. β1 < 0: Y decreases as X1 increases, with the X and Y axes as asymptotes.]

Because the Cobb-Douglas equation can be used to model so many types of relationships between the Y and X variables, it is known as a Flexible Form (of equation). In more advanced Econometrics classes, you will study equations that are even more flexible than the Cobb-Douglas.

Unfortunately, because the Cobb-Douglas equation is not linear in the parameters, we cannot use it directly in OLS regression analysis. However, we can transform the Cobb-Douglas equation into a different form that is linear in the parameters so that we can use it in regression analysis. We transform the Cobb-Douglas equation by taking the logarithm of each side of the equation (this is why the equation is also called the "Double-Log" or "Log-Log" equation). Here, "log" means the natural logarithm.

Begin with the Cobb-Douglas equation . . .
Yi = β0 · X1i^β1 · X2i^β2 · X3i^β3 · ... · e^ei

take logs of both sides . . .
log(Yi) = log(β0 · X1i^β1 · X2i^β2 · X3i^β3 · ... · e^ei)

use the log rule "the log of a product is the sum of the logs" . . .
log(Yi) = log(β0) + log(X1i^β1) + log(X2i^β2) + log(X3i^β3) + ... + log(e^ei)

use the log rule "the log of a variable raised to an exponent is the exponent times the log of the variable" . . .
log(Yi) = log(β0) + β1·log(X1i) + β2·log(X2i) + β3·log(X3i) + ... + ei·log(e)

finally, on the very last term, use the log rule "the (natural) log of the constant e is equal to one" . . .

The Cobb-Douglas Equation in Double-Log Form
log(Yi) = log(β0) + β1·log(X1i) + β2·log(X2i) + β3·log(X3i) + ... + ei

The equation above is the Cobb-Douglas equation in Double-Log form. This form of the equation is linear in the parameters, so it can be used in regression analysis. (Although parameter β0 appears inside a log, notice that because β0 is a constant, log(β0) is also a constant, so we can think of "log(β0)" as one big parameter.)

Several notes about using the Double-Log form in regression analysis:

- You must take logs of the Y and X variables, for each individual in the sample, and use these logged Y's and X's in the regression analysis.
- If one or more of the variables in the regression equation (either the Y or the X's) is equal to zero for some individual, then that individual will be left out of the regression analysis, because the log of zero is undefined.
- The regression analysis will give you estimates of β1, β2, β3, etc., directly, but you must do a little side-calculation to find the estimate of β0. The regression analysis will give you the value of log(β0) as the intercept; you must use the EXP button on your calculator to remove the log and find the unlogged value of β0. For example:
  o the regression gives you: intercept = 2.45
  o for the Double-Log equation, this means that log(β0) = 2.45
  o you must exponentiate the 2.45 to remove the log: EXP(2.45) = 11.588
  o so, the value of β0 in the original Cobb-Douglas equation is 11.588

(A short SAS illustration follows.)
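In SAS, the logged variables can be created in a DATA step (the LOG function is the natural log), and the side-calculation for β0 can be done with the EXP function. A minimal sketch with hypothetical data set and variable names:

    data logdata;
       set mydata;
       logy  = log(y);    * LOG() is the natural log;
       logx1 = log(x1);   * any row where y or an x is zero or negative;
       logx2 = log(x2);   * gets a missing value and is dropped by PROC REG;
       logx3 = log(x3);
    run;

    proc reg data=logdata;
       model logy = logx1 logx2 logx3;   * the intercept estimates log(beta0);
    run;

    * side-calculation: if the reported intercept is 2.45,
      then beta0 = exp(2.45) = 11.588;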
In Economics, we are often interested in the Elasticity of the relationship between two variables. Recall that the definition of the Elasticity of the relationship between variables Y and X1, say, is the percentage change in Y that results from a percentage change in X1. In math, this is symbolized as:

elasticity of Y with respect to X1 = %ΔY / %ΔX1

It turns out that for the Cobb-Douglas equation, the β's are the elasticities. So, for example:

elasticity of Y with respect to X1 = β1
elasticity of Y with respect to X2 = β2
elasticity of Y with respect to X3 = β3
etc. . . .

For example, if β1 = 0.8, then a 1% increase in X1 leads to a 0.8% increase in Y, all else held constant. This fact makes the Cobb-Douglas equation popular when one is trying to estimate elasticities of demand, elasticities of supply, etc.

How Do We Decide Which Terms to Include in the Regression Equation?

Each X variable should be included in a regression equation in a way that reflects its suspected relationship with Y. The form of the suspected relationship can be based on economic theory, previous studies, past experience, the opinions of experts familiar with the situation under investigation, etc. If you suspect that X might have a nonlinear effect on Y, then include an X-squared term in the regression. If you suspect that the effect of X on Y might hit a ceiling or floor, then include a term with the reciprocal of X in the regression. If you suspect that X1 might affect X2's effect on Y, then include an interaction term between X1 and X2. If you don't know what else to do, then just include X as a linear term in the regression equation.

If you have no idea how Y and the various X's in the regression equation might be related, you can look for ideas in the sample data by examining plots of each X variable against Y, as well as plots of each X variable against every other X variable. One way to quickly generate these plots in SAS is to use PROC CORR to generate a Scatterplot Matrix (see the handout on Correlation Analysis, and the sketch at the end of this handout). If the plot of X1 against Y in the scatterplot matrix shows a U-shaped pattern, then you might want to include an X1-squared term in the regression equation. If the plot of points appears to reach a ceiling or floor, then you might want to include the reciprocal of X1 in the regression equation instead of X1.

Example: Determining Functional Form for a Regression Equation

The Price Paid by a Consumer for a Pickup Truck. (We will do this example in lecture.)
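As a starting point for an example like the pickup-truck problem, one might generate the scatterplot matrix described above. A minimal sketch, assuming ODS Graphics is available and using a hypothetical data set "trucks" with hypothetical variables price, age, and mileage:

    ods graphics on;
    proc corr data=trucks plots=matrix(histogram);
       var price age mileage;   * hypothetical variable names;
    run;
    ods graphics off;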