UNC-Wilmington Department of Economics and Finance
ECN 377
Dr. Chris Dumas

Limited Dependent Variable Regression Models: The Probit Model

Recall that one of the assumptions of the OLS regression model is that the dependent (Y) variable is a continuous measurement variable. The values of a continuous measurement variable are not limited: the variable can take on fractional/decimal values, and it can range (potentially) from -∞ to +∞.

What happens when this assumption is violated? That is, what happens when the values of the dependent (Y) variable in the model are limited in some way? For example, perhaps the values of the Y variable are limited to a range smaller than -∞ to +∞, or perhaps the Y variable is categorical, taking discrete values that label various categories (green, blue, red, for example), or perhaps the Y variable is a binary/dummy/indicator variable, taking only 0 or 1 as possible values. In all of these cases, we need to use a Limited Dependent (Y) Variable Model. There are several other names for limited dependent variable models, including Categorical Models, Qualitative Choice Models, Binary Choice Models, etc., depending on how the values of the dependent (Y) variable are limited.

Problems with OLS Regression When the Dependent Variable is Limited

What happens if we ignore the fact that our dependent (Y) variable is limited, and we go ahead and use OLS to estimate a regression model anyway? Several bad things:

1. The OLS estimates of the β's are often biased.
2. If the β's are biased, they are also inconsistent (the bias remains even when we increase sample size).
3. The error term (ehats) will be heteroskedastic (the heteroskedasticity problem).
4. When we use the OLS model to make predictions, we can obtain predictions greater than 100% or less than 0%, which make no sense.

We can correct the heteroskedasticity problem by using weighted least squares (WLS) regression, but we would still have the other problems. So, when we have a limited dependent variable situation, we need to avoid OLS regression and use a modified/adjusted regression model, such as the . . . (drum roll) . . . Probit model!

The Probit Regression Model

In this handout, we explore the simplest example of a limited dependent variable model, one in which the Y variable is a dummy/indicator variable, taking only 0 or 1 as possible values. This model is called the Probit model.[1]

There are many situations in the real world in which you might want to build a model to predict a Y variable that takes only two possible values. For example, many survey questions have only two possible responses, such as “yes” or “no”, or “agree” or “disagree.” Also, many questions in economics, finance and marketing involve only two possible answers, such as “raise interest rates” or “don’t raise interest rates,” “buy the stock” or “don’t buy the stock,” “consumer will buy the shampoo” or “consumer won’t buy the shampoo,” etc. In such situations, if we wanted to build a model to predict which of the two possible responses was chosen by an individual in the population, and which factors (X variables) affect that choice, then we could use a Probit model.

In the Probit model, instead of constructing a model to directly predict whether the Y variable takes on a value of 0 or 1, we predict the probability (chances) that Y = 1; that is, the model attempts to predict “Prob(Y=1)”.
If, for example, the model predicts that Prob(Y=1) = 0.20, then, naturally, Prob(Y=0) = 0.80, because the Y variable can only take on the values of 0 and 1, so Prob(Y=1) and Prob(Y=0) must add to 1.00. This corresponds to the fact that 20% plus 80% must add to 100%.

[1] There is another model that can be used in this situation, called the Logit model, but it is similar to the Probit model.

In the Probit model, we want to explain how various independent X variables affect Prob(Y=1), and we want to predict Prob(Y=1). We begin by defining an index variable, Xindex, based on the X variables in the model plus an error term. As an example, let's consider a model with variables X1 and X2.

Xindex = β0 + β1X1 + β2X2 + e

Then, we say that if Xindex becomes greater than zero, then Y = 1; otherwise, Y = 0. We are using Xindex as an indicator of whether or not Y will be equal to 1. So, we're saying:

If Xindex > 0, then Y = 1.
Otherwise, Y = 0.

So, when is Xindex > 0? It depends on the error term, e. Remember that the error term is assumed to have a standard normal, bell-shaped distribution, centered on zero, as shown in the graph below.

[Figure: bell curve of the error term e. Vertical axis: frequency(e). Horizontal axis: e, with 0 and the point Xindex = β0 + β1X1 + β2X2 marked. The shaded area to the left of that point is the probability that e < β0 + β1X1 + β2X2.]

Xindex > 0 exactly when e > -(β0 + β1X1 + β2X2). Because the bell curve is symmetric around zero, the probability that e > -(β0 + β1X1 + β2X2) is the same as the probability that e < β0 + β1X1 + β2X2. So, all we need to do is to find β0 + β1X1 + β2X2 on the e axis and then ask ourselves, what is the probability that e < β0 + β1X1 + β2X2? Recall that the area under a distribution curve is a probability. So, the area under the curve to the left of β0 + β1X1 + β2X2 gives the probability that e < β0 + β1X1 + β2X2. This is the probability that Y = 1. So, if we can find this area, then we've found Prob(Y=1).

(From here on, whenever Xindex is plugged into a formula or located on a graph, it means the part of the index without the error term, β0 + β1X1 + β2X2, as in the graph label above.)

There is a formula that gives the shaded area in the graph above: the cumulative distribution function (cdf) of the standard normal distribution, usually denoted “F”, shown below. Because this formula gives the shaded area, it also gives Prob(Y=1):
$$\text{Prob}(Y=1) = F(X_{index}) = \int_{-\infty}^{X_{index}} \frac{1}{\sqrt{2\pi}} \exp\left(-0.5\,e^{2}\right) de$$

Note: F is the integral of a standard normal “bell-curve” equation (where "e" in the equation above is the value of e along the horizontal axis of the bell curve graph above).

Yes, the formula above is pretty crazy. To understand it, let’s examine its graph below:

[Figure: graph of F(Xindex), rising from 0 to 1. Vertical axis: F(Xindex), with 0, 0.5, 0.89 and 1 marked. Horizontal axis: e, with 0 and Xindex = 1.21 marked.]

In the graph above, the horizontal axis shows the possible values of the error term, e. The value of Xindex is also shown on the e axis. F(Xindex) is graphed on the vertical axis, and F(Xindex) can range from 0 to 1. The value of Xindex is fed into the crazy F equation, which turns it into a value of F(Xindex) between 0 and 1 along the vertical axis. This F(Xindex) value is Prob(Y=1). That is, Prob(Y=1) is given by F(Xindex).

For example, in the graph above, suppose that we plug the β's and X’s into the Xindex equation and calculate Xindex = 1.21. We find this value along the horizontal axis in the graph, and then we read the height up to the F graph. For Xindex = 1.21, this height is 0.89. The value 0.89 is the value of F(Xindex). Because Prob(Y=1) is equal to F(Xindex), we know that Prob(Y=1) is 0.89. Therefore, the probability is 89% that the value of Y = 1. Thus, naturally, the probability that Y = 0 is 11%, because 100% - 89% = 11%. Hence, Prob(Y=0) = 0.11. That is, Prob(Y=0) is given by [1 - F(Xindex)].

Importantly, the values of Prob(Y=1) and Prob(Y=0) in the example above resulted from the particular values of the X’s that we plugged into β0 + β1X1 + β2X2. If we change the value of one or more of the X’s that we plug into β0 + β1X1 + β2X2, then we will get different answers for Prob(Y=1) and Prob(Y=0). This is the way in which the values of the X’s affect the values of Prob(Y=1) and Prob(Y=0).
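The graph lookup described above can also be done numerically. What follows is a minimal Python sketch, not part of the handout's SAS workflow. The function names `normal_cdf` and `prob_y1` are made up for illustration, and the β values and X values are hypothetical numbers chosen only so that the index works out to the example value of 1.21.

```python
import math

def normal_cdf(z):
    """Standard normal cdf F(z), written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_y1(betas, xs):
    """Prob(Y=1) = F(b0 + b1*X1 + b2*X2 + ...); betas[0] is the intercept."""
    index = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))
    return normal_cdf(index)

# Hypothetical coefficients and X values (not estimates from any data),
# picked so that the index equals the example value 1.21.
betas = [0.25, 0.20, -0.12]
p1 = prob_y1(betas, [6, 2])
print(round(p1, 2), round(1 - p1, 2))  # Prob(Y=1) and Prob(Y=0)

# Changing an X changes the index, and therefore the probability.
print(round(prob_y1(betas, [3, 2]), 2))
```

Lowering X1 from 6 to 3 in this sketch lowers the index from 1.21 to 0.61, so the predicted Prob(Y=1) falls as well.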
Maximum Likelihood Estimation: Finding the Values of the β's in the Probit Model

Because the Y variable only takes 0/1 values in the Probit model, violating one of the assumptions of the OLS regression model, we can’t use OLS regression to estimate the β's of the Probit model. Instead, we use “Maximum Likelihood Estimation” (MLE). In MLE, we find the β's that maximize the probability of obtaining the data sample that we actually obtained when we collected our data. For example, suppose we collect the following data sample:

Individual in sample    Y       X1      X2
1                       1       6       2
2                       1       1       10
3                       0       4       12
4                       1       5       1
5                       0       2       8
etc.                    etc.    etc.    etc.

When doing MLE, we ask ourselves: “Given the values of the X’s that are in my sample, what values of the β's would maximize my chances of getting the values of Y that are in my sample?” These are the “most likely” values of the β's; these are the MLE estimates of the β's.

To find the MLE estimates of the β's, we first find an equation that gives the probability of obtaining the Y values in our sample, given the X's in our sample, and then we find the values of the β's that maximize the probability of getting those Y's.

Recall from the basic laws of probability that, if the probability that Y = 1 for the first individual in our sample is Prob(Y1=1), and the probability that Y = 1 for the second individual in our sample is Prob(Y2=1), then the probability that Y = 1 for both individuals at the same time is Prob(Y1=1)*Prob(Y2=1), assuming that the two individuals are independent of one another (which we assume). In the same way, the probability that three individuals have Y = 1 at the same time would be three Prob(Y=1)’s multiplied together, and so on, for any number of individuals. In the same way, the probability that Y = 0 for every individual in a group of any size would be all of those individuals’ Prob(Y=0)’s multiplied together.
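The multiplication rule above can be sketched in Python for the five individuals listed in the sample table. This is an illustration only: the two sets of β values tried below are hypothetical, the helper names are not from the handout, and real estimation software searches over β values rather than comparing two guesses.

```python
import math

def normal_cdf(z):
    """Standard normal cdf F(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# (Y, X1, X2) for the five individuals in the sample table above.
sample = [(1, 6, 2), (1, 1, 10), (0, 4, 12), (1, 5, 1), (0, 2, 8)]

def sample_probability(b0, b1, b2, rows):
    """Probability of observing exactly these Y values, assuming the
    individuals are independent of one another."""
    prob = 1.0
    for y, x1, x2 in rows:
        p1 = normal_cdf(b0 + b1 * x1 + b2 * x2)   # Prob(Y=1) for this individual
        prob *= p1 if y == 1 else (1.0 - p1)      # use Prob(Y=0) when Y is 0
    return prob

# Two hypothetical sets of beta values. The set that gives the larger
# probability is the "more likely" of the two for this sample.
flat = sample_probability(0.0, 0.0, 0.0, sample)    # every Prob(Y=1) is 0.5
trial = sample_probability(1.0, 0.1, -0.2, sample)
print(flat, trial)
```

With all β's equal to zero, every individual has Prob(Y=1) = 0.5, so the probability of the five observed Y values is 0.5 multiplied by itself five times. The second set of β values makes the observed Y's more probable than that, which is the kind of comparison MLE carries out across all possible β values.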
Now, take all of the individuals in the sample that have Y = 1 and multiply their Prob(Y=1)’s together, then take all of the individuals in the sample with Y = 0 and multiply their Prob(Y=0)’s together, and then (the grand finale) multiply the two products by each other. This gives the probability of getting the Y’s (both the 1’s and the 0’s) that we actually have in the sample. Recall that Prob(Y=1) is given by F(Xindex), and Prob(Y=0) is given by [1 - F(Xindex)]. Then, the probability of getting the actual Y’s in the data sample is given by the Likelihood Function:

The Likelihood Function

$$L = \text{Prob}(Y_1=1, Y_2=1, Y_3=0, Y_4=1, Y_5=0, \text{etc.}) = \prod_{Y_i=1} F(X_{index}) \cdot \prod_{Y_i=0} \left[1 - F(X_{index})\right]$$

where the capital Greek letter pi, “Π”, means “multiply together.” So, the first product means “multiply together all the F(Xindex) values for all of the individuals in the sample who have Y = 1,” and the second product means “multiply together all of the [1 - F(Xindex)] values for all of the individuals in the sample who have Y = 0.” Each individual has his or her own value of Xindex, calculated from that individual's X values.

Next, because Xindex has β's inside it (recall that Xindex = β0 + β1X1 + β2X2 + ...), we could try various values of the β's in the Likelihood Function equation until we find the values of the β's that maximize Prob(Y1=1, Y2=1, Y3=0, Y4=1, Y5=0, etc.). (Typically, this is done by first taking the natural log of the equation, then using the “Classical Calculus Method” to find the first order conditions (FOC’s), and then finding the values of the β's that solve the FOC’s.) The values of the β's that come out of this process are the “Maximum Likelihood Estimates” of the β's for the Probit model. The Maximum Likelihood Estimates are the values of the β's that maximize the chances of getting the Y's that we actually have in our sample, given the X's that we actually have in our sample.

Prediction with the Probit Model

Actually, this is pretty simple.
Recall that the purpose of the Probit model is to predict Prob(Y=1) for various values of the X variables. After we find the MLE estimates of the β's, we plug them into Xindex = β0 + β1X1 + β2X2 + ..., along with the values of the X’s that we are interested in. After plugging in, we get a number for Xindex. Then, we take the Xindex number and plug it into the F function (or find it on the graph of the F function) to find F(Xindex). Finally, recall that Prob(Y=1) is equal to F(Xindex), so we have found our prediction for Prob(Y=1).

Marginal Effects in the Probit Model

A “Marginal Effect” is the change in Prob(Y=1) that occurs when we change the value of an X variable by one unit (a marginal amount). Each X variable has its own, different, marginal effect on Prob(Y=1). As an example, let’s discuss the marginal effect of X1 on Prob(Y=1). In the Probit model, the marginal effect of X1 on Prob(Y=1) is given by:

$$\frac{\partial\,\text{Prob}(Y=1)}{\partial X_1} = \beta_1 \cdot \left[\text{slope of } F(X_{index})\right]$$

The slope of F at a given value of Xindex is the height of the standard normal bell curve at that value, so the marginal effect depends on the values of the X's at which it is calculated. In a similar way, the marginal effect of X2 on Prob(Y=1) would be given by the same equation, but with β1 replaced by β2.

Testing Significance of the Overall Model

In OLS regression, we use the F-test to test the statistical significance of the model as a whole. In the Probit model, we use a Likelihood Ratio Test to accomplish a similar hypothesis test. The Likelihood Ratio Test is a test of the following hypotheses:

H0: all β's (other than the intercept) are zero (none of the X’s help to predict Prob(Y=1))
H1: one or more of the β's is not zero (one or more of the X’s helps to predict Prob(Y=1))

In a Likelihood Ratio Test, we compare an LRtest number derived from the sample data to an LRcritical number from the chi-square (χ2) table.
The formula for LRtest is:

$$LR_{test} = 2 \cdot \ln\left(\frac{L_U}{L_R}\right) = 2 \cdot \left[\ln(L_U) - \ln(L_R)\right]$$

where
ln(LU) = log of the maximized value of the Likelihood Function
ln(LR) = n·[P·ln(P) + (1-P)·ln(1-P)]
n = the number of individuals in the sample, and
P = the proportion of individuals in the sample with Y = 1

LU is the likelihood (chances) of getting the Y values in the sample when all β's in the model take the values that maximize L = Prob(Y1=1, Y2=1, Y3=0, Y4=1, Y5=0, etc.). The "U" stands for "unrestricted." LR is the likelihood (chances) of getting the Y values in the sample (for example, Y1=1, Y2=1, Y3=0, Y4=1, Y5=0, etc.) when all β's in the model (except the intercept) are restricted to the value zero. Typically, computer software programs will provide either ln(LU) or LRtest in the output of Probit model results.

LRcritical is found in the chi-square (χ2) table using d.f. = k - 1, where k is the number of β's in the model (including the intercept). This is a one-sided test.

As is typical in hypothesis testing:

If LRtest > LRcritical, then Reject H0 and Accept H1.
If LRtest < LRcritical, then Accept H0 and Reject H1.

Measuring Goodness of Fit of the Probit Model

In OLS regression, we use R2 to measure “Goodness of Fit,” that is, how well the regression equation fits the data points. There are many measures of Goodness of Fit for limited dependent variable models such as the Probit model. We will consider here a commonly-used measure of “Goodness of Fit” for the Probit model called McFadden’s R2. The formula for McFadden’s R2 is:

R2McFadden = 1 - [ln(LU) / ln(LR)]

The value of R2McFadden lies between 0 and 1, with a value of 0 indicating that none of the X variables in the model helps to predict Prob(Y=1). The larger the value of R2McFadden, the better the model fits the data. Sometimes R2McFadden is called the Likelihood Ratio Index (LRI), because its formula contains a ratio of (log) likelihoods.

Probit Models in SAS

In SAS, there are several different Procs that can be used to estimate Probit models.
In this handout, we use PROC QLIM to estimate a Probit model. (The "QLIM" stands for "qualitative and limited" dependent variable models, that is, models that have a Y variable that can take only qualitative or limited values.) In the example program below, the data from dataset cleanNCcounties.xls are used to estimate a Probit model in which the dependent (Y) variable is a 0/1 variable that indicates whether or not a North Carolina county voted Republican in the year 2000 presidential election. The independent (X) variables are the percent of county residents in poverty (PctInPov), the county unemployment rate (UnempRate), an urbanization index (UrbanIndex), the share of county residents age 65+ (OldIndex), the share of county residents who are college graduates (EducIndex), and an index of manufacturing employment (EmpManfIndex).

/* SOFTWARE: SAS Statistical Software program, version 9.2 */
/* AUTHOR: Dr. Chris Dumas, UNC-Wilmington, April, 2015. */
/* TITLE: Probit model with marginal effects regression */

options helpbrowser=sas;
options number pageno=1 nodate nolabel font="SAS Monospace" 10;
options leftmargin=1.00 in rightmargin=1.00 in topmargin=1.00 in bottommargin=1.00 in;

/* Read the Excel file into SAS dataset01. */
proc import datafile="v:\ECN377\cleanNCcounties.xls"
    dbms=xls out=dataset01 replace;
run;

/* Build the index variables used as X variables in the model. */
data dataset02;
    set dataset01;
    UrbanIndex = UrbanPop/PopCens;
    OldIndex = Age65More/PopCens;
    EducIndex = CoColGrads/PopCens;
    EmpManfIndex = (EmpManf2000/PopCens)*100;
run;

proc means data=dataset02 n mean max min;
    var VoteRepub PctInPov UnempRate UrbanIndex OldIndex EducIndex EmpManfIndex;
run;

/* In the PROC QLIM command below, the "discrete" option tells SAS that the
   Y variable has discrete values rather than limited continuous values.
   The "marginal" option in the output command tells SAS to calculate and
   save the marginal effects in dataset03. */
/* Note: McFadden's LRI is McFadden's R-square */
proc qlim data=dataset02;
    model VoteRepub = PctInPov UnempRate UrbanIndex OldIndex EducIndex EmpManfIndex / discrete;
    output out=dataset03 marginal;
run;

/* The PROC MEANS command below uses the marginal effects that were
   calculated for each observation and variable by the PROC QLIM command
   above to calculate the mean marginal effect for each variable. Recall
   that the "Marginal Effect" of variable X is the change in Prob(Y=1) that
   occurs when we change the value of X by one unit (a marginal amount).
   There is a different marginal effect for each X variable. */
proc means data=dataset03 n mean;
    var Meff_P2_PctInPov Meff_P2_UnempRate Meff_P2_UrbanIndex Meff_P2_OldIndex Meff_P2_EducIndex Meff_P2_EmpManfIndex;
run;
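
The averaging of marginal effects in the last PROC MEANS step can be mirrored by hand. The following is a minimal Python sketch under stated assumptions, not a reproduction of PROC QLIM: the β values and the three data rows are made up for illustration (they are not estimates or values from cleanNCcounties.xls), the function names are hypothetical, and the slope of F is taken to be the standard normal density, which is the derivative of the standard normal cdf.

```python
import math

def normal_pdf(z):
    """Height of the standard normal bell curve at z (the slope of F at z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def marginal_effects(betas, xs):
    """Marginal effect of each X on Prob(Y=1) for one observation:
    beta_j times the slope of F at the index b0 + b1*X1 + b2*X2 + ..."""
    index = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))
    slope = normal_pdf(index)
    return [b * slope for b in betas[1:]]

# Hypothetical coefficients (intercept first) and hypothetical observations.
betas = [0.5, -0.04, 0.10]
rows = [[20.0, 3.0], [12.0, 5.0], [30.0, 4.0]]

# One list of marginal effects per observation.
per_obs = [marginal_effects(betas, xs) for xs in rows]

# Average each variable's marginal effect across the observations.
mean_effects = [sum(col) / len(col) for col in zip(*per_obs)]
print(mean_effects)
```

Because the slope of F differs from one observation to the next, each observation gets its own marginal effect for each X variable, which is why the effects are averaged across observations before they are reported.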