Econometric Methods II (EBC2120), Lecture 6
Contact details: Martin Schumann, m.schumann@maastrichtuniversity.nl (office number currently unknown)

Binary choice models

Last lecture, we discussed Maximum Likelihood estimation, which is often used to estimate nonlinear models. Nonlinear models arise frequently when the response variable is not continuous but rather "jumps". In binary choice models, we observe a binary variable

Y_i = 1 if "success" (i is employed, bought something, etc.), and Y_i = 0 if "failure".

Our goal is to model Pr(Y_i = 1 ∣ X_i), the probability of success conditional on some observed covariates X_i. Since Y_i is Bernoulli-distributed,

E[Y_i ∣ X_i] = 0 ⋅ Pr(Y_i = 0 ∣ X_i) + 1 ⋅ Pr(Y_i = 1 ∣ X_i) = Pr(Y_i = 1 ∣ X_i).

Motivation: latent variable

Binary choice models can also be motivated using a latent variable, i.e. an unobserved variable Y_i* on which the observed dependent variable Y_i depends. We assume that Y_i* follows a linear model:

Y_i* = X_i′θ + ϵ_i,  i = 1, ..., n,

where Y_i* for instance represents firm i's expected returns from innovation, while X_i contains observed firm characteristics. A researcher may only be able to observe whether or not firm i conducts innovation, i.e. we observe

Y_i = 1 if Y_i* = X_i′θ + ϵ_i > 0, and Y_i = 0 if Y_i* = X_i′θ + ϵ_i ≤ 0.

Let's assume that the model error has a distribution F(⋅) that is symmetric around 0 (centering at 0 is without loss of generality if X_i contains a constant regressor) and continuous. Then,

Pr(Y_i = 1 ∣ X_i) = Pr(Y_i* > 0 ∣ X_i) = Pr(X_i′θ + ϵ_i > 0 ∣ X_i) = Pr(ϵ_i > −X_i′θ ∣ X_i) = 1 − F(−X_i′θ) = F(X_i′θ),

where we used the symmetry around 0 in the last equation. Thus, in total,

E[Y_i ∣ X_i] = Pr(Y_i = 1 ∣ X_i) = F(X_i′θ). (1)

Compare this with the linear regression model

E[Y_i ∣ X_i] = X_i′θ. (2)

Marginal (or partial) effects

The nonlinearity in (1) has important consequences for the interpretation of θ. In (2),

m_ik(X_i, θ) = ∂E[Y_i ∣ X_i]/∂X_ik = θ_k,

where the k-index refers to the (continuous) k-th component of X_i, i.e. the marginal effect is constant and depends only on θ_k. In (1),

m_ik(X_i, θ) = ∂E[Y_i ∣ X_i]/∂X_ik = θ_k f(X_i′θ),

where f(⋅) denotes the pdf corresponding to F.

⇒ In nonlinear models, the marginal effects are not constant! They depend on the values of each component of θ and on the data through f(X_i′θ)!

If X_ik is discrete (e.g. a dummy), the marginal effect is usually calculated as the effect of a one-unit increase in the regressor on the conditional probability. Let X_i(−k) = (X_i1, ..., X_i,k−1, X_i,k+1, ..., X_iK)′. Then the marginal effect of X_ik for individual i is

m_ik(X_i, θ) = F((X_ik + 1)θ_k + X_i(−k)′θ_(−k)) − F(X_i′θ).

Since f(X_i′θ) > 0, the sign of θ_k coincides with the sign of the marginal effect of X_ik. However, the magnitude of the effect depends on X_i and θ.

On the other hand, relative marginal effects are constant:

m_ik(X_i, θ)/m_il(X_i, θ) = (∂E[Y_i ∣ X_i]/∂X_ik)/(∂E[Y_i ∣ X_i]/∂X_il) = θ_k f(X_i′θ)/(θ_l f(X_i′θ)) = θ_k/θ_l.

It is also easy to find an upper bound for the magnitude: ∣m_ik(X_i, θ)∣ ≤ ∣θ_k∣ sup_{z∈ℝ} f(z). However, since the marginal effects depend on the data for each i, we typically need to evaluate them at some meaningful values. The most popular choice is probably the average partial effect (APE)

APE(X_ik) = (1/n) ∑_{i=1}^n m_ik(X_i, θ).

For the APE, the averaging is done "on the outside". Another popular choice is to evaluate the data at the sample mean (or the median) x̄ to obtain the partial effect at the average (PEA)

PEA(X_ik) = m_k(x̄, θ) = f(x̄′θ) θ_k,  with x̄ = (1/n) ∑_{i=1}^n X_i.
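Once θ̂ is available, the APE and PEA can be computed directly from the formulas above. The following is a minimal numpy/scipy sketch (my illustration, not part of the slides) for a probit specification; X, theta_hat and k are placeholder names for the data matrix, the ML estimate and the index of the regressor of interest.

```python
import numpy as np
from scipy.stats import norm

def probit_ape(X, theta_hat, k):
    """Average partial effect of the k-th (continuous) regressor:
    (1/n) * sum_i f(X_i'theta) * theta_k, with f the standard normal pdf."""
    xb = X @ theta_hat
    return norm.pdf(xb).mean() * theta_hat[k]

def probit_pea(X, theta_hat, k):
    """Partial effect at the average: f(x_bar'theta) * theta_k."""
    x_bar = X.mean(axis=0)
    return norm.pdf(x_bar @ theta_hat) * theta_hat[k]

# Illustration with simulated data and a made-up "estimate" theta_hat:
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
theta_hat = np.array([0.2, 0.5, -0.3])   # placeholder for the MLE
print(probit_ape(X, theta_hat, k=1), probit_pea(X, theta_hat, k=1))
```

For a logit specification, one would only replace the standard normal pdf by the logistic density.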
To showcase the nonlinearity in the marginal effects, it is often interesting to report marginal effects for a range of values (e.g. marginal effects for each quintile of the data) or at values that are representative or of particular interest in a specific application. For estimation of APEs etc., we simply replace any unknown parameters θ by their maximum likelihood estimates θ̂.

Logit vs. probit

The two most popular choices for the link function F are
● F(X_i′θ) = Φ(X_i′θ), where Φ is the standard normal cdf ⇒ Probit model
● F(X_i′θ) = Λ(X_i′θ), where Λ is the standard logistic cdf ⇒ Logit model

There is surprisingly little guidance on how to choose between the two models. In practice, it often boils down to computational aspects. Typically, the results for the APEs are pretty similar for both models, and the logit coefficient estimates are about 1.6 times as large as the probit coefficient estimates.
● The latter point is a crude rule of thumb: both pdfs attain their maximum at 0, and ϕ(0)/λ(0) ≈ 0.4/0.25 = 1.6.

Remark: Log-odds ratio in logit

Let p_i = Pr(Y_i = 1 ∣ X_i). The odds are defined as

odds = probability of success / probability of failure = p_i/(1 − p_i) ∈ [0, ∞).

The log-odds are thus given as

log-odds = log(p_i/(1 − p_i)) ∈ (−∞, ∞).

Assume that the log-odds are a linear function of X_i:

log(p_i/(1 − p_i)) = X_i′θ ⇔ p_i/(1 − p_i) = e^{X_i′θ} ⇔ p_i(1 + e^{X_i′θ}) = e^{X_i′θ}
⇔ p_i = e^{X_i′θ}/(1 + e^{X_i′θ}) = 1/(1 + e^{−X_i′θ}) = Λ(X_i′θ).

Thus, estimates from the logit model can be used to estimate the log-odds. Unfortunately, the same argument does not apply to probit.

Estimation

Assume the following:
● Y_i* = X_i′θ_0 + ϵ_i, where Y_i* is latent
● Y_i = 1{Y_i* > 0}
● ϵ_i ∣ X_i is i.i.d. across i with a distribution that is symmetric around 0 and a cdf F that is three times continuously differentiable.

Using these assumptions, the marginal pmf of Y_i given X_i is

pmf(Y_i ∣ X_i; θ) = F(X_i′θ)^{Y_i} (1 − F(X_i′θ))^{1−Y_i} I_{0,1}(Y_i).

By independence across i, the joint likelihood of Y_1, ..., Y_n conditional on X_1, ..., X_n is

L(θ) = ∏_{i=1}^n F(X_i′θ)^{Y_i} (1 − F(X_i′θ))^{1−Y_i} I_{0,1}(Y_i). (3)

The likelihood in (3) is called a conditional likelihood since we condition on the observed characteristics. By fully specifying the distribution of ϵ_i we make the model fully parametric, as the only unknown is the parameter. We implicitly assume that all moments are correctly specified. In particular, we assume that the variance of ϵ_i conditional on X_i is correctly specified as well.

From (3) we obtain the (scaled) log-likelihood

ℓ(θ) = (1/n) ∑_{i=1}^n [Y_i log(F(X_i′θ)) + (1 − Y_i) log(1 − F(X_i′θ))].

The log-likelihood is then maximized to obtain the MLE θ̂. However, there is typically no closed form for θ̂. We thus have to rely on numerical maximization routines (e.g. Newton-Raphson). In logit and probit, the likelihood is strictly concave if E[X_i X_i′] is positive definite. This is good news for numerical optimizers, as the maximizer will truly be a global rather than a local maximizer.

Let F_i = F(X_i′θ) and f_i = f(X_i′θ). The score is given by

∂ℓ(θ)/∂θ = (1/n) ∑_{i=1}^n [Y_i/F_i − (1 − Y_i)/(1 − F_i)] f_i X_i
         = (1/n) ∑_{i=1}^n [(Y_i(1 − F_i) − (1 − Y_i)F_i)/(F_i(1 − F_i))] f_i X_i
         = (1/n) ∑_{i=1}^n [(Y_i − F_i)/(F_i(1 − F_i))] f_i X_i.

The first-order conditions ∂ℓ(θ̂)/∂θ = 0 are therefore given by

(1/n) ∑_{i=1}^n [(Y_i − F̂_i)/(F̂_i(1 − F̂_i))] f̂_i X_i = 0 ∈ ℝ^K,

where F̂_i = F(X_i′θ̂) and f̂_i = f(X_i′θ̂).
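Since the score and Hessian have simple closed forms, Newton-Raphson is straightforward to implement. Below is a minimal numpy sketch (my addition, not from the slides) for the logit case, where f_i = F_i(1 − F_i), so the score simplifies to (1/n) ∑_i (Y_i − Λ(X_i′θ)) X_i; the names y and X are placeholders for the outcome vector and regressor matrix.

```python
import numpy as np

def logit_newton(y, X, tol=1e-10, max_iter=100):
    """Newton-Raphson for the logit MLE.
    y: (n,) array of 0/1 outcomes, X: (n, K) regressor matrix."""
    n, K = X.shape
    theta = np.zeros(K)
    for _ in range(max_iter):
        lam = 1.0 / (1.0 + np.exp(-X @ theta))           # Lambda(X_i'theta)
        score = X.T @ (y - lam) / n                       # (1/n) sum (Y_i - Lambda_i) X_i
        hess = -(X * (lam * (1 - lam))[:, None]).T @ X / n
        step = np.linalg.solve(hess, score)               # H^{-1} * score
        theta = theta - step                               # Newton update
        if np.max(np.abs(step)) < tol:
            break
    return theta
```

Starting from θ = 0 is unproblematic here because the logit log-likelihood is globally concave; a probit version would use the normal cdf and pdf in the score and Hessian instead.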
Count data

Count data variables take non-negative integer values and appear frequently in economic analysis:
● # patents granted
● # doctor visits
● # cigarettes smoked

Typically, there does not exist a natural upper bound, and the outcome is zero for some members of the population. Let Y be a count variable and X be a vector of explanatory variables. As usual, we are interested in modeling E[Y ∣ X]. A linear model is often not appropriate, as it can produce negative predictions when X_i′β̂ < 0.

In practice, some people try to account for the non-negativity of count data by using a log-transformation, i.e. they estimate log(Y_i) = X_i′β + ϵ_i. Since often Y_i = 0 for some i ∈ {1, ..., n}, people simply "add one to each Y_i", leading to log(Y_i + 1) = X_i′β + u_i. This approach has several downsides:
● Since log(⋅) is nonlinear, E[log(Y_i + 1) ∣ X_i] ≠ log(E[Y_i ∣ X_i] + 1).
● log(Y_i + 1) is discrete, so the usual %-change interpretation does not apply, as it works only for "small" changes.

It is often best to model E[Y_i ∣ X_i] directly, taking into account the count nature of Y_i. A popular choice is the Poisson model, i.e. we assume that Y_i ∣ X_i ~ Poisson(μ(X_i)) independently across i, so that

Pr(Y_i = k ∣ X_i) = e^{−μ(X_i)} μ(X_i)^k / k!,  k ∈ ℕ_0.

Typically, the mean μ(X_i) is modeled as μ(X_i) = e^{X_i′β} > 0.

The (pooled) Poisson likelihood is

L(β) = ∏_{i=1}^n e^{−e^{X_i′β}} (e^{X_i′β})^{Y_i} / Y_i!,

so the (scaled) log-likelihood is

ℓ(β) = (1/n) ∑_{i=1}^n [−e^{X_i′β} + Y_i X_i′β − log(Y_i!)].

Since terms that do not involve β can be ignored in the maximization, the log-likelihood is typically written as

ℓ(β) = (1/n) ∑_{i=1}^n [Y_i X_i′β − e^{X_i′β}].

The FOC therefore is

∂ℓ(β)/∂β = (1/n) ∑_{i=1}^n X_i (Y_i − e^{X_i′β}) = 0, (4)

and

∂²ℓ(β)/∂β∂β′ = −(1/n) ∑_{i=1}^n e^{X_i′β} X_i X_i′ < 0 (negative definite).

Thus, if E[X_i X_i′] has full rank, the likelihood is concave and the MLE β̂ is the unique maximizer of ℓ(β).

Interpretation of marginal effects

Taking the derivative with respect to X_ik yields

∂E[Y_i ∣ X_i]/∂X_ik = ∂e^{X_i′β_0}/∂X_ik = β_{0k} e^{X_i′β_0}.

Thus, as in binary choice models, the marginal effect of X_ik depends not only on β_0 but also on the level of X_i. Notice however that

(∂E[Y_i ∣ X_i]/∂X_ik) / E[Y_i ∣ X_i] = ∂log(E[Y_i ∣ X_i])/∂X_ik = β_{0k}.

Thus, 100 ⋅ β_{0k} is the semi-elasticity of E[Y_i ∣ X_i] w.r.t. X_ik, i.e. a one-unit change in X_ik induces a 100 ⋅ β_{0k}% change in E[Y_i ∣ X_i].

As in a linear model, let ϵ̂_i = Y_i − μ̂(X_i) = Y_i − e^{X_i′β̂}. Then, (4) evaluated at β̂ implies

∑_{i=1}^n ϵ̂_i X_i = 0,

i.e. the regressors are orthogonal to the residuals in the sample. There is another interesting similarity between the linear model and the Poisson model: as in the linear model, correct specification of E[Y_i ∣ X_i] = e^{X_i′β} is sufficient for consistency of the Poisson MLE.
● This is not obvious: typically, maximum likelihood estimates are only consistent if the whole distribution is correctly specified.
● With the Poisson model, it is thus common to estimate the coefficients with maximum likelihood but to use a "robust" sandwich estimator for the standard errors.

Robust Poisson standard errors

Recall that if the Poisson likelihood ℓ(β) is correctly specified (i.e. the data is truly generated according to a Poisson model),

√n (β̂ − β_0) →d N(0, F^{−1}),  where F = var_0(∂ℓ(β_0)/∂β).

Recall from our proof of asymptotic normality that if the information equality does not hold (for instance because the likelihood is incorrectly specified),

√n (β̂ − β_0) →d N(0, H^{−1} F H^{−1}),

where H = −E[∂²ℓ(β_0)/∂β∂β′] is minus the expected Hessian.
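As a concrete illustration of the sandwich formula, here is a minimal numpy/scipy sketch (my addition; it assumes only that the conditional mean e^{X_i′β} is correctly specified). It computes the Poisson MLE numerically and then the robust variance Ĥ^{−1} F̂ Ĥ^{−1}/n with Ĥ = (1/n) ∑_i e^{X_i′β̂} X_i X_i′ and F̂ = (1/n) ∑_i (Y_i − e^{X_i′β̂})² X_i X_i′.

```python
import numpy as np
from scipy.optimize import minimize

def poisson_fit_robust(y, X):
    """Poisson MLE with sandwich (robust) standard errors.
    y: (n,) non-negative integer counts, X: (n, K) regressor matrix."""
    n, K = X.shape

    def neg_loglik(beta):
        xb = X @ beta
        return -np.mean(y * xb - np.exp(xb))     # constants dropped

    beta_hat = minimize(neg_loglik, np.zeros(K), method="BFGS").x

    mu = np.exp(X @ beta_hat)
    resid = y - mu
    H = (X * mu[:, None]).T @ X / n                  # minus the Hessian of l(beta)
    F = (X * (resid ** 2)[:, None]).T @ X / n        # variance of the score
    Hinv = np.linalg.inv(H)
    vcov_robust = Hinv @ F @ Hinv / n                # sandwich H^-1 F H^-1 / n
    return beta_hat, np.sqrt(np.diag(vcov_robust))
```

In applied work one would usually rely on a packaged Poisson routine with a robust covariance option; the sketch just makes explicit what such an option computes.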
Despite this robustness property, the Poisson model also has downsides: as can easily be shown,

E[Y_i ∣ X_i] = μ(X_i) = var[Y_i ∣ X_i],

a property known as equi-dispersion. This can be problematic if there are "excess zeros", for instance in the analysis of # patents, where many firms have 0 patents. Assume that X ~ Poisson(λ) and that Y satisfies Y ∣ Y > 0 ~ X ∣ X > 0, i.e. Y behaves like a Poisson variable for values larger than 0. If Pr(Y = 0) > Pr(X = 0), then var(Y) > E(Y), so that the Poisson model is not appropriate.

Zero-inflated Poisson

An alternative that allows us to explicitly address excess zeros is the zero-inflated Poisson model.
● Other "zero-inflated" models exist as well.

The idea is that we combine a distribution that "generates the zeros" with a Poisson distribution for the non-zero counts:

pmf(Y_i ∣ X_i) = p + (1 − p) Poisson_λ(0) if Y_i = 0, and (1 − p) Poisson_λ(Y_i) if Y_i > 0,

where p is the probability of an excess zero. As in the Poisson model, λ = e^{X_i′β_0} is a popular choice.

Sample selection on observables

Selection on observables X_i can sometimes be ignored. Assume that we have a cross-section of size n and define the indicator

s_i = 1 if i stays in the sample, and s_i = 0 if i is dropped from the sample.

Further assume that the data is generated according to the linear model Y_i = X_i′β + ϵ_i.

The OLS estimator on the selected sample is

β̂ = (∑_{i=1}^n s_i X_i X_i′)^{−1} ∑_{i=1}^n s_i X_i Y_i = β + (∑_{i=1}^n s_i X_i X_i′)^{−1} ∑_{i=1}^n s_i X_i ϵ_i.

Assume that X_i is exogenous and E[X_i X_i′] has full rank in the population. For identification of β, we need to assume that

E[X_i X_i′ ∣ s_i = 1] has full rank, (5)

i.e. the usual rank condition holds on the selected subsample. Condition (5) may for instance be violated if our sample contains only females and we include a gender dummy.

For consistency of β̂ we need

E[s_i X_i ϵ_i] = 0. (6)

Condition (6) is satisfied if s_i is independent of X_i and ϵ_i ("missing completely at random"), as then E[s_i X_i ϵ_i] = E[s_i] E[X_i ϵ_i] = 0 by exogeneity of X_i. A slightly weaker condition that is still sufficient for consistency is E[ϵ_i ∣ X_i, s_i] = 0, which is known as "exogenous sampling". This assumption holds if, for instance, X_i is exogenous and s_i is a deterministic function of X_i.

For instance, let Y_i = β_0 + β_1 X_i + ϵ_i, where X_i captures wealth, and assume that X_i is an exogenous regressor. Suppose we select only households with a certain level of wealth, i.e.

s_i = 1 if wealth ≥ 100K, and s_i = 0 otherwise.

In that case, s_i = s_i(X_i) and

E[ϵ_i ∣ X_i, s_i] = E[ϵ_i ∣ X_i, s_i(X_i)] = E[ϵ_i ∣ X_i] = 0.

Intuitively, s_i(X_i) does not contain any more information than X_i and is thus irrelevant as a conditioning variable.
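To see the "exogenous sampling" result at work, here is a small simulation sketch (my illustration with made-up parameter values, not from the slides): selecting observations based on the exogenous regressor leaves the OLS slope essentially unchanged, whereas selecting based on the outcome does not.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(loc=2.0, size=n)            # "wealth"-type exogenous regressor
eps = rng.normal(size=n)
y = 1.0 + 0.5 * x + eps                    # true beta0 = 1.0, beta1 = 0.5

def ols_slope(x, y):
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

s_x = x >= 2.0                             # selection on the regressor only
s_y = y >= 2.0                             # selection on the response

print(ols_slope(x, y))                     # ~0.5: full sample
print(ols_slope(x[s_x], y[s_x]))           # ~0.5: selection on X is harmless here
print(ols_slope(x[s_y], y[s_y]))           # attenuated: selection on Y biases OLS
```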
Moreover, s_i is not allowed to be correlated with unobserved factors in the selected sample. For example, suppose we estimate

wage_i = β_0 + β_1 education_i + ϵ_i,

where ϵ_i contains motivation. Assume for the moment that education and motivation are uncorrelated in the population. Thus, if we truly had a random sample from the population, OLS would be consistent. If, however, we have a random sample of current workers, the average motivation of workers with low levels of education is likely to be higher than in the population: workers with low education and with (less than) average motivation are less likely to be in our sample of workers. Thus, in our selected sample, a correlation between education and motivation exists that does not exist in the population, so that E[education_i ⋅ ϵ_i] = 0 while E[education_i ⋅ ϵ_i ∣ s_i = 1] ≠ 0.

Another example, due to Elwert and Winship (2014): Suppose beauty and (acting) talent are uncorrelated in the population (which sounds plausible) and that both talent and beauty cause Hollywood success. Suppose we want to assess the effect of talent on wage using the model

wage_i = β_0 + β_1 talent_i + ϵ_i,

where ϵ_i contains beauty. If we had a random sample from the population, OLS would be consistent, as E[talent ⋅ beauty] = 0. Suppose however that we have a sample of Hollywood actors. In this sample, talent and beauty are correlated: any person without talent in our sample must be beautiful! Thus, E[talent ⋅ beauty ∣ Hollywood success] ≠ 0.

Censoring and truncation

While selection on the observables X_i can sometimes be ignored, selection on the response usually renders OLS biased. Specific examples where the data on the response is altered and OLS is inconsistent are censoring and truncation.

Under censoring, the observed outcome is constrained while the covariates are accurately observed. For instance, income studies (where income is the response variable) often only report "income ≥ 100K". Thus, for a household with 200K income, we observe all covariates, but the response is recorded as 100K. Under truncation, observations whose outcome variable is outside a certain range are excluded from the sample entirely.

Censoring and truncation can be handled by maximum likelihood methods (e.g. Tobit models). Here, we will only discuss the Heckit procedure, also known as the two-step Heckman correction or the Heckman selection model. The Heckman correction is particularly popular in cases where we need to correct for bias arising from non-randomly selected samples. For his contributions, Heckman received the Nobel prize in Economics (together with Daniel McFadden) in 2000. With more than 200,000 citations, Heckman is probably one of the most prominent economists alive.

Let Y_i* be a latent variable. The observed variable Y is called left-censored (or censored from below) at L if

Y = Y* if Y* > L, and Y = L if Y* ≤ L.

Similarly, Y is called right-censored (or censored from above) at U if

Y = Y* if Y* < U, and Y = U if Y* ≥ U.

Consider the latent variable model

Y_i* = X_i′β_0 + ϵ_i,  ϵ_i ∣ X_i ~ N(0, σ_0²),

where we observe Y_i = max(0, Y_i*). A naive way of dealing with censored data would be to use OLS on the subset of the data that is not affected by censoring. However, in our model,

E[Y_i ∣ X_i, Y_i > 0] = E[Y_i* ∣ X_i, Y_i* > 0] = X_i′β_0 + E[ϵ_i ∣ X_i, Y_i* > 0] = X_i′β_0 + E[ϵ_i ∣ X_i, ϵ_i > −X_i′β_0].

Now, if X ~ N(μ, σ²), then

E[X ∣ X > a] = μ + σ λ((a − μ)/σ),  where λ(x) = ϕ(x)/(1 − Φ(x))

is the inverse Mills ratio. Setting α_0 = β_0/σ_0,

E[Y_i ∣ X_i, Y_i > 0] = X_i′β_0 + σ_0 λ(−X_i′β_0/σ_0) = X_i′β_0 + σ_0 λ(−X_i′α_0).

Thus, running a linear regression only on the observations with Y_i > 0 will lead to inconsistent estimates: the term σ_0 λ(−X_i′α_0) acts like an omitted variable.

For Y_i > 0, we can always (trivially) write

Y_i = E[Y_i ∣ X_i, Y_i > 0] + u_i,  u_i = Y_i − E[Y_i ∣ X_i, Y_i > 0],

so that E[u_i ∣ X_i, Y_i > 0] = 0 by construction. Our model for the data with Y_i > 0 thus is

Y_i = X_i′β_0 + σ_0 λ(−X_i′α_0) + u_i.

Using OLS in this model would yield consistent estimates of the parameters β_0 and σ_0.
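The truncated-mean formula behind the inverse Mills ratio can easily be checked by simulation. Below is a small sketch (my addition, with arbitrary values for μ, σ and the truncation point a) comparing the simulated E[X ∣ X > a] with μ + σ λ((a − μ)/σ).

```python
import numpy as np
from scipy.stats import norm

def inv_mills(x):
    """Inverse Mills ratio: phi(x) / (1 - Phi(x))."""
    return norm.pdf(x) / (1.0 - norm.cdf(x))

mu, sigma, a = 1.0, 2.0, 0.5               # arbitrary illustration values
rng = np.random.default_rng(2)
draws = rng.normal(mu, sigma, size=1_000_000)

simulated = draws[draws > a].mean()        # Monte Carlo estimate of E[X | X > a]
formula = mu + sigma * inv_mills((a - mu) / sigma)
print(simulated, formula)                  # the two numbers should be very close
```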
Problem: The "data" λ(−X_i′α_0) is unobserved since α_0 is unknown.

Solution: Use a first-stage probit model to estimate Pr(Y_i* > 0). Let d_i = 1{Y_i > 0}. Then

Pr(d_i = 1) = Pr(Y_i* > 0) = Pr(X_i′β_0 > −ϵ_i) = Pr(ϵ_i/σ_0 < X_i′β_0/σ_0) = Φ(X_i′α_0),

where we used the symmetry of the distribution of ϵ_i in the second step. The likelihood for α can thus be written as

L(α) = ∏_{i=1}^n Φ(X_i′α)^{d_i} (1 − Φ(X_i′α))^{1−d_i}.

After estimating α̂ by maximum likelihood, use the subsample with Y_i > 0 to run OLS on

Y_i = X_i′β + σ λ(−X_i′α̂) + u_i*,  u_i* = u_i + error from the first-stage probit.

The Heckit approach is also often used to address sample selection: Suppose we want to analyze the effect of a variable x on wage. If the population of interest is "current workers", a random sample of current workers yields unbiased results. If the population is "all people of working age", our estimates are likely to suffer from sample selection bias. Sometimes the Heckit is also used to alleviate self-selection bias: if one wants to assess the effect of a voluntary job training on wage, participation is endogenous. By explicitly modeling the participation decision, we can get unbiased estimates, provided the participation equation is correctly specified.
● Since this often entails strong assumptions, many researchers nowadays rather rely on natural experiments (see DiD).

Assume that we have two latent variables

Y_1* = X_1′β_1 + ϵ_1
Y_2* = X_2′β_2 + ϵ_2

and assume that

(ϵ_1, ϵ_2)′ ~ N( (0, 0)′, [ 1, σ_12 ; σ_12, σ_2² ] ).

The participation equation is

d = 1 if Y_1* > 0, and d = 0 else,

and the outcome equation is

Y = Y_2* if d = 1, and Y = 0 if d = 0.

Suppose we only used the observations with d = 1 ⇔ Y_1* > 0. Then

E[Y ∣ X_1, X_2, d = 1] = E[Y_2* ∣ X_1, X_2, Y_1* > 0] = X_2′β_2 + σ_12 λ(−X_1′β_1).

The Heckit procedure therefore first estimates β_1 with a probit model using all observations (d ∈ {0, 1}). In the second step, restrict the sample to observations with d = 1 and run OLS of Y on X_2 and λ(−X_1′β̂_1). While very elegant, the Heckit procedure is sensitive to the distributional assumptions (and is thus currently a bit out of fashion).
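To make the two steps concrete, here is a minimal numpy/scipy sketch of the procedure just described (my illustration, not the lecture's code): a probit first stage for the selection indicator on X_1, construction of the inverse Mills ratio term, and a second-stage OLS of Y on X_2 and that term using only the selected observations. The function names and the assumption that X1 and X2 are passed as plain arrays are mine.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def probit_mle(d, X):
    """First stage: probit MLE for Pr(d = 1 | X) = Phi(X'alpha)."""
    def neg_loglik(a):
        p = np.clip(norm.cdf(X @ a), 1e-10, 1 - 1e-10)
        return -np.mean(d * np.log(p) + (1 - d) * np.log(1 - p))
    return minimize(neg_loglik, np.zeros(X.shape[1]), method="BFGS").x

def heckit_two_step(y, d, X1, X2):
    """Two-step Heckman correction.
    d: (n,) 0/1 selection indicator, y: (n,) outcome (only used where d == 1),
    X1: (n, K1) selection-equation regressors, X2: (n, K2) outcome-equation regressors."""
    alpha_hat = probit_mle(d, X1)
    z = X1 @ alpha_hat
    mills = norm.pdf(z) / norm.cdf(z)        # equals lambda(-X1'alpha_hat) in the slides' notation
    sel = d == 1
    W = np.column_stack([X2[sel], mills[sel]])
    coef, *_ = np.linalg.lstsq(W, y[sel], rcond=None)
    return alpha_hat, coef[:-1], coef[-1]    # alpha_hat, beta2_hat, sigma12_hat
```

Note that the second-stage standard errors from a plain OLS routine are not valid here, since λ(−X_1′α̂) is a generated regressor; packaged Heckman implementations adjust for this.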