Maximum Likelihood in Concept and Practice
(cribbed mostly from Gary King’s Unifying Political Methodology)

Monday, March 29: The Goals and Foundations of Maximum Likelihood

I. Definitions and Notation
II. The Linear Model in a General Form
III. Why Inverse Probability Doesn’t Work
IV. Why the Likelihood Model of Inference Works

I. Definitions and Notation

A. Developed by statistician R.A. Fisher in the 1920s, borrowed first by economists, and finally imported into political science, maximum likelihood provides a fundamental rationale for using appropriate estimators and gives us lots of flexibility to match our statistical models to the process that we think generated our data. There are a number of overarching approaches designed to unify statistical methodologies, but this is by far the most familiar to political scientists. It requires you to make explicit choices about how you think your dependent variable is distributed (the model’s stochastic component) and what the relationship is between your independent and dependent variables (the model’s systematic component). Then, based on some basic rules of probability, it teaches you how to write out a likelihood function and find the set of parameters most likely to have generated the observed data, given your assumed model. Before we learn the step-by-step process of getting a maximum likelihood estimate, we need to learn a bit of notation.

B. Let Yi be a “random variable.” It is random in that there is stochastic variation in it across many experiments for a single observation, and a variable in that it varies across observations in a single experiment. Let yi be one draw from the random variable. Let xi be one draw (consisting of one or more explanatory factors) from the social system X.

C. Hypothesize some model M about how the social system produces the random variable.
We can partition this model into M*, the part of the model that we will assume, and θ, the part of the model composed of parameters that we will estimate. A fully “restrictive” model has all of its assumptions specified (it is all M* and no θ); it is the most parsimonious model and omits all of the variables. An “unrestrictive” model estimates everything, is all θ and no M*, and is more interesting but demands more from the data. In hypothesis testing, we will often compare a fairly unrestrictive model to a slightly more restrictive model.

II. The Linear Model in a General Form

A. You should be familiar with this way of writing out an OLS regression:

$$Y_i = \underbrace{x_i\beta}_{\text{systematic component}} + \underbrace{\varepsilon_i}_{\text{stochastic or random component}}$$

King refers to it as the linear normal regression model, breaking it down into its linear systematic component and its normally distributed stochastic component. The stochastic component’s distribution is given by εi ~ fN(ei|0, σ²). This should be read as “the errors are distributed normally with a mean of zero and a variance of σ²,” which is elsewhere written as εi ~ N(0, σ²).

B. A more general way of writing out an identical linear normal model is:

Yi ~ fN(yi|μi, σ²), where μi = xiβ

Note that this expression models the randomness in Yi directly, rather than through εi. Econometrics textbook writers like Goldberger show that this assumption of normality in the distribution of Yi around its expected value is equivalent to assuming that the errors are normally distributed around zero. King uses this style of presentation because the maximum likelihood process requires you to make a substantive assumption about how the data are generated (and thus distributed), and thinking about the dependent variable itself is usually more natural than thinking about its errors.

i. The systematic component in the expression above is a statement of how θi varies over observations as a function of a vector of explanatory variables.
It says that xi and Yi are “parametrically related” through E(Yi) = μi = xiβ. It can be written in a general functional form as θ = g(X, β).

ii. The stochastic component should not be viewed merely as an annoyance, but as an expression that contains substantive information. It can be written generally as Yi ~ fi(yi|θi, αi), where θ is the vector of parameters of interest, like μi in the linear case, and α is the vector of ancillary parameters, like σ² in the linear case.

III. Why Inverse Probability Doesn’t Work

A. Wouldn’t it be great if we could determine the absolute probability of some parameter vector θ, given our data y and model M*? If we could do that, we could conduct a poll, assume that the variables that we didn’t measure are irrelevant, and then make statements like, “There is a 0.8237 probability that the effect of getting a PhD on your expected annual income is -$32,689.” This would be an inverse probability statement, and for a while it was the holy grail of statistics. It can be formalized as Pr(θ|y, M*), though because M* is assumed, it is usually suppressed and an inverse probability is written as Pr(θ|y).

B. Using some basic rules of probability, we can see what we would need in order to calculate an inverse probability:

Pr(θ|y) = Pr(θ,y) / Pr(y), by the rule that Pr(a|b) = Pr(a,b)/Pr(b)
Pr(θ|y) = Pr(θ)Pr(y|θ) / Pr(y), by substituting Pr(θ,y) = Pr(y,θ) = Pr(y|θ)Pr(θ)

This is “Bayes’ Theorem,” and statisticians thought it would give them a way to calculate an inverse probability. It is possible to calculate Pr(y|θ), which is the probability of observing your data given a hypothesized parameter vector, and is referred to as the “traditional probability.” We can put Pr(y) in terms of Pr(y|θ) and Pr(θ), but this leaves us with the tricky Pr(θ), one’s prior belief about θ.
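As a sketch of the step that puts Pr(y) in terms of Pr(y|θ) and Pr(θ): assuming, purely for illustration, that θ can take only a discrete set of values θ′ (with a continuous θ the sum becomes an integral), the law of total probability gives

```latex
\Pr(y) = \sum_{\theta'} \Pr(y \mid \theta')\,\Pr(\theta')
\quad\Longrightarrow\quad
\Pr(\theta \mid y) =
  \frac{\Pr(y \mid \theta)\,\Pr(\theta)}
       {\sum_{\theta'} \Pr(y \mid \theta')\,\Pr(\theta')}
```

Every term on the right is either a traditional probability or a prior, so the prior is the only ingredient that the data and the assumed model do not supply.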
There is a raging debate in statistics between the “frequentists,” who define probability as the relative frequency of an event in hypothetical repetitions of an experiment, and those who say probability is only the subjective judgment of individuals. But no matter which camp you come from, you cannot use Pr(θ) to assign an absolute value to the inverse probability.

IV. Why the Likelihood Model of Inference Works

A. Without a method for calculating absolute inverse probabilities, we must be content with relative measures of uncertainty. This is what the likelihood model of inference gives us. Now we will let θ̃ be hypothetical values of the single, unobserved true value θ and let θ̂ be the point estimator for it. We can write out the “Likelihood Axiom”:

$$L(\tilde{\theta} \mid y, M^*) \equiv L(\tilde{\theta} \mid y)$$
$$L(\tilde{\theta} \mid y) = k(y)\,\Pr(y \mid \tilde{\theta})$$
$$L(\tilde{\theta} \mid y) \propto \Pr(y \mid \tilde{\theta})$$

In this axiom, k(y) is an unknown function of the data alone. Because it does not depend on θ̃, it can be treated as a constant for a given dataset, which makes the likelihood of a hypothetical parameter value given the data only proportional to the traditional probability. For a given set of observed data, k(y) is the same over all possible values of θ̃. It varies, though, with different datasets, and this is what makes likelihood statements relative (relative to the likelihoods of other parameter values given the same dataset). L(θ̃|y) is the likelihood of a hypothetical model having generated the data, assuming M*. It can take on any value, and you can only compare likelihoods for the same dataset.

B. A “likelihood function” summarizes θ, allowing us to plot values of θ̃ by the likelihood of each, given the data. “Maximum likelihood estimation” is a theory of point estimation that takes as its estimate the value of θ̃ at which the likelihood function reaches its maximum.

Wednesday, March 31st. Constructing a Likelihood Function

V. A Survey of Stochastic Components
VI. The Likelihood Function for a Normal Variable
VII. Summarizing a Likelihood Function

V. A Survey of Stochastic Components

A. Approach and Notation.
King’s Chapter 3 looks at many possible forms that the stochastic component of one observation yi of the random variable Yi could take, while Chapter 4 introduces the systematic component of yi as it varies over N observations. Let S be the sample space, the set of all possible events, and let zki be one event. This event is a set of outcomes of type k. Let yji be a real number representing one possible outcome of an experiment.

B. Useful Axioms of Probability. These tell us what the univariate probability distributions that we will survey look like in general.
i. For any event zki, Pr(zki) ≥ 0.
ii. Pr(S) = 1.0
iii. If z1i, …, zki are k mutually exclusive events, then Pr(z1i ∪ z2i ∪ … ∪ zki) = Pr(z1i) + Pr(z2i) + … + Pr(zki)

C. Results for stochastically independent random variables. These will be most useful when we construct likelihood functions.
i. Pr(Y1i, Y2i) = Pr(Y1i)Pr(Y2i)
ii. Pr(Y1i|Y2i) = Pr(Y1i)

D. What is a Univariate Probability Distribution? It is a complete accounting of the probability that Yi takes on any particular value yi. For discrete outcomes, you can write out (and graph) a probability mass function (pmf). For continuous outcomes, you can write out or graph a probability density function (pdf). Each function is derived from substantive assumptions about the underlying “data generating process.” You can develop your own distribution, or you can select one of the many functions off the shelf that King surveys, such as:

i. The Bernoulli Distribution. This is used when a variable can only take on two mutually exclusive and exhaustive outcomes, such as a two party election where either one party wins or the other does. The algebraic representation of the pmf incorporates a parameter, π, that measures how fair the die was that determines which outcome takes place.

ii. The Binomial Distribution. The process that generates this sort of data is a set of many Bernoulli random variables, and we observe their sum.
It could be how many heads you get in six flips of a coin, or how many times a person voted over six elections. It requires the assumption that the Bernoulli trials are “i.i.d.,” or independent and identically distributed. This means that the coin (or person) has no memory, and that the probability of getting heads (or voting) is the same in each trial.

iii. Extended Beta Binomial. This relaxes the i.i.d. assumption of the Binomial Distribution, and could be useful for looking at yes or no votes cast by 100 Senators.

iv. Poisson. For a count with no upper bound, when the occurrence of one event has no influence on the expected number of subsequent events.

v. Negative Binomial. Just like the Poisson, but the rate of occurrence of events varies according to the gamma distribution.

vi. Normal Distribution. This is for a continuous variable, where the disturbance term is the sum of a large number of independent but unobserved factors. A possible substantive example is presidential approval over time. The random variable is symmetric, and has events with nonzero probability occurring everywhere. This means that a normal distribution cannot generate a random variable that is discrete, skewed, or bounded. Its pdf can be written out as:

$$f_N(y_i \mid \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\left[-\frac{(y_i - \mu)^2}{2\sigma^2}\right]$$

where π ≈ 3.14 and exp[a] = e^a.

vii. Log-Normal Distribution. It is like the Normal Distribution, but with no negative values.

VI. The Likelihood Function for a Normal Variable

A. Begin by writing out the stochastic component in its functional form (the pdf that returns the probability density of getting yi in any single observation, given μi). This is the traditional probability, and it is proportional to the likelihood.

$$f_N(y_i \mid \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\left[-\frac{(y_i - \mu)^2}{2\sigma^2}\right]$$

B.
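If we can assume that there is stochastic independence across our observations (no autocorrelation), we can use the Pr(Y1i, Y2i) = Pr(Y1i)Pr(Y2i) rule to build a joint probability distribution for our observations:

$$f(y \mid \mu) = \prod_{i=1}^{n} f_N(y_i \mid \mu_i)$$
$$f(y \mid \mu) = \prod_{i=1}^{n} (2\pi\sigma^2)^{-1/2} \exp\left[-\frac{(y_i - \mu_i)^2}{2\sigma^2}\right]$$

C. In the next step, called “reparameterization,” we substitute in a systematic component for our generic parameter. In this case, we substitute a linear function.

$$f(y \mid \mu) = \prod_{i=1}^{n} (2\pi\sigma^2)^{-1/2} \exp\left[-\frac{(y_i - x_i\beta)^2}{2\sigma^2}\right]$$

D. Now we take this traditional probability, which returns an absolute probability, and use the likelihood axiom to get something that is proportional to the inverse probability. We also want to work with an expression that is mathematically tractable, and since any monotonic function of the traditional probability can serve as a relative measure of likelihood, for convenience we will take the natural log of this function. We can also use the “Fisher-Neyman Factorization Lemma,” which proves that in a likelihood function, we can drop every term not depending on the parameters, to get rid of k(y). Finally, we are also going to use algebraic tricks like ln(abc) = ln(a) + ln(b) + ln(c) and ln(a^b) = b ln(a).

$$L(\tilde{\beta}, \tilde{\sigma}^2 \mid y) = k(y)\,\Pr(y \mid \tilde{\beta}, \tilde{\sigma}^2)$$
$$L(\tilde{\beta}, \tilde{\sigma}^2 \mid y) = k(y) \prod_{i=1}^{n} f_N(y_i \mid x_i\tilde{\beta}, \tilde{\sigma}^2)$$
$$L(\tilde{\beta}, \tilde{\sigma}^2 \mid y) \propto \prod_{i=1}^{n} f_N(y_i \mid x_i\tilde{\beta}, \tilde{\sigma}^2)$$
$$L(\tilde{\beta}, \tilde{\sigma}^2 \mid y) \propto \prod_{i=1}^{n} (2\pi\tilde{\sigma}^2)^{-1/2} \exp\left[-\frac{(y_i - x_i\tilde{\beta})^2}{2\tilde{\sigma}^2}\right]$$
$$\ln L(\tilde{\beta}, \tilde{\sigma}^2 \mid y) = \sum_{i=1}^{n} \ln\left\{(2\pi\tilde{\sigma}^2)^{-1/2} \exp\left[-\frac{(y_i - x_i\tilde{\beta})^2}{2\tilde{\sigma}^2}\right]\right\}$$
$$\ln L(\tilde{\beta}, \tilde{\sigma}^2 \mid y) = \sum_{i=1}^{n} \left\{-\frac{1}{2}\ln(2\pi\tilde{\sigma}^2) - \frac{(y_i - x_i\tilde{\beta})^2}{2\tilde{\sigma}^2}\right\}$$
$$\ln L(\tilde{\beta}, \tilde{\sigma}^2 \mid y) = \sum_{i=1}^{n} \left\{-\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln(\tilde{\sigma}^2) - \frac{1}{2\tilde{\sigma}^2}(y_i - x_i\tilde{\beta})^2\right\}$$
$$\ln L(\tilde{\beta}, \tilde{\sigma}^2 \mid y) = \sum_{i=1}^{n} \left\{-\frac{1}{2}\ln(\tilde{\sigma}^2) - \frac{1}{2\tilde{\sigma}^2}(y_i - x_i\tilde{\beta})^2\right\}$$

The last line drops the term -½ ln(2π), which does not depend on the parameters.

VII. Summarizing a Likelihood Function

A.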
This log likelihood function is an expression representing a function that could be graphed, but you would need one dimension for the likelihood value, one for the constant term, one for each independent variable, and one for each ancillary parameter like σ² (we have assumed homoskedasticity in this model, making σ² constant, but we could have chosen to model its variation across observations). So instead of using all of the information in the function, we will summarize the function by finding its maximum, the value of β that gives us the greatest likelihood of having generated the data.

B. Analytical Method. For relatively simple likelihood functions, we can find a maximum by going through the following four steps.
i. Take the derivative of the log-likelihood with respect to your parameter vector. We took the log of the likelihood function mainly because taking the derivative of a sum is much easier than taking the derivative of a product.
ii. Set the derivative equal to zero.
iii. Solve for your parameter, thus finding an extreme point.
iv. Find out whether this extreme point is a maximum or a minimum by taking the second derivative of the log-likelihood function. If it is negative, the function bows downward before and after the extreme point and you have a (possibly local) maximum.

For our linear model, the analytical solution for the variance parameter is:

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - x_i\beta)^2$$

which should look familiar! Because β enters the log likelihood only through the sum of squared errors, with a negative sign, maximizing the likelihood over β means minimizing the squared error. Thus OLS can be justified by maximum likelihood as well as by the fact that it is a convenient way to summarize a relationship and that it has all of the properties that are desirable in an estimator.

C. Numerical/Computational Methods. This is what Stata does, because some likelihood functions do not have an analytical solution. You can write out a likelihood function and then begin with a starting value (or vector of values) for the parameter of interest.
Then you can use an algorithm (a recipe for a repeated process) to try out better and better combinations of parameter values until you maximize the likelihood. The Newton-Raphson and Gauss-Newton algorithms are common ones, and they use linear algebra to work with vectors and matrices of derivatives. Basically, they start with a parameter vector, get its likelihood, then look at the gradient of the likelihood function to see which direction they should move in order to get a higher likelihood value, and keep going until they can’t move in any direction and get a higher likelihood.

Monday, April 5. What Good is a Likelihood Function?

VIII. Properties of Maximum Likelihood Estimators
IX. Likelihood Ratio Test
X. Interpreting Functional Forms (The Hard Way)
XI. Interpreting Functional Forms (The Easy Way)

VIII. Properties of Maximum Likelihood Estimators

A. Gary King prefers justifying maximum likelihood based on its deep philosophical foundations, but statisticians have traditionally justified estimators based on their properties or criteria. Here are some of the basic properties (and note that maximum likelihood estimators don’t always possess all of them):

B. Finite Sample Properties:
i. Invariance to Reparameterization. You can “trick” maximum likelihood by estimating β, and then taking the natural log of that estimate in order to get the ML estimate of ln(β).
ii. Invariance to Sampling (Size) Plans. ML estimators don’t depend on the sample size rule, so it’s OK if you run out of dissertation funding and have to collect a smaller dataset.
iii. Minimum Variance Unbiasedness. If a minimum variance unbiased estimator exists, then ML picks it. You may have seen a proof in an earlier class that a least squares estimator is an MVUE, and you should be comforted by the fact that ML picked it, given assumptions of linearity and normally distributed errors.

C. Asymptotic Properties: (these look at the properties of more and more θ̂s estimated from datasets with larger and larger ns)
i. Consistency.
As n goes to infinity, the sampling distribution of an estimator θ̂n converges to a spike over the true θ. ML estimators can violate consistency when you want to estimate as many parameters as you have cases, but in these rare instances, no other estimator is consistent.
ii. Asymptotic Normality. For a very large n, the sampling distribution of θ̂n is normal.
iii. Asymptotic Efficiency. An ML estimator has a smaller asymptotic variance than any other consistent and uniformly asymptotically Normal estimator.

IX. Likelihood Ratio Test

A. Since maximum likelihood is a relative concept, we are going to have to compare hypotheses about the same data in order to judge the precision of estimates. Specifically, we can compare an unrestrictive model to a more restrictive model representing the null hypothesis that some parameter is fixed at zero (meaning that the variable is omitted). We can do this in three ways:
i. Wald’s test corresponds to using standard errors of coefficients. (Greene, page 486)
ii. A Rao’s Score/Lagrange Multiplier test uses only the null model. (Greene, page 489)
iii. A Likelihood Ratio Test compares both models’ likelihoods. (Greene, page 484)

B. “Likelihood Ratios.” In the same dataset, these allow us to compare the likelihoods of two hypothetical values of the parameters in the same units as the corresponding traditional probabilities, using simple math. This allows us to conduct hypothesis tests. The likelihood ratio is:

$$\frac{L(\tilde{\theta}_1 \mid y)}{L(\tilde{\theta}_2 \mid y)} = \frac{k(y)\,\Pr(y \mid \tilde{\theta}_1)}{k(y)\,\Pr(y \mid \tilde{\theta}_2)}$$
$$\frac{L(\tilde{\theta}_1 \mid y)}{L(\tilde{\theta}_2 \mid y)} = \frac{\Pr(y \mid \tilde{\theta}_1)}{\Pr(y \mid \tilde{\theta}_2)}$$

C. The Likelihood Ratio Test. Let L* be the maximum of the likelihood function of the unrestrictive model representing the alternative hypothesis. Let L*R be the maximum of the likelihood function of the restrictive model representing the null hypothesis. We know that L* is greater than or equal to L*R, because using an additional explanatory variable cannot hurt your model.
The question is (as in any hypothesis test), how do we know that the improvement in the likelihood that we get by adding this variable is sufficiently large that it is not due to chance alone? We rely on a result from distribution theory:

Likelihood Ratio
R = -2 ln(L*R / L*)
R = 2 (ln(L*) - ln(L*R))

If the null hypothesis is true, this likelihood ratio is distributed (in large samples) according to the chi-square distribution with m degrees of freedom, where m is the number of restrictions, that is, the number of parameters that the restrictive model fixes. The expected value of R is m, so if you get an R that is much bigger than m, you should probably reject the null and adopt the unrestricted model. You can look at a chi-square table with m degrees of freedom to find the probability of getting a given R by chance alone, assuming that the null hypothesis is true. [Note that this is a traditional probability statement, and we are able to assign an absolute probability here].

X. Interpreting Functional Forms (The Hard Way)

A. Here’s the axe that Gary King has so profitably ground: “If β has no substantive interpretation for a particular model, then that model should not be estimated” (p. 102). He is reacting to the previously common practice of reporting coefficients from ML models like probit and logit, which are not obviously intuitive, discussing their sign and significance, and then throwing up your hands at how to make sense of the coefficient’s point value. Of course, he really isn’t advising anyone not to estimate any models. And his solution to this problem is not to run OLS, with its easy-to-interpret coefficients, but to run the ML model that fits your data generating assumptions, and then do some work in order to interpret its coefficients. In the bad old days before CLARIFY, you had to get out your calculator to do this. We are going to apply the logit function to discrete, stochastically independent outcomes and learn how to make meaningful statements about ML coefficients.

B.
Suppose you are trying to explain a phenomenon that can only result in two outcomes. Rather than selecting a continuous univariate probability distribution (like the Normal) to model its stochastic component, you’d want to select a discrete distribution with two outcomes like the Bernoulli distribution. Remember that if yi = 0, 1, the probability of the outcome given some parameter πi is given by:

$$\pi_i^{y_i}(1 - \pi_i)^{1 - y_i}$$

and takes on a value of zero elsewhere. Note that πi is really just the probability that your variable takes on the value of one, and thus πi must be some number between zero and one. We also need to supply a systematic component here, something that tells us how πi, the chances of a particular outcome, varies across observations. If we used a linear systematic component, it would return values of π that were potentially less than zero or greater than one. We also might want a systematic component that represents a much larger marginal effect of an explanatory variable when it varies in the middle of its range, but smaller effects at the bottom and top ends of its range. A systematic component that fits this substantive story is the logit functional form, substantively similar to probit but more tractable mathematically.

πi = 1/[1 + exp(-xiβ)]

If we can assume that all of our observations are generated by independent Bernoulli processes, we write out a joint distribution, and (after omitted algebraic steps) turn it into a log likelihood function.

$$\Pr(Y \mid \pi) = \prod_{i=1}^{n} \pi_i^{y_i}(1 - \pi_i)^{1 - y_i}$$
$$\ln L(\tilde{\beta} \mid y) = \sum_{i=1}^{n} \left\{-y_i \ln[1 + \exp(-x_i\tilde{\beta})] - (1 - y_i)\ln[1 + \exp(x_i\tilde{\beta})]\right\}$$

C. Interpreting coefficients. One reason this is not as simple as in the linear case is that the effect of an explanatory variable can depend on its level and upon the level of other explanatory variables. Because effects vary in the case of logit, to isolate the impact of one variable, we have to hold the other variables constant at some value. Holding variables constant at their means is one intuitive choice.
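To see why a logit effect depends on where you start, here is a minimal Python sketch. The coefficient beta_j and the baseline values of the linear index are hypothetical numbers chosen only for illustration; they are not estimates from any model in these notes.

```python
import math


def logit_prob(xb):
    """Logit functional form: pi = 1 / (1 + exp(-x*beta))."""
    return 1.0 / (1.0 + math.exp(-xb))


# Hypothetical coefficient on the key variable (an assumption, not an estimate).
beta_j = 0.5


def effect_of_unit_increase(baseline_xb):
    """Change in pi when the key variable rises by one unit from a baseline index."""
    return logit_prob(baseline_xb + beta_j) - logit_prob(baseline_xb)


# The same one-unit change, starting from three different values of the
# linear index x*beta (low, middle, high).
effects = {xb: effect_of_unit_increase(xb) for xb in (-4.0, 0.0, 4.0)}
for xb, change in effects.items():
    print(f"baseline index {xb:+.1f}: change in probability = {change:.4f}")
```

The change is largest when the baseline index is near zero (π near one half) and much smaller at either extreme, which is why the values at which the other variables are held matter.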
If some of these explanatory variables are categorical or dichotomous, you might want to hold them constant at their median or modal values. Once you do this, you can report the effects of changes in your key explanatory variables in one of three ways:

i. Graphical Methods. You can graph the predicted probability of one outcome, π, against different values Xji of variable Xj by using the following formula (where X* is the vector of all other k-1 explanatory variables):

$$\hat{\pi} = \frac{1}{1 + \exp[-X^*\hat{\beta}^* - X_{ji}\hat{\beta}_j]}$$

ii. Fitted Values. You can plug in just a few key values of the key explanatory variable, run them through the equation above, and report the predicted probability of one outcome.

iii. First Differences. This tells you how much a parameter like π changes due to a change in your key explanatory variable. To compute it, you subtract the fitted value at point Xja from the fitted value at point Xjb:

$$\text{First Difference} = \frac{1}{1 + \exp[-X^*\hat{\beta}^* - X_{jb}\hat{\beta}_j]} - \frac{1}{1 + \exp[-X^*\hat{\beta}^* - X_{ja}\hat{\beta}_j]}$$

XI. Interpreting Functional Forms (The Easy Way)

A. All it takes is a set of coefficients, descriptive measures of your variables, and a calculator to find these first differences. But we are lazy political scientists, and this laziness led many researchers to stop after getting the coefficients. So 11 years after writing UPM, Gary King, Michael Tomz, and Jason Wittenberg wrote a very useful program called CLARIFY that works inside of Stata to calculate things like first differences for us. You can go to http://GKing.harvard.edu, watch Gary’s face get assembled, and download it. Be sure to get the documentation as well. CLARIFY estimates an ML model and then simulates 1000 vectors of parameters (rather than basing its confidence intervals on the standard errors of coefficients).

i. In order to do this, just type estsimp at the beginning of the Stata command line that you would normally enter to run an ML model (e.g. estsimp logit y x1 x2).

ii.
The next step asks you to use the setx command to hold the explanatory variables constant at some value (e.g. setx mean).

iii. Finally, use the simqi command to simulate some quantity of interest, conditional on how you have set the explanatory variables. For instance, simqi fd(prval(1)) changex(x1 67 263) asks it to simulate the change in the probability that Y=1 brought by an increase in variable x1 from a value of 67 to a value of 263.

We will go over how to use CLARIFY in a lab, but here is an example from some of my research of what Stata/CLARIFY output and my notes on it look like, as well as a table that presents this information. Stata commands are in bold.

    estsimp mlogit exit adsalary totalday staffper ptedge ptloss turn1n leaddem caucus house size appoint money k6
    setx mean

For each outcome, I move continuous variables from one standard deviation below their mean to one above, while holding other variables constant at their mean, and predict the effect. For dichotomous variables, I simulate the effects of moving from zero to one.

Effects on the Chances of Losing Power Even Though Party Retains Control

Salaries: $2307 (North Dakota, 1992) to $24374 (Illinois, 1992)

    simqi fd(prval(1)) changex(adsalary 2307 24374)

    First Difference: adsalary 2307 24374

           Quantity of Interest |       Mean    Std. Err.      [95% Conf. Interval]
    ----------------------------+--------------------------------------------------
                 dPr(exit2 = 2) |   -.001879     .0033777    -.0090399     .0043757

Session Lengths:

    simqi fd(prval(1)) changex(totalday 67 263)

    First Difference: totalday 67 263

           Quantity of Interest |       Mean    Std. Err.      [95% Conf. Interval]
    ----------------------------+--------------------------------------------------
                  dPr(exit = 1) |  -.0972276     .0079422    -.1131462    -.0815021

Turnover in the Subsequent Session:

    simqi fd(prval(1)) changex(turn1n 12 34)

    First Difference: turn1n 12 34

           Quantity of Interest |       Mean    Std. Err.      [95% Conf. Interval]
    ----------------------------+--------------------------------------------------
                  dPr(exit = 1) |    .129404     .0089526     .1123719     .1465948

Appointment Power:

    simqi fd(prval(1)) changex(appoint 0 1)

    First Difference: appoint 0 1

           Quantity of Interest |       Mean    Std. Err.      [95% Conf. Interval]
    ----------------------------+--------------------------------------------------
                  dPr(exit = 1) |  -.0887804     .0079899    -.1038538     -.072931

Size of House:

    simqi fd(prval(1)) changex(size 51 180)

    First Difference: size 51 180

           Quantity of Interest |       Mean    Std. Err.      [95% Conf. Interval]
    ----------------------------+--------------------------------------------------
                  dPr(exit = 1) |  -.1348364     .0071777    -.1484254    -.1201581

Table 3.5. Effects of Significant Predictors of the Probability that Leader Loses Power Even Though Party Retains Control.

    Variable                       Shift in Variable (from, to)   Shift in the Probability of Losing Power
    Session Length                 (67, 263)                      10% lower
    Turnover Rate                  (12%, 34%)                     13% higher
    Size of House (in members)     (51, 180)                      13% lower
    Committee Appointment Power    (0, 1)                         9% lower
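For readers who want to see the logic behind the estsimp, setx, and simqi steps, here is a rough Python sketch of that kind of simulation. It is not CLARIFY’s code. The data are simulated from made-up parameters, the model is a two-parameter logit rather than the multinomial logit used above, and every name in it (fit_logit, simulate_first_difference, TRUE_B0, TRUE_B1) is hypothetical. It fits the model by Newton-Raphson, draws 1000 parameter vectors from a normal distribution centered on the estimates with their estimated covariance, and summarizes the resulting first differences.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def logit_prob(xb):
    """Logit form: Pr(y = 1) = 1 / (1 + exp(-x*beta))."""
    return 1.0 / (1.0 + math.exp(-xb))


# Hypothetical data (not from the notes): one explanatory variable x and a
# 0/1 outcome y generated from a logit model with made-up parameters.
TRUE_B0, TRUE_B1 = -1.0, 0.8
xs = [random.uniform(-2.0, 2.0) for _ in range(500)]
ys = [1 if random.random() < logit_prob(TRUE_B0 + TRUE_B1 * x) else 0 for x in xs]


def fit_logit(xs, ys, max_iter=50, tol=1e-10):
    """Newton-Raphson for an intercept-plus-slope logit.

    Returns the estimates (b0, b1) and the distinct elements (v00, v01, v11)
    of their covariance matrix, the inverse of the information matrix.
    """
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = i00 = i01 = i11 = 0.0
        for x, y in zip(xs, ys):
            p = logit_prob(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p          # gradient of the log likelihood
            g1 += (y - p) * x
            i00 += w             # information matrix (negative Hessian)
            i01 += w * x
            i11 += w * x * x
        det = i00 * i11 - i01 * i01
        v00, v01, v11 = i11 / det, -i01 / det, i00 / det
        step0 = v00 * g0 + v01 * g1  # Newton step: inverse information times gradient
        step1 = v01 * g0 + v11 * g1
        if abs(step0) + abs(step1) < tol:
            break                # converged: no direction improves the likelihood
        b0, b1 = b0 + step0, b1 + step1
    return (b0, b1), (v00, v01, v11)


def simulate_first_difference(est, cov, x_from, x_to, draws=1000):
    """Draw parameter vectors from N(est, cov); return mean and 95% interval
    of the change in Pr(y = 1) when x moves from x_from to x_to."""
    (b0, b1), (v00, v01, v11) = est, cov
    # Cholesky factor of the 2x2 covariance matrix
    l00 = math.sqrt(v00)
    l10 = v01 / l00
    l11 = math.sqrt(v11 - l10 * l10)
    diffs = []
    for _ in range(draws):
        z0, z1 = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
        s0 = b0 + l00 * z0
        s1 = b1 + l10 * z0 + l11 * z1
        diffs.append(logit_prob(s0 + s1 * x_to) - logit_prob(s0 + s1 * x_from))
    diffs.sort()
    mean = sum(diffs) / draws
    return mean, diffs[int(0.025 * draws)], diffs[int(0.975 * draws) - 1]


estimates, covariance = fit_logit(xs, ys)
fd_mean, fd_low, fd_high = simulate_first_difference(estimates, covariance, -1.0, 1.0)
print("estimates:", estimates)
print("first difference (x from -1 to 1):", fd_mean)
print("95% interval:", (fd_low, fd_high))
```

The mean of the simulated differences plays the role of the Mean column in the output above, and the 2.5th and 97.5th percentiles play the role of the confidence interval. With more explanatory variables the same idea applies, holding the other variables at chosen values as setx does, but the 2x2 algebra here would need a general matrix routine.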