Thursday, September 30th. Constructing a Likelihood Function

I. A Survey of Stochastic Components
II. The Likelihood Function for a Normal Variable
III. Summarizing a Likelihood Function

V. A Survey of Stochastic Components

A. Approach and Notation. King’s Chapter 3 looks at many possible forms that the stochastic component of one observation yi of the random variable Yi could take, while Chapter 4 introduces the systematic component of yi as it varies over N observations. Let S be the sample space, the set of all possible events, and let zki be one event. This event is a set of outcomes of type k. Let yji be a real number representing one possible outcome of an experiment.

B. Useful Axioms of Probability. These tell us what the univariate probability distributions that we will survey look like in general.
   i. For any event zki, Pr(zki) ≥ 0.
   ii. Pr(S) = 1.
   iii. If z1i, …, zki are k mutually exclusive events, then Pr(z1i ∪ z2i ∪ … ∪ zki) = Pr(z1i) + Pr(z2i) + … + Pr(zki).

C. Results for stochastically independent random variables. These will be most useful when we construct likelihood functions.
   i. Pr(Y1i, Y2i) = Pr(Y1i)Pr(Y2i)
   ii. Pr(Y1i | Y2i) = Pr(Y1i)

D. What is a Univariate Probability Distribution? It is a complete accounting of the probability that Yi takes on any particular value yi. For discrete outcomes, you can write out (and graph) a probability mass function (pmf). For continuous outcomes, you can write out or graph a probability density function (pdf). Each function is derived from substantive assumptions about the underlying “data generating process.” You can develop your own distribution, or you can select one of the many off-the-shelf functions that King surveys, such as the following (a short simulation sketch appears after the list):
   i. The Bernoulli Distribution. This is used when a variable can take on only two mutually exclusive and exhaustive outcomes, such as a two-party election where either one party wins or the other does. The algebraic representation of the pmf incorporates a systematic component that measures how fair the “die” is that determines which outcome takes place.
   ii. The Binomial Distribution. The process that generates this sort of data is a set of many Bernoulli random variables, and we observe their sum. It could be how many heads you get in six flips of a coin, or how many times a person voted over six elections. It requires the assumption that the Bernoulli trials are “i.i.d.,” or independent and identically distributed. This means that the coin (or person) has no memory, and that the probability of getting heads (or voting) is the same in each trial.
   iii. Extended Beta-Binomial. This relaxes the i.i.d. assumption of the Binomial Distribution, and could be useful for looking at yes-or-no votes cast by 100 Senators.
   iv. Poisson. For a count with no upper bound, when the occurrence of one event has no influence on the expected number of subsequent events.
   v. Negative Binomial. Just like the Poisson, but the rate of occurrence of events varies according to the gamma distribution.
   vi. Normal Distribution. This is a continuous variable, where the disturbance term is the sum of a large number of independent but unobserved factors. A possible substantive example is presidential approval over time. The random variable is symmetric and has events with nonzero probability occurring everywhere. This means that a normal distribution cannot generate a random variable that is discrete, skewed, or bounded. Its pdf can be written out as:
      fN(yi | μ, σ²) = (2πσ²)^(-1/2) exp[-(yi − μ)²/(2σ²)],
      where π ≈ 3.14 and exp[a] = e^a.
   vii. Log-Normal Distribution. It is like the Normal Distribution, but with no negative values.
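The simulation sketch promised above is a minimal illustration, assuming Python with NumPy and SciPy (tools that do not appear in King or in these notes): it draws from a few of the data generating processes just surveyed and checks the normal pdf formula in vi. against a library implementation. All parameter values are arbitrary illustrative choices.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)

    # Discrete data generating processes from the survey above
    bernoulli_draws = rng.binomial(n=1, p=0.6, size=10)   # one two-party contest: win (1) or lose (0)
    binomial_draws = rng.binomial(n=6, p=0.6, size=10)    # times a person voted across six elections
    poisson_draws = rng.poisson(lam=3, size=10)           # event counts with no upper bound
    print(bernoulli_draws, binomial_draws, poisson_draws)

    # The normal pdf written exactly as in vi.: (2*pi*sigma^2)^(-1/2) * exp[-(y - mu)^2 / (2*sigma^2)]
    def normal_pdf(y, mu=0.0, sigma=1.0):
        return (2 * np.pi * sigma ** 2) ** -0.5 * np.exp(-(y - mu) ** 2 / (2 * sigma ** 2))

    y = 0.37
    print(normal_pdf(y), stats.norm.pdf(y, loc=0.0, scale=1.0))  # the two values agree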
VI. The Likelihood Function for a Normal Variable

A. Begin by writing out the stochastic component in its functional form (the pdf that returns the probability of getting yi in any single observation, given μi). This is the traditional probability, and it is proportional to the likelihood.
      fN(yi | μi, σ²) = (2πσ²)^(-1/2) exp[-(yi − μi)²/(2σ²)]

B. If we can assume that there is stochastic independence across our observations (no autocorrelation), we can use the Pr(Y1i, Y2i) = Pr(Y1i)Pr(Y2i) rule to build a joint probability distribution for our observations:
      f(y | μ) = Π fN(yi | μi)
      f(y | μ) = Π (2πσ²)^(-1/2) exp[-(yi − μi)²/(2σ²)]

C. In the next step, called “reparameterization,” we substitute a systematic component in for our generic parameter. In this case, we substitute a linear function:
      f(y | μ) = Π (2πσ²)^(-1/2) exp[-(yi − xiβ)²/(2σ²)]

D. Now we take this traditional probability, which returns an absolute probability, and use the likelihood axiom to get something that is proportional to the inverse probability. We also want to work with an expression that is mathematically tractable, and since any monotonic function of the traditional probability can serve as a relative measure of likelihood, for convenience we will take the natural log of this function. We can also use the “Fisher-Neyman Factorization Lemma,” which proves that in a likelihood function we can drop every term not depending on the parameters, to get rid of k(y). Finally, we are also going to use algebraic tricks like ln(abc) = ln(a) + ln(b) + ln(c) and ln(a^b) = b·ln(a).
      L(β̃, σ̃² | y) = k(y) Pr(y | β̃, σ̃²)
      L(β̃, σ̃² | y) = k(y) Π fN(yi | β̃, σ̃²)
      L(β̃, σ̃² | y) ∝ Π fN(yi | β̃, σ̃²)
      L(β̃, σ̃² | y) ∝ Π (2πσ̃²)^(-1/2) exp[-(yi − xiβ̃)²/(2σ̃²)]
      ln L(β̃, σ̃² | y) = Σ ln{(2πσ̃²)^(-1/2) exp[-(yi − xiβ̃)²/(2σ̃²)]}
      ln L(β̃, σ̃² | y) = Σ {-(1/2) ln(2πσ̃²) − (yi − xiβ̃)²/(2σ̃²)}
      ln L(β̃, σ̃² | y) = Σ {-(1/2) ln(2π) − (1/2) ln(σ̃²) − (1/(2σ̃²))(yi − xiβ̃)²}
      ln L(β̃, σ̃² | y) = Σ {-(1/2) ln(σ̃²) − (1/(2σ̃²))(yi − xiβ̃)²}
   (Products Π and sums Σ run over the observations i = 1, …, n; the tildes mark hypothetical parameter values; terms that do not involve β̃ or σ̃², such as ln k(y) and -(1/2) ln(2π), are dropped along the way.)

VII. Summarizing a Likelihood Function

A. This log-likelihood function is an expression representing a function that could be graphed, but you would need as many dimensions as: the likelihood value + the constant term + the # of independent variables + the # of ancillary parameters like σ² (we have assumed homoskedasticity in this model, making σ² constant, but we could have chosen to model its variation across observations). So instead of using all of the information in the function, we will summarize the function by finding its maximum, the value of β that gives us the greatest likelihood of having generated the data.

B. Analytical Method. For relatively simple likelihood functions, we can find a maximum by going through the following four steps (a short symbolic sketch of the same steps appears after this subsection).
   i. Take the derivative of the log-likelihood with respect to your parameter vector. The main reason that we took the log of the likelihood function is that taking the derivative of a sum is much easier than taking the derivative of a product.
   ii. Set the derivative equal to zero.
   iii. Solve for your parameter, thus finding an extreme point.
   iv. Find out whether this extreme point is a maximum or a minimum by taking the second derivative of the log-likelihood function. If it is negative, the function bows downward before and after the extreme point and you have a (possibly local) maximum.
   For our linear model, the analytical solution for the variance parameter is:
      σ² = (1/n) Σ (yi − xiβ)²,
   which should look familiar! You are trying to minimize the squared error here, and thus OLS can be justified by maximum likelihood as well as by the fact that it is a convenient way to summarize a relationship and that it has all of the properties that are desirable in an estimator.
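The symbolic sketch promised above walks through steps i–iv, assuming Python with SymPy (not a tool used in these notes) and the simplest possible case: an intercept-only model, so the systematic component xiβ reduces to a single constant μ, with three symbolic observations standing in for the data.

    import sympy as sp

    mu = sp.Symbol('mu', real=True)
    sigma2 = sp.Symbol('sigma2', positive=True)
    y = sp.symbols('y1:4', real=True)   # three symbolic observations: y1, y2, y3

    # Log-likelihood for an intercept-only normal model, with the constant
    # -(n/2)*ln(2*pi) already dropped because it does not involve the parameters
    loglik = sum(-sp.Rational(1, 2) * sp.log(sigma2) - (yi - mu) ** 2 / (2 * sigma2)
                 for yi in y)

    # Step i: take the derivative with respect to the parameter of interest
    score_mu = sp.diff(loglik, mu)

    # Steps ii-iii: set the derivative to zero and solve for the parameter
    mu_hat = sp.solve(sp.Eq(score_mu, 0), mu)[0]
    print(sp.simplify(mu_hat))          # (y1 + y2 + y3)/3, the sample mean

    # Step iv: the second derivative is negative, so the extreme point is a maximum
    print(sp.diff(loglik, mu, 2))       # -3/sigma2

    # Repeating the four steps for sigma2 reproduces the formula in the notes
    sigma2_hat = sp.solve(sp.Eq(sp.diff(loglik, sigma2), 0), sigma2)[0]
    print(sp.simplify(sigma2_hat))      # ((y1-mu)^2 + (y2-mu)^2 + (y3-mu)^2)/3

Solving symbolically like this is only feasible for simple likelihoods, which is exactly why the numerical methods described next matter.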
C. Numerical/Computational Methods. This is what Stata does, because some likelihood functions do not have an analytical solution. You can write out a likelihood function and then begin with a starting value (or vector of values) for the parameter of interest. Then you can use an algorithm (a recipe for a repeated process) to try out better and better combinations of parameter values until you maximize the likelihood. The Newton-Raphson and Gauss-Newton algorithms are common ones, and they use linear algebra to take derivatives with respect to the parameter vector. Basically, they start with a parameter vector, get its likelihood, then look at the gradient of the likelihood function to see which direction they should move in to get a higher likelihood value, and keep going until they cannot move in any direction and get a higher likelihood.
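To make that description concrete, here is a minimal sketch of the same hill-climbing idea, assuming Python with NumPy and SciPy rather than Stata; the simulated data, the starting vector of zeros, and the choice of the quasi-Newton BFGS algorithm (a relative of Newton-Raphson) are illustrative assumptions, not part of the notes.

    import numpy as np
    from scipy import optimize

    rng = np.random.default_rng(0)
    n = 200
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=n)   # "true" beta = (1, 2), sigma = 1.5

    def neg_loglik(theta):
        """Negative normal log-likelihood; minimizing it maximizes the likelihood."""
        b0, b1, log_sigma = theta                        # log sigma keeps sigma positive
        sigma2 = np.exp(log_sigma) ** 2
        resid = y - (b0 + b1 * x)
        return 0.5 * np.sum(np.log(2 * np.pi * sigma2) + resid ** 2 / sigma2)

    start = np.zeros(3)                                  # an arbitrary starting vector
    result = optimize.minimize(neg_loglik, start, method="BFGS")
    print(result.x)        # estimates of (beta0, beta1, log sigma)
    print(result.success)  # did the algorithm converge?

The optimizer starts at the supplied vector, uses the gradient to pick directions that lower the negative log-likelihood (that is, raise the likelihood), and stops when no further improvement is possible, just as described above.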