Thursday, September 30th. Constructing a Likelihood Function
I. A Survey of Stochastic Components
II. The Likelihood Function for a Normal Variable
III. Summarizing a Likelihood Function
V. A Survey of Stochastic Components
A. Approach and Notation. King’s Chapter 3 looks at many possible forms that the
stochastic component of one observation yi of the random variable Yi could take, while
Chapter 4 introduces the systematic component of yi as it varies over N observations. Let S
be the sample space, the set of all possible events, and let zki be one event. This event is a set
of outcomes of type k. Let yji be a real number representing one possible outcome of an
experiment.
B. Useful Axioms of Probability. These tell us what the univariate probability
distributions that we will survey look like in general.
i. For any event zki, Pr(zki) ≥ 0.
ii. Pr(S) = 1.0.
iii. If z1i, …, zki are k mutually exclusive events, then
Pr(z1i ∪ z2i ∪ … ∪ zki) = Pr(z1i) + Pr(z2i) + … + Pr(zki).
C. Results for Stochastically Independent Random Variables. These will be most
useful when we construct likelihood functions.
i. Pr(Y1i, Y2i) = Pr(Y1i)Pr(Y2i)
ii. Pr(Y1i | Y2i) = Pr(Y1i)
D. What is a Univariate Probability Distribution? It is a complete accounting of the
probability that Yi takes on any particular value yi. For discrete outcomes, you can write
out (and graph) a probability mass function (pmf). For continuous outcomes,
you can write out or graph a probability density function (pdf). Each function is
derived from substantive assumptions about the underlying “data generating
process.” You can develop your own distribution, or you can select one of the many
off-the-shelf functions that King surveys, such as:
i. The Bernoulli Distribution. This is used when a variable can only
take on two mutually exclusive and exhaustive outcomes, such as a
two-party election where either one party wins or the other does.
The algebraic representation of the pmf incorporates a systematic
component that measures how fair the coin is that determines which
outcome takes place.
ii. The Binomial Distribution. The process that generates this sort of
data is a set of many Bernoulli random variables, and we observe
their sum. It could be how many heads you get in six flips of a coin,
or how many times a person voted over six elections. It requires the
assumption that the Bernoulli trials are “i.i.d.,” or independent and
identically distributed. This means that the coin (or person) has no
memory, and that the probability of getting heads (or voting) is the
same in each trial.
iii. The Extended Beta-Binomial Distribution. This relaxes the i.i.d. assumption of the
Binomial Distribution, and could be useful for looking at yes-or-no
votes cast by 100 Senators.
iv. The Poisson Distribution. For a count with no upper bound, when the occurrence of
one event has no influence on the expected number of subsequent events.
v. The Negative Binomial Distribution. Just like the Poisson, but the rate of occurrence
of events varies according to the gamma distribution.
vi. The Normal Distribution. This is a continuous variable, where the
disturbance term is the sum of a large number of independent but
unobserved factors. A possible substantive example is presidential
approval over time. The random variable is symmetric, and events
everywhere on the real line have nonzero probability. This means
that a normal distribution cannot generate a random variable that is
discrete, skewed, or bounded. Its pdf can be written out as:
fN(yi | μ, σ) = (2πσ²)^(-½) exp[-(yi - μ)²/(2σ²)]
where π = 3.14159… and exp[a] = e^a.
vii. The Log-Normal Distribution. It is like the Normal Distribution, but
with no negative values. (A short simulation sketch of several of these
distributions follows this list.)
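To make the survey concrete, here is a minimal simulation sketch (assuming Python with numpy and scipy, which are not part of these notes; all parameter values are arbitrary) that draws from several of these stochastic components and checks the written-out normal pdf against a library implementation.

```python
# Illustrative sketch only: draws from several of the surveyed distributions
# and checks the written-out normal pdf against scipy's implementation.
# Assumes numpy and scipy; all parameter values are arbitrary.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

bernoulli = rng.binomial(n=1, p=0.6, size=10)            # two-party win/loss
binomial  = rng.binomial(n=6, p=0.5, size=10)            # heads in six coin flips
poisson   = rng.poisson(lam=3.0, size=10)                # unbounded event count
negbin    = rng.negative_binomial(n=5, p=0.5, size=10)   # count with gamma-varying rate
normal    = rng.normal(loc=50.0, scale=5.0, size=10)     # e.g., approval over time
lognormal = rng.lognormal(mean=0.0, sigma=1.0, size=10)  # like normal, but strictly positive

# Check the pdf in item vi against scipy.stats.norm.pdf at one point.
yi, mu, sigma = 1.2, 0.0, 1.0
by_hand = (2 * np.pi * sigma**2) ** -0.5 * np.exp(-(yi - mu) ** 2 / (2 * sigma**2))
assert np.isclose(by_hand, norm.pdf(yi, loc=mu, scale=sigma))
```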
VI. The Likelihood Function for a Normal Variable
A. Begin by writing out the stochastic component in its functional form (the pdf that
returns the probability of getting yi in any single observation, given μi). This is the traditional
probability, and it is proportional to the likelihood.
fN(yi | μ, σ) = (2πσ²)^(-½) exp[-(yi - μ)²/(2σ²)]
B. If we can assume that there is stochastic independence across our observations
(no autocorrelation), we can use the Pr(Y1i, Y2i) = Pr(Y1i)Pr(Y2i) rule to build a joint
probability distribution for our observations:
f(y | μ) = Π fN(yi | μi)
f(y | μ) = Π (2πσ²)^(-½) exp[-(yi - μi)²/(2σ²)]
C. In the next step, called “reparameterization,” we substitute in a systematic
component for our generic parameter. In this case, we substitute a linear function.
f(y | μ) = Π (2πσ²)^(-½) exp[-(yi - xiβ)²/(2σ²)]
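A quick numerical sanity check of the independence step (a sketch assuming Python with numpy and scipy; the values are arbitrary): the joint density of independent normal observations equals the product of their individual densities.

```python
# Sketch: under stochastic independence, the joint density is the product of
# the individual normal densities. Values below are arbitrary illustrations.
import numpy as np
from scipy.stats import norm, multivariate_normal

y     = np.array([1.0, 2.5, 0.3])
mu    = np.array([1.2, 2.0, 0.0])   # systematic component, one mu_i per observation
sigma = 0.8

product_of_marginals = np.prod(norm.pdf(y, loc=mu, scale=sigma))
joint_with_diag_cov  = multivariate_normal.pdf(y, mean=mu, cov=sigma**2 * np.eye(3))
assert np.isclose(product_of_marginals, joint_with_diag_cov)
```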
D. Now we take this traditional probability, which returns an absolute probability,
and use the likelihood axiom to get something that is proportional to the inverse probability.
We also want to work with an expression that is mathematically tractable, and since any
monotonic function of the traditional probability can serve as a relative measure of
likelihood, for convenience we will take the natural log of this function. We can also use the
“Fisher-Neyman Factorization Lemma,” which proves that in a likelihood function we can
drop every term that does not depend on the parameters, to get rid of k(y). Finally, we are also
going to use algebraic identities like ln(abc) = ln(a) + ln(b) + ln(c) and ln(a^b) = b·ln(a).
L(β̃, σ̃² | y) = k(y) Pr(y | β̃, σ̃²)
L(β̃, σ̃² | y) = k(y) Π_{i=1}^{n} fN(yi | β̃, σ̃²)
L(β̃, σ̃² | y) ∝ Π_{i=1}^{n} fN(yi | β̃, σ̃²)
L(β̃, σ̃² | y) ∝ Π_{i=1}^{n} (2πσ̃²)^(-½) exp[-(yi - xiβ̃)²/(2σ̃²)]
ln L(β̃, σ̃² | y) = Σ_{i=1}^{n} ln{(2πσ̃²)^(-½) exp[-(yi - xiβ̃)²/(2σ̃²)]}
ln L(β̃, σ̃² | y) = Σ_{i=1}^{n} {-½ ln(2πσ̃²) - (yi - xiβ̃)²/(2σ̃²)}
ln L(β̃, σ̃² | y) = Σ_{i=1}^{n} {-½ ln(2π) - ½ ln(σ̃²) - (yi - xiβ̃)²/(2σ̃²)}
Dropping the -½ ln(2π) term, which does not depend on the parameters:
ln L(β̃, σ̃² | y) = Σ_{i=1}^{n} {-½ ln(σ̃²) - (yi - xiβ̃)²/(2σ̃²)}
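To connect this algebra to computation, here is a small check (a Python sketch with numpy/scipy and simulated data; the trial parameter values are assumptions of mine, not anything in the notes) that the final expression differs from the sum of full normal log-densities only by the dropped constant (n/2)·ln(2π):

```python
# Sketch: the derived log-likelihood equals the sum of full normal log-densities
# plus the dropped constant (n/2) * ln(2*pi), which does not involve the parameters.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])            # constant + one x
y = X @ np.array([1.0, 0.7]) + rng.normal(scale=2.0, size=n)     # simulated data

beta_try, sigma2_try = np.array([0.8, 0.5]), 3.0                 # arbitrary trial values

resid = y - X @ beta_try
derived = np.sum(-0.5 * np.log(sigma2_try) - resid**2 / (2 * sigma2_try))
full    = np.sum(norm.logpdf(y, loc=X @ beta_try, scale=np.sqrt(sigma2_try)))

assert np.isclose(derived, full + (n / 2) * np.log(2 * np.pi))
```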
VII. Summarizing a Likelihood Function
A. This log-likelihood function is an expression representing a function that could be
graphed, but you would need as many dimensions as: one for the likelihood value, plus one for
the constant term, plus one for each independent variable, plus one for each ancillary parameter
like σ² (we have assumed homoskedasticity in this model, making σ² constant, but we could
have chosen to model its variation across observations). So instead of using all of the
information in the function, we will summarize the function by finding its maximum, the value
of β that gives us the greatest likelihood of having generated the data.
B. Analytical Method. For relatively simple likelihood functions, we can find a
maximum by going through the following four steps.
i. Take the derivative of the log-likelihood with respect to your parameter
vector. The reason that we took the log of the likelihood function is mainly because taking
the derivative of a sum is much easier than taking the derivative of a product.
ii. Set the derivative equal to zero.
iii. Solve for your parameter, thus finding an extreme point.
iv. Find out whether this extreme point is a maximum or a minimum by taking the
second derivative of the log-likelihood function. If it is negative, the function bows
downward before and after the extreme point and you have a (possibly local) maximum.
For our linear model, setting ∂lnL/∂σ² = Σ{-1/(2σ²) + (yi - xiβ)²/(2σ⁴)} equal to zero and
solving gives the analytical solution for the variance parameter,
σ² = (1/n) Σ(yi - xiβ)², which should look familiar! You are trying to minimize
the squared error here, and thus OLS can be justified by maximum likelihood as well as by
the fact that it is a convenient way to summarize a relationship and that it has all of the
properties that are desirable in an estimator.
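As a quick numerical check of the analytical result (a Python/numpy sketch with simulated data, not part of the notes): the β estimate from the normal equations is the OLS estimate, and the ML variance estimate divides the sum of squared residuals by n rather than n - k.

```python
# Sketch: analytical ML estimates for the linear-normal model on simulated data.
# beta_hat solves the normal equations (identical to OLS); sigma2_hat divides by n.
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])         # constant + one regressor
y = X @ np.array([2.0, -0.5]) + rng.normal(scale=1.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^{-1} X'y
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / n         # ML estimate: divide by n, not n - k

print(beta_hat, sigma2_hat)
```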
C. Numerical/Computational Methods. This is what Stata does, because some
likelihood functions do not have an analytical solution. You can write out a likelihood
function and then begin with a starting value (or vector of values) for the parameter of
interest. Then you can use an algorithm (a recipe for a repeated process) to try out better
and better combinations of parameter values until you maximize the likelihood. The
Newton-Raphson and Gauss-Newton algorithms are common ones, and they use linear
algebra to take derivatives with respect to parameter vectors. Basically, they start with a
parameter vector, get its likelihood, then look at the gradient of the likelihood function to see
which direction they should move in to get a higher likelihood value, and keep going until
they cannot move in any direction and get a higher likelihood.
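A minimal numerical sketch of this idea (in Python with numpy/scipy rather than Stata; the simulated data and starting values are illustrative assumptions), using BFGS, a quasi-Newton relative of the algorithms named above, to minimize the negative log-likelihood:

```python
# Sketch: maximize the normal log-likelihood numerically by minimizing its
# negative with BFGS, a quasi-Newton relative of Newton-Raphson.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([2.0, -0.5]) + rng.normal(scale=1.5, size=n)

def neg_log_likelihood(theta):
    # theta = (beta_0, beta_1, ln(sigma^2)); the log keeps sigma^2 positive.
    beta, sigma2 = theta[:-1], np.exp(theta[-1])
    resid = y - X @ beta
    return 0.5 * np.sum(np.log(2 * np.pi * sigma2) + resid**2 / sigma2)

start = np.zeros(3)                               # arbitrary starting values
result = minimize(neg_log_likelihood, start, method="BFGS")
beta_hat, sigma2_hat = result.x[:-1], np.exp(result.x[-1])
print(beta_hat, sigma2_hat)
```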