IEOR 165 – Lecture 13
Maximum Likelihood Estimation

1 Motivating Problem

Suppose we are working for a grocery store, and we have decided to model the service time of an individual using the express lane (for 10 items or less) with an exponential distribution. An exponential service time is a common assumption in basic queuing theory models. Recall that a random variable X with exponential distribution is denoted by X ∼ E(λ), where λ > 0 is the rate; in our context, the rate is better understood to be the service rate. The exponential distribution has pdf

\[
f_X(u) = \begin{cases} \lambda \exp(-\lambda u), & \text{if } u \geq 0, \\ 0, & \text{otherwise.} \end{cases}
\]

In our motivating problem, suppose the data X_i for i = 1, . . . , n are assumed to be iid measurements of the service time. The first idea we might have to estimate the service rate is to observe that the mean of an exponential distribution is 1/λ. And so we could estimate the service rate using the equation

\[
\hat{\lambda} = \left( \frac{1}{n} \sum_{i=1}^n X_i \right)^{-1}.
\]

The intuition is that the sample average converges to the expectation as we get more data. And since the service rate is assumed to be λ > 0 in the definition of an exponential random variable, it turns out that the inverse of the sample average will converge to the service rate λ.

The potential weakness of this approach is that, though it will eventually give us the correct answer, we are not using information we know about the underlying distribution of the data; it may be possible to get an accurate estimate of the service rate using less data with a more sophisticated approach.

Recall that the likelihood function for a parametric distribution P_θ with pdf f_θ(u) is defined as

\[
L(X, \theta) = \prod_{i=1}^n f_\theta(X_i).
\]

One alternative idea is to choose the value of θ that maximizes this likelihood. And so in order to be able to solve these problems, we have to first discuss optimization problems.

2 Nonlinear Programming

Consider the following optimization problem (P):

\[
\begin{aligned}
\min \quad & f(x) \\
\text{s.t.} \quad & x \in \mathbb{R}^n \\
& g_i(x) \leq 0, \quad \forall i = 1, \ldots, m \\
& h_i(x) = 0, \quad \forall i = 1, \ldots, k,
\end{aligned}
\]

where f(x), g_i(x), h_i(x) are continuously differentiable functions. We call the function f(x) the objective, and we define the feasible set X as

\[
\mathcal{X} = \{x \in \mathbb{R}^n : g_i(x) \leq 0 \ \forall i = 1, \ldots, m \ \wedge \ h_i(x) = 0 \ \forall i = 1, \ldots, k\}.
\]

Note that this formulation also incorporates maximization problems such as max{f(x) | x ∈ X} by rewriting the problem as min{−f(x) | x ∈ X}.

Let B(x, r) = {y ∈ R^n : ‖x − y‖ ≤ r} be a ball centered at point x with radius r. A point x∗ is a local minimizer for (P) if there exists ρ > 0 such that f(x∗) ≤ f(x) for all x ∈ X ∩ B(x∗, ρ). A point x∗ is a global minimizer for (P) if f(x∗) ≤ f(x) for all x ∈ X. Note that a global minimizer is also a local minimizer, but the converse need not be true.

2.1 Unconstrained Optimization

First consider the special case where f : R → R. When (P) does not have any constraints, we know from calculus (specifically Fermat's theorem) that the global minimum must occur either (i) at a point where the slope is zero, f′(x) = 0, (ii) in the limit x → −∞, or (iii) in the limit x → ∞. More rigorously, the theorem states that if f′(x) ≠ 0 for some x ∈ R, then this x is not a local minimum. This result extends naturally to the multivariate case where f : R^n → R, by instead looking for points where the gradient is zero, ∇f(x) = 0.

This result is useful because it gives one possible approach to solve (P) in the case where there are no constraints: we can find all points where the gradient is zero, evaluate the function in the limits as x tends to −∞ or ∞, and then select the minimum amongst these points.
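As a quick illustration, here is a minimal sketch of this stationary-point approach on a one-dimensional unconstrained problem. It is not from the original notes: the quadratic f(x) = (x − 2)² + 1, the bracketing interval, and the use of SciPy root-finding are illustrative choices.

```python
from scipy.optimize import brentq

# Illustrative unconstrained problem (not from the notes): f(x) = (x - 2)^2 + 1.
def f(x):
    return (x - 2.0) ** 2 + 1.0

# Its derivative f'(x) = 2(x - 2); candidate minimizers are the roots of f'.
def f_prime(x):
    return 2.0 * (x - 2.0)

# Find the point where the slope is zero by root-finding on f' over a bracket.
x_star = brentq(f_prime, -10.0, 10.0)

# Since f(x) -> infinity as x -> +/- infinity, this stationary point is the
# global minimizer.
print(f"stationary point x* = {x_star:.4f}, f(x*) = {f(x_star):.4f}")
```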
One natural question to ask is how we can extend this approach to the general case (P).

2.2 Karush-Kuhn-Tucker (KKT) Conditions

If x∗ ∈ X is a local minimizer for (P), then under some technical conditions there exist λ_i for i = 1, . . . , m and µ_i for i = 1, . . . , k such that the following equations, known as the Karush-Kuhn-Tucker (KKT) conditions, hold:

\[
\begin{aligned}
& \nabla f(x^*) + \sum_{i=1}^m \lambda_i \nabla g_i(x^*) + \sum_{i=1}^k \mu_i \nabla h_i(x^*) = 0 \\
& \lambda_i \geq 0, \quad \forall i = 1, \ldots, m \\
& \lambda_i g_i(x^*) = 0, \quad \forall i = 1, \ldots, m.
\end{aligned}
\]

As in the unconstrained case, this is a necessary (but not sufficient) condition for optimality.

2.3 Example

Consider the following optimization problem

\[
\begin{aligned}
\min \quad & x \\
\text{s.t.} \quad & x \in \mathbb{R} \\
& -x \leq 0.
\end{aligned}
\]

The KKT conditions are that the local minima x∗ satisfy

\[
\begin{aligned}
& 1 - \lambda_1 = 0 \\
& \lambda_1 \geq 0 \\
& -\lambda_1 x^* = 0.
\end{aligned}
\]

All possible combinations of (λ_1, x) for which the KKT conditions hold can now be computed. From the condition 1 − λ_1 = 0, we have that λ_1 = 1. Combining this with the condition −λ_1 x = 0 gives x = 0. This is the only possible solution of the KKT conditions, and since any local minimizer must satisfy them, the local minimizer must also be the global minimizer. Summarizing, the global minimizer is x∗ = 0.

3 Maximum Likelihood Estimation

The idea of maximum likelihood estimation (MLE) is to generate an estimate of some unknown parameters by solving the following optimization problem

\[
\max\{L(X, \theta) \mid \theta \in \Theta\} = \max\Big\{\prod_{i=1}^n f_\theta(X_i) \;\Big|\; \theta \in \Theta\Big\}.
\]

Here, the constraint θ ∈ Θ is meant to be short-hand notation for the inequality and equality constraints discussed earlier.

3.1 Example: Service Rate of Exponential Distribution

To get a better understanding of how this approach works, consider the case of using MLE to estimate the service rate of an exponential. The likelihood is

\[
L(X, \lambda) = \prod_{i=1}^n \lambda \exp(-\lambda X_i) = \lambda^n \exp\Big(-\lambda \sum_{i=1}^n X_i\Big),
\]

and we would like to solve the problem

\[
\max\Big\{\lambda^n \exp\Big(-\lambda \sum_{i=1}^n X_i\Big) \;\Big|\; \lambda > 0\Big\}.
\]

This formulation does not exactly fit the framework described above, because strict inequalities are somewhat problematic for optimization problems: a minimum (or maximum) may fail to exist. And so we will instead consider the optimization problem

\[
\max\Big\{\lambda^n \exp\Big(-\lambda \sum_{i=1}^n X_i\Big) \;\Big|\; \lambda \geq 0\Big\},
\]

and we will hope that λ = 0 is not the solution, because this would indicate a problem.

In principle, we can solve this problem by finding all points satisfying the KKT conditions. However, sometimes ad hoc approaches can be easier. For instance, suppose we find the points at which the gradient is zero:

\[
n\lambda^{n-1}\exp\Big(-\lambda \sum_{i=1}^n X_i\Big) - \lambda^n \exp\Big(-\lambda \sum_{i=1}^n X_i\Big) \cdot \sum_{i=1}^n X_i = 0.
\]

Now assuming λ > 0, we can simplify this to

\[
n - \lambda \cdot \sum_{i=1}^n X_i = 0 \quad\Rightarrow\quad \lambda^* = \Big(\frac{1}{n}\sum_{i=1}^n X_i\Big)^{-1}.
\]

Since the random variables satisfy X_i > 0 when drawn from an exponential distribution, this estimate is strictly positive. The objective at λ = 0 is zero, the objective tends to zero as λ → ∞, and the objective is strictly positive for λ > 0. Consequently, since the above solution is the only stationary point, it must be the global maximum.

Observe that in this case the MLE estimate is equivalent to the inverse of the sample average. However, in other cases the MLE estimate can differ from the estimates we previously considered.
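As a rough numerical check of this example, the following sketch maximizes the exponential likelihood over λ ≥ 0 and compares the maximizer to the inverse of the sample average. It is not from the original notes: the simulated service times, the sample size of 10, the assumed true rate of 0.5, and the use of SciPy are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical data: 10 simulated express-lane service times, drawn from an
# exponential distribution with an assumed true rate of 0.5.
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0 / 0.5, size=10)
n = len(x)

# Likelihood L(X, lambda) = lambda^n * exp(-lambda * sum(X_i)).
def likelihood(lam):
    return lam ** n * np.exp(-lam * x.sum())

# Maximize the likelihood over lambda >= 0 by minimizing its negative.
result = minimize_scalar(lambda lam: -likelihood(lam), bounds=(0.0, 10.0),
                         method="bounded")

# Closed-form answer derived above: the inverse of the sample average.
print(f"numerical maximizer of L: {result.x:.4f}")
print(f"1 / sample average:       {1.0 / x.mean():.4f}")
```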
3.2 Transformation of the Objective Function

In some cases, it can be easier to solve a transformed version of the optimization problem. For instance, two transformations that are useful in practice are:

1. instead of maximizing the likelihood, we could equivalently minimize the negative of the likelihood;

2. instead of maximizing the likelihood, we could equivalently maximize the likelihood composed with a strictly increasing function.

In fact, one very useful transformation is to minimize the negative log-likelihood. The utility of such transformations is most clearly elucidated by an example.

3.3 Example: Mean and Variance of Gaussian

Suppose X_i ∼ N(µ, σ²) are iid measurements from a Gaussian with unknown mean and variance. Then the MLE estimate is given by the solution to

\[
\max\Big\{(2\pi\sigma^2)^{-n/2}\exp\Big(-\sum_{i=1}^n (X_i - \mu)^2/(2\sigma^2)\Big) \;\Big|\; \sigma^2 > 0\Big\}.
\]

If we instead minimize the negative log-likelihood, then the problem we would like to solve is

\[
\min\Big\{(n/2)\log(2\pi\sigma^2) + \sum_{i=1}^n (X_i - \mu)^2/(2\sigma^2) \;\Big|\; \sigma^2 > 0\Big\}.
\]

If we consider the objective function h(µ, σ²) = (n/2) log(2πσ²) + Σ_{i=1}^n (X_i − µ)²/(2σ²) (that is, we are treating σ² as a single variable), then setting its gradient equal to zero gives

\[
\begin{bmatrix} h_\mu \\ h_{\sigma^2} \end{bmatrix}
= \begin{bmatrix} -\sum_{i=1}^n 2(X_i - \mu)/(2\sigma^2) \\ n/(2\sigma^2) - \sum_{i=1}^n (X_i - \mu)^2/(2(\sigma^2)^2) \end{bmatrix} = 0,
\]

where h_µ and h_{σ²} denote the partial derivatives of h(µ, σ²) with respect to µ and σ², respectively. To solve this, we first begin with

\[
h_\mu = -\sum_{i=1}^n 2(X_i - \mu)/(2\sigma^2) = 0 \;\Rightarrow\; \sum_{i=1}^n (X_i - \mu) = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{n}\sum_{i=1}^n X_i,
\]

where we have made use of the fact that σ² > 0. Next, note that

\[
h_{\sigma^2} = n/(2\sigma^2) - \sum_{i=1}^n (X_i - \hat{\mu})^2/(2(\sigma^2)^2) = 0 \;\Rightarrow\; n\sigma^2 = \sum_{i=1}^n (X_i - \hat{\mu})^2 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \hat{\mu})^2,
\]

where we multiplied through by 2(σ²)² in the second step.

The estimate of the mean µ̂ = (1/n) Σ_{i=1}^n X_i is simply the sample average; however, the estimate of the variance σ̂² = (1/n) Σ_{i=1}^n (X_i − µ̂)² is the biased estimate of the variance (cf. the unbiased estimator s² = (1/(n−1)) Σ_{i=1}^n (X_i − µ̂)² used for hypothesis testing). So in this case, the MLE estimates differ from the estimates we have previously used for hypothesis testing.
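To tie the last two subsections together, the following sketch minimizes the negative log-likelihood h(µ, σ²) numerically and compares the result with the closed-form MLE estimates and the unbiased estimator s². It is not from the original notes: the simulated sample, its size of 50, the assumed true parameters, and the use of SciPy's Nelder-Mead routine are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: 50 simulated Gaussian measurements with assumed true
# mean 3 and standard deviation 2.
rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=50)
n = len(x)

# Negative log-likelihood h(mu, sigma^2), treating sigma^2 as one variable.
def h(params):
    mu, sigma2 = params
    if sigma2 <= 0:
        return np.inf
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum((x - mu) ** 2) / (2 * sigma2)

# Minimize numerically (derivative-free, to keep the sketch short).
result = minimize(h, x0=[0.0, 1.0], method="Nelder-Mead")
mu_num, sigma2_num = result.x

# Closed-form MLE estimates derived above.
mu_hat = x.mean()
sigma2_hat = np.mean((x - mu_hat) ** 2)      # biased MLE estimate
s2 = np.sum((x - mu_hat) ** 2) / (n - 1)     # unbiased estimator, for comparison

print(f"numerical MLE:   mu = {mu_num:.4f}, sigma^2 = {sigma2_num:.4f}")
print(f"closed-form MLE: mu = {mu_hat:.4f}, sigma^2 = {sigma2_hat:.4f}")
print(f"unbiased s^2:    {s2:.4f}")
```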