IEOR 165 – Lecture 13
Maximum Likelihood Estimation
1 Motivating Problem
Suppose we are working for a grocery store, and we have decided to model service time of an
individual using the express lane (for 10 items or less) with an exponential distribution. An
exponential service time is a common assumption in basic queuing theory models. Recall
that a random variable X with an exponential distribution is denoted by X ∼ E(λ), where
λ > 0 is the rate; in our context, the rate is better understood as the service rate. The
exponential distribution has pdf
\[
f_X(u) = \begin{cases} \lambda \exp(-\lambda u), & \text{if } u \ge 0, \\ 0, & \text{otherwise.} \end{cases}
\]
In our motivating problem, suppose the data Xi for i = 1, . . . , n are iid measurements of the service time.
The first idea we might have to estimate the service rate is to observe that the mean of an
exponential distribution is 1/λ. And so we could estimate the service rate using the equation
\[
\hat{\lambda} = \Big( \frac{1}{n} \sum_{i=1}^{n} X_i \Big)^{-1}.
\]
The intuition is that the sample average converges to the expectation as we get more data.
And since the service rate is assumed to be λ > 0 in the definition of an exponential random
variable, it turns out that the inverse of the sample average will converge to the service rate
λ. The potential weakness of this approach is that, though it will eventually give us the
correct answer, we are not using the information we have about the underlying distribution of
the data; it may be possible to get an accurate estimate of the service rate from less data
by using a more sophisticated approach.
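As a quick illustration of this estimator (not part of the original notes; the true rate, sample size, and use of NumPy are hypothetical choices), here is a minimal Python sketch that simulates iid exponential service times and inverts the sample average:

    import numpy as np

    rng = np.random.default_rng(0)
    true_rate = 2.0                                      # hypothetical service rate
    X = rng.exponential(scale=1 / true_rate, size=200)   # iid service times (scale = 1/lambda)

    # Estimate the service rate as the inverse of the sample average.
    rate_hat = 1 / X.mean()
    print(rate_hat)  # close to 2.0 for a sample of this size

As the sample size grows, rate_hat converges to the true rate, exactly as the law-of-large-numbers argument above suggests.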
Recall the likelihood function for a parametric distribution Pθ with pdf fθ (u) is defined as
\[
L(X, \theta) = \prod_{i=1}^{n} f_\theta(X_i).
\]
One alternative idea is to choose the value of θ that maximizes this likelihood. And so, in
order to solve such problems, we first have to discuss optimization problems.
2 Nonlinear Programming
Consider the following optimization problem (P):
\[
\begin{aligned}
\min\ & f(x) \\
\text{s.t. } & x \in \mathbb{R}^n \\
& g_i(x) \le 0, \quad \forall i = 1, \dots, m \\
& h_i(x) = 0, \quad \forall i = 1, \dots, k
\end{aligned}
\]
where f (x), gi (x), hi (x) are continuously differentiable functions. We call the function f (x)
the objective, and we define the feasible set X as
\[
X = \{x \in \mathbb{R}^n : g_i(x) \le 0 \ \forall i = 1, \dots, m, \ \text{and} \ h_i(x) = 0 \ \forall i = 1, \dots, k\}.
\]
Note that this formulation also incorporates maximization problems such as max{f (x) | x ∈
X } through rewriting the problem as min{−f (x) | x ∈ X }.
Let B(x, r) = {y ∈ Rn : ‖x − y‖ ≤ r} be the ball centered at the point x with radius r. A
point x∗ is a local minimizer for (P) if there exists ρ > 0 such that f (x∗ ) ≤ f (x) for all
x ∈ X ∩ B(x∗ , ρ). A point x∗ is a global minimizer for (P) if f (x∗ ) ≤ f (x) for all x ∈ X .
Note that a global minimizer is also a local minimizer, but the converse need not be true.
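To make the distinction concrete, here is a minimal Python sketch (the function and the starting points are hypothetical choices, not from the notes): a local search method started at different points can return either a local-only minimizer or the global one.

    from scipy.optimize import minimize

    def f(x):
        # Two local minima: near x = -1.03 (the global minimizer) and x = 0.96 (local only).
        return (x[0] ** 2 - 1) ** 2 + 0.3 * x[0]

    # Nelder-Mead is a local method: it converges to whichever minimizer is near its start.
    print(minimize(f, x0=[1.5], method="Nelder-Mead").x)   # about  0.96 (local, not global)
    print(minimize(f, x0=[-1.5], method="Nelder-Mead").x)  # about -1.03 (global minimizer)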
2.1 Unconstrained Optimization
First consider the special case where f : R → R. When (P) does not have any constraints,
we know from calculus (specifically Fermat’s theorem) that the global minimum must occur
either (i) at a point where the slope is zero, f ′(x) = 0, (ii) in the limit as x → −∞, or (iii) in
the limit as x → ∞. More rigorously, the theorem states that if f ′(x) ≠ 0 for some x ∈ R,
then this x is not a local minimum. This result extends naturally to the multivariate case
where f : Rn → R, by instead looking for points where the gradient is zero, ∇f (x) = 0.
This result is useful because it gives one possible approach to solving (P) when there are
no constraints: we can find all points where the gradient is zero, evaluate the function in
the limits as x → −∞ and x → ∞, and then select the minimum amongst these candidates.
One natural question to ask is how we can extend this approach to the general case (P).
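For instance, here is a minimal Python sketch of this recipe for a single-variable problem (the objective is a hypothetical example, not one from the notes): find the real zeros of the derivative and compare the objective values, noting that this particular f tends to +∞ as x → ±∞.

    import numpy as np

    # Hypothetical unconstrained objective f(x) = x^4 - 4x^2 + x, with f'(x) = 4x^3 - 8x + 1.
    def f(x):
        return x ** 4 - 4 * x ** 2 + x

    # Candidate minimizers: the real roots of f'(x) = 0.
    roots = np.roots([4.0, 0.0, -8.0, 1.0])
    candidates = roots[np.isreal(roots)].real

    # Since f tends to +infinity in both limits, the global minimizer is the best candidate.
    best = min(candidates, key=f)
    print(best, f(best))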
2.2 Karush-Kuhn-Tucker (KKT) Conditions
If x∗ ∈ X is a local minimizer for (P), then under some technical conditions there exists
λi for i = 1, . . . , m and µi for i = 1, . . . , k such that the following equations, known as the
Karush-Kuhn-Tucker (KKT) conditions, hold:
\[
\begin{aligned}
& \nabla f(x^*) + \sum_{i=1}^{m} \lambda_i \nabla g_i(x^*) + \sum_{i=1}^{k} \mu_i \nabla h_i(x^*) = 0 \\
& \lambda_i \ge 0, \quad \forall i = 1, \dots, m \\
& \lambda_i g_i(x^*) = 0, \quad \forall i = 1, \dots, m
\end{aligned}
\]
As in the unconstrained case, this is a necessary (but not sufficient) condition for optimality.
2.3 Example
Consider the following optimization problem
\[
\begin{aligned}
\min\ & x \\
\text{s.t. } & x \in \mathbb{R} \\
& -x \le 0.
\end{aligned}
\]
The KKT conditions are that the local minima x∗ satisfy
\[
\begin{aligned}
& 1 - \lambda_1 = 0 \\
& \lambda_1 \ge 0 \\
& -\lambda_1 x^* = 0.
\end{aligned}
\]
Next, we can compute all combinations of (λ1 , x) for which the KKT conditions hold.
From the condition 1 − λ1 = 0, we have λ1 = 1. Combining this with the condition
−λ1 x = 0 gives x = 0. This is the only solution of the KKT conditions, and so
the local minimizer must also be the global minimizer. Summarizing, the global minimizer
is x∗ = 0.
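As a quick numerical sanity check (not part of the original example), a standard constrained solver recovers the same answer; the SciPy call below is a hypothetical illustration, and note that SciPy encodes inequality constraints as g(x) ≥ 0 rather than g(x) ≤ 0.

    from scipy.optimize import minimize

    # min x  subject to  -x <= 0, i.e. x >= 0.
    result = minimize(
        fun=lambda x: x[0],
        x0=[5.0],                                                # arbitrary feasible start
        constraints=[{"type": "ineq", "fun": lambda x: x[0]}],   # x >= 0 in SciPy's convention
        method="SLSQP",
    )
    print(result.x)  # approximately [0.], matching the KKT analysis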
3 Maximum Likelihood Estimation
The idea of maximum likelihood estimation (MLE) is to generate an estimate of some unknown parameters by solving the following optimization problem
\[
\max\{L(X, \theta) \mid \theta \in \Theta\} = \max\Big\{ \prod_{i=1}^{n} f_\theta(X_i) \;\Big|\; \theta \in \Theta \Big\}.
\]
Here, the constraint θ ∈ Θ is meant to be short-hand notation for the inequality and equality
constraints discussed earlier.
3.1 Example: Service Rate of Exponential Distribution
To get a better understanding of how this approach works, consider the case of using MLE
to estimate the service rate of an exponential distribution. The likelihood is
\[
L(X, \lambda) = \prod_{i=1}^{n} \lambda \exp(-\lambda X_i) = \lambda^n \exp\Big(-\lambda \sum_{i=1}^{n} X_i\Big).
\]
And we would like to solve the problem
\[
\max\Big\{ \lambda^n \exp\Big(-\lambda \sum_{i=1}^{n} X_i\Big) \;\Big|\; \lambda > 0 \Big\}.
\]
This formulation does not exactly fit the framework described above, because strict
inequalities are somewhat problematic for optimization problems: the optimum may not be
attained. And so we will instead consider the following optimization problem
\[
\max\Big\{ \lambda^n \exp\Big(-\lambda \sum_{i=1}^{n} X_i\Big) \;\Big|\; \lambda \ge 0 \Big\},
\]
and we will hope that λ = 0 is not the solution because this would indicate a problem.
In principle, we can solve this problem by finding all points satisfying the KKT conditions.
However, sometimes ad hoc approaches can be easier. For instance, suppose we find the
points at which the gradient is zero:
\[
n\lambda^{n-1} \exp\Big(-\lambda \sum_{i=1}^{n} X_i\Big) - \lambda^n \exp\Big(-\lambda \sum_{i=1}^{n} X_i\Big) \cdot \sum_{i=1}^{n} X_i = 0.
\]
Now assuming λ > 0, we can simplify this to
\[
n - \lambda \sum_{i=1}^{n} X_i = 0 \quad\Rightarrow\quad \lambda^* = \Big( \frac{1}{n} \sum_{i=1}^{n} X_i \Big)^{-1}.
\]
Since the random variables satisfy Xi > 0 when drawn from an exponential distribution, this
estimate is strictly positive. The objective at λ = 0 is zero, the objective also tends to zero
as λ → ∞, and the objective is strictly positive for all λ > 0. Consequently, the above
solution, which is the only stationary point on (0, ∞), must be the global maximum.
Observe that in this case the MLE estimate is equivalent to the inverse of the sample average. However, in other cases the MLE estimate can differ from the estimates we previously
considered.
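To connect the formula back to the optimization problem, here is a minimal Python sketch (simulated data and a SciPy scalar optimizer; none of this is from the notes) that maximizes the likelihood numerically and compares the result with the inverse of the sample average:

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(1)
    X = rng.exponential(scale=0.5, size=25)  # hypothetical data with true rate lambda = 2

    # Maximize the likelihood by minimizing its negative over a bounded interval.
    # (The sample is kept small so the raw likelihood stays numerically representable;
    #  see the log transformation in the next subsection.)
    def neg_likelihood(lam):
        return -(lam ** len(X)) * np.exp(-lam * X.sum())

    result = minimize_scalar(neg_likelihood, bounds=(1e-6, 10.0), method="bounded")
    print(result.x)      # numerical maximizer of the likelihood
    print(1 / X.mean())  # closed-form MLE: the two agree up to solver tolerance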
3.2 Transformation of the Objective Function
In some cases, it can be easier to solve a transformed version of the optimization problem.
For instance, two transformations that are useful in practice are:
1. instead of maximizing the likelihood, we could equivalently minimize the negative of
the likelihood;
2. instead of maximizing the likelihood, we could equivalently maximize the likelihood
composed with a strictly increasing function.
In fact, one very useful transformation is to minimize the negative log-likelihood. The
utility of such transformations is most clearly elucidated by an example.
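One practical reason the negative log-likelihood is preferred, beyond analytical convenience, is numerical: for even moderately large samples the raw likelihood is a product of many small numbers and underflows double precision, while the log-likelihood remains an ordinary finite number. The following minimal Python sketch (simulated exponential data, not from the notes) makes the point:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.exponential(scale=1.25, size=2000)  # hypothetical large sample (true rate 0.8)
    lam = 1 / X.mean()                          # evaluate both objectives at the MLE

    likelihood = lam ** len(X) * np.exp(-lam * X.sum())      # product of n small terms
    log_likelihood = len(X) * np.log(lam) - lam * X.sum()    # sum of n ordinary terms

    print(likelihood)      # 0.0: underflows double precision
    print(log_likelihood)  # an ordinary finite (negative) number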
3.3 Example: Mean and Variance of Gaussian
Suppose Xi ∼ N (µ, σ 2 ) are iid measurements from a Gaussian with unknown mean and
variance. Then the MLE estimate is given by the solution to
\[
\max\Big\{ (2\pi\sigma^2)^{-n/2} \exp\Big( \sum_{i=1}^{n} -(X_i - \mu)^2/(2\sigma^2) \Big) \;\Big|\; \sigma^2 > 0 \Big\}.
\]
If we instead minimize the negative log-likelihood, then the problem we would like to solve
is
\[
\min\Big\{ (n/2) \log(2\pi\sigma^2) + \sum_{i=1}^{n} (X_i - \mu)^2/(2\sigma^2) \;\Big|\; \sigma^2 > 0 \Big\}.
\]
If we consider the objective function
\[
h(\mu, \sigma^2) = (n/2) \log(2\pi\sigma^2) + \sum_{i=1}^{n} (X_i - \mu)^2/(2\sigma^2)
\]
(that is, we are treating σ² as a single variable), then setting its gradient equal to zero gives
\[
\begin{bmatrix} h_\mu \\ h_{\sigma^2} \end{bmatrix}
=
\begin{bmatrix}
-\sum_{i=1}^{n} 2(X_i - \mu)/(2\sigma^2) \\
n/(2\sigma^2) - \sum_{i=1}^{n} (X_i - \mu)^2/(2(\sigma^2)^2)
\end{bmatrix}
= 0,
\]
where hµ and hσ² denote the partial derivatives of h(µ, σ²) with respect to µ and σ², respectively. To solve this, we begin with
\[
h_\mu = -\sum_{i=1}^{n} 2(X_i - \mu)/(2\sigma^2) = 0 \;\Rightarrow\; \sum_{i=1}^{n} (X_i - \mu) = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i,
\]
where we have made use of the fact that σ 2 > 0. Next, note that
\[
h_{\sigma^2} = n/(2\sigma^2) - \sum_{i=1}^{n} (X_i - \hat{\mu})^2/(2(\sigma^2)^2) = 0 \;\Rightarrow\; n\sigma^2 = \sum_{i=1}^{n} (X_i - \hat{\mu})^2 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \hat{\mu})^2,
\]
where we multiplied through by 2(σ²)² in the second step.
The estimate of the mean $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i$ is simply the sample average; however, the estimate
of the variance $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \hat{\mu})^2$ is the biased estimator of the variance (cf. the unbiased
estimate used for hypothesis testing, $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \hat{\mu})^2$). So in this case, the MLE
estimates differ from the estimates we have previously used for hypothesis testing.
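As a final sketch (simulated Gaussian data; the only outside fact used is NumPy's variance convention), the two MLE formulas can be checked directly: np.var uses the biased 1/n convention by default, while ddof=1 gives the unbiased 1/(n − 1) estimator s² used for hypothesis testing.

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(loc=5.0, scale=2.0, size=50)  # hypothetical Gaussian sample

    mu_hat = X.mean()                            # MLE of the mean: the sample average
    sigma2_hat = ((X - mu_hat) ** 2).mean()      # MLE of the variance: biased 1/n estimator

    print(mu_hat, sigma2_hat)
    print(X.var())        # equals sigma2_hat (NumPy defaults to the 1/n convention)
    print(X.var(ddof=1))  # the unbiased 1/(n-1) estimator s^2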