IEOR 165 – Lecture 14
Linear Models
1 Regression
Suppose we have a system in which an input x ∈ R^k gives an output y ∈ R, and suppose there is a static relationship between x and y that is given by y = f(x) + ε, where ε is zero-mean noise with finite variance (i.e., E(ε) = 0 and var(ε) = σ² < ∞). We will also assume that ε is independent of x; otherwise our model has what is known as endogeneity, which is a common topic of study in econometrics.
The process of modeling involves using measured data to identify the relationship between x and y, that is, to identify E[y|x] = f(x). This is a huge topic of inquiry, but in this course we will categorize this regression problem into three classes:
1. Parametric Regression – The unknown function f(x) is characterized by a finite number of parameters. It is common to think of f(x; β), where β ∈ R^p is a vector of unknown parameters. The simplest example is a linear model, in which we have

    f(x; β) = Σ_{j=1}^p β_j x_j.

This approach is used when there is strong a priori knowledge about the structure of the system (e.g., physics, biology, etc.).
2. Nonparametric Regression – The unknown function f(x) is characterized by an infinite number of parameters. For instance, we might want to represent f(x) as an infinite polynomial expansion

    f(x) = β_0 + β_1 x + β_2 x² + ⋯ .

This approach is used when there is little a priori knowledge about the structure of the system. Though it might seem that this approach is superior because it is more flexible than parametric regression, it turns out that one must pay a statistical penalty because of the need to estimate a greater number of parameters.
3. Semiparametric Regression – The unknown function f(x) is characterized by a component with a finite number of parameters and another component with an infinite number of parameters. In some cases, the infinite number of parameters are known as nuisance parameters; however, in other cases this infinite component might have useful information in and of itself. A classic example is a partially linear model:

    f(x) = Σ_{j=1}^m β_j x_j + g(x_{m+1}, . . . , x_k).

Here, g(x_{m+1}, . . . , x_k) is represented nonparametrically, and the Σ_{j=1}^m β_j x_j term is the parametric component.
This categorization is quite crude because in some problems the classes can blend into each
other. For instance, high-dimensional parametric regression can be thought of as nonparametric regression. The key problem in regression is that of regularization. The idea of
regularization is to improve the statistical properties of estimates by imposing additional
structure onto the model.
2 Gaussian Noise
Suppose that we have pairs of independent measurements (x_i, y_i) for i = 1, . . . , n, where x_i ∈ R^p and y_i ∈ R, and that the system is described by a linear model

    y_i = Σ_{j=1}^p β_j x_{ji} + ε_i = x_i′β + ε_i.

Furthermore, suppose that the noise is described by a zero-mean Gaussian distribution: ε_i ∼ N(0, σ²), where σ² is unknown.
In this setting, we can use MLE to estimate the unknown parameters of the linear model. We begin by noting that the likelihood is

    L(X, β, σ²) = (2πσ²)^{−n/2} exp(− Σ_{i=1}^n (y_i − x_i′β)² / (2σ²)).
And so the MLE is given by minimizing the negative log-likelihood

    min { (n/2) log(2πσ²) + Σ_{i=1}^n (y_i − x_i′β)² / (2σ²) : σ² > 0 }.
This problem has a peculiar structure! In particular, the β that minimizes Σ_{i=1}^n (y_i − x_i′β)² does not depend on σ². And so the MLE estimate of the parameters is equivalently given by the solution to

    β̂ = arg min_β Σ_{i=1}^n (y_i − x_i′β)².
If we define the matrix X ∈ R^{n×p} and a vector Y ∈ R^n such that the i-th row of X is x_i′ and the i-th entry of Y is y_i, then this problem can be rewritten as

    β̂ = arg min_β ‖Y − Xβ‖_2²,

where ‖·‖_2 is the usual L2-norm. (Recall that for a vector v ∈ R^k the L2-norm is ‖v‖_2 = √((v_1)² + . . . + (v_k)²).)
From a theoretical perspective, we can compute a closed-form expression for the minimizer. Let J(β) be the objective of the above optimization problem; then taking the gradient and setting it to zero gives

    ∇_β J = −2X′(Y − Xβ̂) = 0  ⇒  X′Xβ̂ = X′Y  ⇒  β̂ = (X′X)^{−1}(X′Y),

where we have assumed that X′X is invertible. From a practical/implementation standpoint, it is not a good idea to compute the estimates using this equation. There has been a lot of development of algorithms with good numerical properties to solve this problem, and these algorithms can be used with simple commands in most programming languages. For instance, in MATLAB we would use the command "X\Y". In C or Fortran, we could use the LAPACK library.
One dangling point remains: we still have to provide an estimate of σ². In some applications, we do not care about the variance of the noise. In these cases, this variance is known as a nuisance parameter. But suppose we have an application where we care about the variance of the noise. Then we can finish computing the MLE estimate for the noise. In particular, suppose we set the derivative of the negative log-likelihood with respect to σ² (treated as a variable) equal to zero; then this gives

    n/(2σ²) − ‖Y − Xβ̂‖_2² / (2(σ²)²) = 0  ⇒  σ̂² = (1/n) ‖Y − Xβ̂‖_2².
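As a concrete illustration (not part of the original notes), the following is a minimal Python/NumPy sketch of these two estimates on simulated data; the data-generating values (n, p, beta_true, sigma_true) are hypothetical. It uses np.linalg.lstsq, a numerically stable least-squares solver analogous to MATLAB's X\Y, rather than forming (X′X)^{−1} explicitly.

import numpy as np

# Simulate data from a linear model with Gaussian noise (hypothetical values).
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
sigma_true = 0.5
Y = X @ beta_true + rng.normal(scale=sigma_true, size=n)

# MLE of beta: solve argmin ||Y - X beta||_2^2 with a least-squares solver
# (numerically preferable to computing (X'X)^{-1} X'Y directly).
beta_hat, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)

# MLE of the noise variance: sigma2_hat = ||Y - X beta_hat||_2^2 / n.
sigma2_hat = np.sum((Y - X @ beta_hat) ** 2) / n

print("beta_hat:", beta_hat)
print("sigma2_hat:", sigma2_hat)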
3 Laplacian Noise
Gaussian noise is not the only possibility. Some noise distributions are called "long-tailed" because the probability of seeing extreme values is greater than for the Gaussian distribution. A canonical example is the Laplacian distribution, which is defined as a random variable with pdf given by

    f(u) = (1/(2b)) exp(−|u − µ|/b),

where µ is the location parameter and b > 0 is a scale parameter. The mean of this distribution is µ, and the variance is 2b².
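As a quick numerical sanity check of these moment formulas (not from the original notes; the values of µ and b below are hypothetical), one can sample from the Laplacian distribution with NumPy and compare the sample variance to 2b²:

import numpy as np

rng = np.random.default_rng(0)
mu, b = 1.5, 0.8
samples = rng.laplace(loc=mu, scale=b, size=1_000_000)
print("sample mean:", samples.mean(), "  (location mu =", mu, ")")
print("sample variance:", samples.var(), "  (2*b^2 =", 2 * b**2, ")")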
It is interesting to consider what the MLE estimate for the parameters of a linear model with Laplacian noise looks like. Specifically, this is the setting where the model is

    y_i = Σ_{j=1}^p β_j x_{ji} + ε_i = x_i′β + ε_i,

and the noise is described by a zero-mean Laplacian distribution with unknown scale parameter b. The likelihood for pairs of independent measurements (x_i, y_i) for i = 1, . . . , n is given by

    L(X, β, b) = (2b)^{−n} exp(− Σ_{i=1}^n |y_i − x_i′β| / b).
So the MLE is given by the minimizer of the negative log-likelihood

    min { n log(2b) + Σ_{i=1}^n |y_i − x_i′β| / b : b > 0 }.

Now observe that the β that minimizes Σ_{i=1}^n |y_i − x_i′β| does not depend on b. And so the MLE estimate of the parameters is equivalently given by the solution to

    β̂ = arg min_β Σ_{i=1}^n |y_i − x_i′β|.
If we define the matrix X ∈ R^{n×p} and a vector Y ∈ R^n such that the i-th row of X is x_i′ and the i-th entry of Y is y_i, then this problem can be rewritten as

    β̂ = arg min_β ‖Y − Xβ‖_1,

where ‖·‖_1 is the usual L1-norm. (Recall that for a vector v ∈ R^k the L1-norm is ‖v‖_1 = |v_1| + . . . + |v_k|.) There is no closed-form expression for the solution to this optimization problem. Fortunately, it can be reformulated as a linear program (LP), and so there are efficient algorithms to solve it.
Lastly, we can ask: what is the estimate of the nuisance parameter b? Setting the derivative of the negative log-likelihood with respect to b equal to zero and then solving gives

    n/b − ‖Y − Xβ̂‖_1 / b² = 0  ⇒  b̂ = (1/n) ‖Y − Xβ̂‖_1.
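As an illustration (not from the original notes), here is a minimal Python sketch that solves the LAD problem through its standard LP reformulation, min Σ_i t_i subject to −t_i ≤ y_i − x_i′β ≤ t_i, using scipy.optimize.linprog; the simulated data and parameter values are hypothetical.

import numpy as np
from scipy.optimize import linprog

# Simulate data from a linear model with Laplacian noise (hypothetical values).
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
b_true = 0.8
Y = X @ beta_true + rng.laplace(loc=0.0, scale=b_true, size=n)

# LAD as a linear program over the variables [beta (p entries), t (n entries)]:
#   minimize sum_i t_i   subject to   -t_i <= y_i - x_i'beta <= t_i.
c = np.concatenate([np.zeros(p), np.ones(n)])
A_ub = np.block([[-X, -np.eye(n)],    #   y_i - x_i'beta <= t_i
                 [ X, -np.eye(n)]])   #  -(y_i - x_i'beta) <= t_i
b_ub = np.concatenate([-Y, Y])
bounds = [(None, None)] * p + [(0, None)] * n   # beta free, t >= 0
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")

beta_hat = res.x[:p]
b_hat = np.sum(np.abs(Y - X @ beta_hat)) / n    # MLE of the Laplacian scale

print("beta_hat:", beta_hat)
print("b_hat:", b_hat)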
4 Unknown Noise Distribution
In cases where we do not know the noise distribution for ε_i, we can still use our solutions from above to estimate the parameters of a linear model. If the noise has short tails, a common choice is ordinary least squares (OLS), which consists of solving

    β̂ = arg min_β ‖Y − Xβ‖_2².

If the noise has long tails, a common choice is least absolute deviations (LAD), which consists of solving

    β̂ = arg min_β ‖Y − Xβ‖_1.
Interestingly, these methods will converge to the true parameters (under some technical conditions) regardless of the actual noise distribution. The difference is that OLS gives lower estimation error (compared to LAD) when the noise is Gaussian, and LAD gives lower estimation error (compared to OLS) when the noise is Laplacian. This is an example where explicitly considering the distribution can give better estimates than a method that ignores the noise distribution.
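The following small Monte Carlo sketch (not from the original notes; sample sizes, parameter values, and noise scales are hypothetical) illustrates this point by comparing the estimation error of OLS and LAD under Gaussian and Laplacian noise with matched variance:

import numpy as np
from scipy.optimize import linprog

def ols(X, Y):
    # OLS: argmin ||Y - X beta||_2^2, via a stable least-squares solver.
    return np.linalg.lstsq(X, Y, rcond=None)[0]

def lad(X, Y):
    # LAD: argmin ||Y - X beta||_1, via its LP reformulation.
    n, p = X.shape
    c = np.concatenate([np.zeros(p), np.ones(n)])
    A_ub = np.block([[-X, -np.eye(n)], [X, -np.eye(n)]])
    b_ub = np.concatenate([-Y, Y])
    bounds = [(None, None)] * p + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:p]

rng = np.random.default_rng(1)
n, p, trials = 100, 3, 200
beta_true = np.array([1.0, -2.0, 0.5])
errors = {(m, d): [] for m in ("OLS", "LAD") for d in ("gaussian", "laplacian")}

for _ in range(trials):
    X = rng.normal(size=(n, p))
    noise = {"gaussian": rng.normal(scale=1.0, size=n),
             "laplacian": rng.laplace(scale=1.0 / np.sqrt(2), size=n)}  # both variance 1
    for dist, eps in noise.items():
        Y = X @ beta_true + eps
        errors[("OLS", dist)].append(np.linalg.norm(ols(X, Y) - beta_true))
        errors[("LAD", dist)].append(np.linalg.norm(lad(X, Y) - beta_true))

for key in sorted(errors):
    print(key, "mean estimation error:", round(np.mean(errors[key]), 4))

Under this setup, OLS should show smaller average error on the Gaussian runs, and LAD should show smaller average error on the Laplacian runs.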