IEOR 165 – Lecture 14
Linear Models

1 Regression

Suppose we have a system in which an input $x \in \mathbb{R}^k$ gives an output $y \in \mathbb{R}$, and suppose there is a static relationship between $x$ and $y$ given by
\[ y = f(x) + \epsilon, \]
where $\epsilon$ is zero-mean noise with finite variance (i.e., $\mathbb{E}(\epsilon) = 0$ and $\mathrm{var}(\epsilon) = \sigma^2 < \infty$). We will also assume that $\epsilon$ is independent of $x$; otherwise our model has what is known as endogeneity, which is a common topic of study in econometrics.

The process of modeling involves using measured data to identify the relationship between $x$ and $y$, meaning to identify $\mathbb{E}[y \mid x] = f(x)$. This is a huge topic of inquiry, but in this course we will categorize this regression problem into three classes:

1. Parametric Regression – The unknown function $f(x)$ is characterized by a finite number of parameters. It is common to write $f(x; \beta)$, where $\beta \in \mathbb{R}^p$ is a vector of unknown parameters. The simplest example is a linear model, in which
\[ f(x; \beta) = \sum_{j=1}^{p} \beta_j x_j. \]
This approach is used when there is strong a priori knowledge about the structure of the system (e.g., physics, biology, etc.).

2. Nonparametric Regression – The unknown function $f(x)$ is characterized by an infinite number of parameters. For instance, we might want to represent $f(x)$ as an infinite polynomial expansion
\[ f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots. \]
This approach is used when there is little a priori knowledge about the structure of the system. Though it might seem that this approach is superior because it is more flexible than parametric regression, it turns out that one must pay a statistical penalty because a greater number of parameters must be estimated.

3. Semiparametric Regression – The unknown function $f(x)$ is characterized by one component with a finite number of parameters and another component with an infinite number of parameters. In some cases, the infinite set of parameters is known as nuisance parameters; in other cases, this infinite component might carry useful information in and of itself. A classic example is a partially linear model:
\[ f(x) = \sum_{j=1}^{m} \beta_j x_j + g(x_{m+1}, \ldots, x_k). \]
Here, $g(x_{m+1}, \ldots, x_k)$ is represented nonparametrically, and the $\sum_{j=1}^{m} \beta_j x_j$ term is the parametric component.

This categorization is quite crude because in some problems the classes blend into each other. For instance, high-dimensional parametric regression can be thought of as nonparametric regression. The key problem in regression is that of regularization: the idea of regularization is to improve the statistical properties of estimates by imposing additional structure onto the model.

2 Gaussian Noise

Suppose that we have pairs of independent measurements $(x_i, y_i)$ for $i = 1, \ldots, n$, where $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$, and that the system is described by a linear model
\[ y_i = \sum_{j=1}^{p} \beta_j x_i^j + \epsilon_i = x_i'\beta + \epsilon_i. \]
Furthermore, suppose that the noise is described by a zero-mean Gaussian distribution $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, where $\sigma^2$ is unknown. In this setting, we can use MLE to estimate the unknown parameters of the linear model. We begin by noting that the likelihood is
\[ L(X, \beta, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\Big( -\sum_{i=1}^{n} (y_i - x_i'\beta)^2 / (2\sigma^2) \Big). \]
And so the MLE is given by minimizing the negative log-likelihood
\[ \min\Big\{ \tfrac{n}{2}\log(2\pi\sigma^2) + \sum_{i=1}^{n} (y_i - x_i'\beta)^2 / (2\sigma^2) \;\Big|\; \sigma^2 > 0 \Big\}. \]
This problem has a peculiar structure! In particular, the minimizer of $\sum_{i=1}^{n} (y_i - x_i'\beta)^2$ over $\beta$ does not depend on $\sigma^2$. And so the MLE estimate of the parameters is equivalently given by the solution to
\[ \hat{\beta} = \arg\min \sum_{i=1}^{n} (y_i - x_i'\beta)^2. \]
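To make this concrete, the following is a minimal sketch of computing this least-squares estimate on simulated data. It is written in Python with NumPy (the lecture itself contains no code), and the sample sizes, variable names, and use of `np.linalg.lstsq` are illustrative assumptions; the rows of the array `X` below are the $x_i'$, anticipating the matrix notation introduced next.

```python
# A minimal sketch on simulated data; NumPy and the names below are
# illustrative choices, not part of the lecture.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))             # i-th row is x_i'
beta_true = np.array([1.0, -2.0, 0.5])  # "true" parameters for the simulation
Y = X @ beta_true + rng.normal(scale=0.5, size=n)  # zero-mean Gaussian noise

# beta_hat = argmin_beta sum_i (y_i - x_i' beta)^2, computed with a
# numerically stable least-squares routine rather than an explicit inverse.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat)  # should be close to beta_true
```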
If we define the matrix $X \in \mathbb{R}^{n \times p}$ and a vector $Y \in \mathbb{R}^n$ such that the $i$-th row of $X$ is $x_i'$ and the $i$-th entry of $Y$ is $y_i$, then this problem can be rewritten as
\[ \hat{\beta} = \arg\min \|Y - X\beta\|_2^2, \]
where $\|\cdot\|_2$ is the usual $L_2$-norm. (Recall that for a vector $v \in \mathbb{R}^k$ the $L_2$-norm is $\|v\|_2 = \sqrt{(v^1)^2 + \cdots + (v^k)^2}$.)

From a theoretical perspective, we can compute a closed-form expression for the minimizer. Let $J(\beta)$ be the objective of the above optimization problem; then taking the gradient and setting it to zero gives
\[ \nabla_\beta J = -2X'(Y - X\hat{\beta}) = 0 \;\Rightarrow\; X'X\hat{\beta} = X'Y \;\Rightarrow\; \hat{\beta} = (X'X)^{-1}(X'Y), \]
where we have assumed that $X'X$ is invertible. From a practical/implementation standpoint, it is not a good idea to compute the estimate from this equation directly. There has been a lot of development of algorithms with good numerical properties for solving this problem, and these algorithms can be used with simple commands in most programming languages. For instance, in MATLAB we would use the command “X\Y”. In the C or Fortran languages, we could use the LAPACK package.

One dangling point remains: we still have to provide an estimate of $\sigma^2$. In some applications, we do not care about the variance of the noise; in these cases, this variance is known as a nuisance parameter. But suppose we have an application where we do care about the variance of the noise. Then we can finish computing the MLE estimate for the noise. In particular, setting the derivative of the negative log-likelihood with respect to $\sigma^2$ (treated as a variable) equal to zero gives
\[ \frac{n}{2\sigma^2} - \frac{\|Y - X\hat{\beta}\|_2^2}{2(\sigma^2)^2} = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{n}\|Y - X\hat{\beta}\|_2^2. \]

3 Laplacian Noise

Gaussian noise is not the only possibility. Some noise distributions are called “long-tailed” because the probability of seeing extreme values is greater than it is for the Gaussian distribution. A canonical example is the Laplacian distribution, which is defined as a random variable with pdf
\[ f(u) = \frac{1}{2b} \exp\big( -|u - \mu|/b \big), \]
where $\mu$ is the location parameter and $b > 0$ is a scale parameter. The mean of this distribution is $\mu$, and the variance is $2b^2$.

It is interesting to consider what the MLE estimate for the parameters of a linear model with Laplacian noise looks like. Specifically, this is the setting where the model is
\[ y_i = \sum_{j=1}^{p} \beta_j x_i^j + \epsilon_i = x_i'\beta + \epsilon_i, \]
and the noise is described by a zero-mean Laplacian distribution with unknown scale parameter $b$. The likelihood for pairs of independent measurements $(x_i, y_i)$ for $i = 1, \ldots, n$ is given by
\[ L(X, \beta, b) = (2b)^{-n} \exp\Big( -\sum_{i=1}^{n} |y_i - x_i'\beta| / b \Big). \]
So the MLE is given by the minimizer of the negative log-likelihood
\[ \min\Big\{ n\log(2b) + \sum_{i=1}^{n} |y_i - x_i'\beta| / b \;\Big|\; b > 0 \Big\}. \]
Now observe that the minimizer of $\sum_{i=1}^{n} |y_i - x_i'\beta|$ over $\beta$ does not depend on $b$. And so the MLE estimate of the parameters is equivalently given by the solution to
\[ \hat{\beta} = \arg\min \sum_{i=1}^{n} |y_i - x_i'\beta|. \]
If we define the matrix $X \in \mathbb{R}^{n \times p}$ and a vector $Y \in \mathbb{R}^n$ such that the $i$-th row of $X$ is $x_i'$ and the $i$-th entry of $Y$ is $y_i$, then this problem can be rewritten as
\[ \hat{\beta} = \arg\min \|Y - X\beta\|_1, \]
where $\|\cdot\|_1$ is the usual $L_1$-norm. (Recall that for a vector $v \in \mathbb{R}^k$ the $L_1$-norm is $\|v\|_1 = |v^1| + \cdots + |v^k|$.) There is no closed-form expression for the solution to this optimization problem. Fortunately, it is a specific case of a linear program (LP), and so there are efficient algorithms for solving it.
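As an illustration of this LP reformulation, here is a minimal sketch (again on simulated data; SciPy's `linprog` and all variable names are illustrative choices, not something prescribed by the lecture) that computes the least-absolute-deviations estimate by introducing auxiliary variables $t_i \ge |y_i - x_i'\beta|$ and minimizing their sum.

```python
# A minimal sketch of the L1 problem as a linear program, on simulated data;
# SciPy's linprog and the names below are illustrative choices.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + rng.laplace(scale=1.0, size=n)  # zero-mean Laplacian noise

# LP variables z = (beta, t); minimize sum(t) subject to
#   X beta - t <= Y   and   -X beta - t <= -Y,
# which together enforce t_i >= |y_i - x_i' beta|.
c = np.concatenate([np.zeros(p), np.ones(n)])
A_ub = np.block([[X, -np.eye(n)], [-X, -np.eye(n)]])
b_ub = np.concatenate([Y, -Y])
bounds = [(None, None)] * p + [(0, None)] * n  # beta free, t nonnegative

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
beta_hat = res.x[:p]
print(beta_hat)  # should be close to beta_true
```

At the optimum each $t_i$ equals $|y_i - x_i'\hat{\beta}|$, so the LP objective value is exactly $\|Y - X\hat{\beta}\|_1$.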
Lastly, we can ask: what is the estimate of the nuisance parameter $b$? Setting the derivative of the negative log-likelihood with respect to $b$ equal to zero and solving gives
\[ \frac{n}{b} - \frac{\|Y - X\hat{\beta}\|_1}{b^2} = 0 \;\Rightarrow\; \hat{b} = \frac{1}{n}\|Y - X\hat{\beta}\|_1. \]

4 Unknown Noise Distribution

In cases where we do not know the noise distribution for $\epsilon_i$, we can still use the solutions from above to estimate the parameters of a linear model. If the noise has short tails, a common choice is ordinary least squares (OLS), which consists of solving
\[ \hat{\beta} = \arg\min \|Y - X\beta\|_2^2. \]
If the noise has long tails, a common choice is least absolute deviations (LAD), which consists of solving
\[ \hat{\beta} = \arg\min \|Y - X\beta\|_1. \]
Interestingly, both methods converge to the true parameters (under some technical conditions) regardless of the actual noise distribution. The difference is that OLS gives lower estimation error than LAD when the noise is Gaussian, and LAD gives lower estimation error than OLS when the noise is Laplacian. This is an example where explicitly considering the noise distribution can give better estimates than a method that ignores it.
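This claim about relative estimation error can be checked empirically. Below is a minimal Monte Carlo sketch (again Python; the simulation design, the variance-matching of the two noise distributions, and the helper names ols_fit and lad_fit are illustrative assumptions) comparing the squared estimation error of OLS and LAD under Gaussian and Laplacian noise.

```python
# A minimal Monte Carlo sketch on simulated data; all names and the
# simulation design are illustrative, not part of the lecture.
import numpy as np
from scipy.optimize import linprog

def ols_fit(X, Y):
    return np.linalg.lstsq(X, Y, rcond=None)[0]

def lad_fit(X, Y):
    # LAD as an LP with auxiliary variables t_i >= |y_i - x_i' beta|.
    n, p = X.shape
    c = np.concatenate([np.zeros(p), np.ones(n)])
    A_ub = np.block([[X, -np.eye(n)], [-X, -np.eye(n)]])
    b_ub = np.concatenate([Y, -Y])
    bounds = [(None, None)] * p + [(0, None)] * n
    return linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs").x[:p]

rng = np.random.default_rng(1)
n, p, trials = 100, 3, 200
beta_true = np.array([1.0, -2.0, 0.5])
noise = {
    "gaussian": lambda size: rng.normal(scale=1.0, size=size),
    # Laplacian scale b = 1/sqrt(2) gives variance 2b^2 = 1, matching above
    "laplacian": lambda size: rng.laplace(scale=1 / np.sqrt(2), size=size),
}

for name, draw in noise.items():
    err = {"OLS": 0.0, "LAD": 0.0}
    for _ in range(trials):
        X = rng.normal(size=(n, p))
        Y = X @ beta_true + draw(n)
        err["OLS"] += np.sum((ols_fit(X, Y) - beta_true) ** 2) / trials
        err["LAD"] += np.sum((lad_fit(X, Y) - beta_true) ** 2) / trials
    print(name, err)  # expect OLS to win under Gaussian noise, LAD under Laplacian
```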