ERRORS IN THE DEPENDENT VARIABLE OF QUANTILE REGRESSION MODELS

JERRY HAUSMAN, YE LUO, AND CHRISTOPHER PALMER
MASSACHUSETTS INSTITUTE OF TECHNOLOGY

APRIL 2014

*PRELIMINARY AND INCOMPLETE*
*PLEASE DO NOT CITE OR CIRCULATE WITHOUT PERMISSION*

Abstract. The usual quantile regression estimator of Koenker and Bassett (1978) is biased if there is an additive error term in the dependent variable. We analyze this problem as an errors-in-variables problem in which the dependent variable suffers from classical measurement error and develop a sieve maximum-likelihood approach that is robust to left-hand-side measurement error. After describing sufficient conditions for identification, we employ global optimization algorithms (specifically, a genetic algorithm) to solve the MLE optimization problem. We also propose a computationally tractable EM-type algorithm that accelerates computation in the neighborhood of the optimum. When the number of knots in the quantile grid grows at an appropriate rate, the sieve maximum-likelihood estimator is asymptotically normal. We verify our theoretical results with Monte Carlo simulations and illustrate our estimator with an application to the returns to education, highlighting important changes over time in the returns to education that have been obscured in previous work by measurement-error bias.

Keywords: Measurement Error, Quantile Regression, Functional Analysis, EM Algorithm

Hausman: Professor of Economics, MIT Economics; jhausman@mit.edu. Luo: PhD Candidate, MIT Economics; lkurt@mit.edu. Palmer: PhD Candidate, MIT Economics; cjpalmer@mit.edu. We thank Victor Chernozhukov and Kirill Evdokimov for helpful discussions, as well as workshop participants at MIT. Cuicui Chen provided research assistance.

1. Introduction

In this paper, we study the consequences of measurement error in the dependent variable of conditional quantile models and propose an adaptive EM algorithm to consistently estimate the distributional effects of covariates in such a setting. The errors-in-variables (EIV) problem has received significant attention in the linear model, including the well-known results that classical measurement error causes attenuation bias if present in the regressors and has no effect on unbiasedness if present in the dependent variable; see Hausman (2001) for an overview. In general, the linear-model results do not hold in nonlinear models.¹ We are particularly interested in the linear quantile regression setting.² Hausman (2001) observes that EIV in the dependent variable of quantile regression models generally leads to significant bias, a result very different from the linear-model intuition.

In general, EIV in the dependent variable can be viewed as a mixture model.³ One way to estimate a mixture distribution is to use the expectation-maximization (EM) algorithm.⁴ We show that under certain discontinuity assumptions, by choosing the growth rate of the number of knots in the quantile grid, our estimator converges at a fractional polynomial rate in n and is asymptotically normal. We suggest using the bootstrap for inference. In practice, however, we find that the EM algorithm is only feasible in a local neighborhood of the true parameter. We therefore propose using global optimization algorithms, such as the genetic algorithm, to compute the sieve ML estimator.

¹ Schennach (2008) establishes identification and a consistent nonparametric estimator when EIV exists in an explanatory variable. Studies focusing on nonlinear models in which the left-hand-side variable is measured with error include Hausman et al. (1998) and Cosslett (2004), who study probit and tobit models, respectively.
² Carroll and Wei (2009) propose an iterative estimator for quantile regression when one of the regressors has EIV.
³ A common feature of mixture models in a semiparametric or nonparametric framework is the ill-posed inverse problem; see Evdokimov (2010). We face the ill-posed problem here, and our model specification is linked to the Fredholm integral equation of the first kind. The inverse of such integral equations is usually ill-posed even if the integral kernel is positive definite. The key symptom of these model specifications is that the high-frequency signal of the object we are interested in is wiped out, or at least shrunk, by the unknown noise if its distribution is smooth. Recovering these signals is difficult, and all feasible estimators converge more slowly than the usual √n rate. The convergence rate of our estimator depends on the decay rate of the eigenvalues of the integral operator. We explain this technical problem in more detail in the relevant section of the paper.
⁴ Dempster, Laird, and Rubin (1977) show that the EM algorithm performs well in finite mixture models when the EIV follows some (unknown) parametric family of distributions. The algorithm can be extended to infinite mixture models.
Intuitively, the estimated quantile regression line x_i β̂(τ) for quantile τ may be far from the observed y_i because of LHS measurement error or because the unobserved conditional quantile u_i of observation i is far from τ. Our MLE framework, implemented via an EM algorithm, effectively estimates the likelihood that a given ε_ik is large because of measurement error rather than because the unobserved conditional quantile is far from τ_k. The estimate of the joint distribution of the conditional quantile and the measurement error allows us to weight the log-likelihood contribution of observation i more heavily in the estimation of β(τ_k) when observation i is likely to have u_i ≈ τ_k. In the case of Gaussian errors in variables, this estimator reduces to weighted least squares, with weights equal to the probability of observing the quantile-specific residual for a given observation as a fraction of the total probability of the same observation's residuals across all quantiles.

An empirical example (extending Angrist et al., 2006) studies the heterogeneity of the returns to education across conditional quantiles of the wage distribution. We find that when we correct for likely measurement error in the self-reported wage data, we estimate considerably more heterogeneity across the wage distribution in the returns to education. In particular, the education coefficient at the bottom of the wage distribution is lower than previously estimated, and the returns to education for latently high-wage individuals have been increasing over time and are much higher than previously estimated. By 2000, the returns to education for the top of the conditional wage distribution are over three times larger than the returns for any other segment of the distribution.

The rest of the paper proceeds as follows. In Section 2, we introduce the model specification and identification conditions.
In Section 3, we consider the MLE estimation method and analyze its properties. In Sections 4 and 5, we discuss sieve estimation and the EM algorithm. We present Monte Carlo simulation results in Section 6, and Section 7 contains our empirical application. Section 8 concludes. The Appendix contains an extension using the deconvolution method and additional proofs.

Notation: Define the domain of x as X and the space of y as Y. Denote by a ∧ b the minimum of a and b, and by a ∨ b the maximum of a and b. Let →_d denote weak convergence (convergence in distribution), →_p convergence in probability, and →*_d weak convergence in outer probability. Let f(ε|σ) be the p.d.f. of the EIV ε parametrized by σ. The true parameters are β_0(·), the coefficients of the quantile model, and σ_0, the parameter of the density of the EIV. Let d_x be the dimension of x, Ω the domain of σ, and d_σ the dimension of σ. Define ||(β_0, σ_0)|| := (||β_0||_2² + ||σ_0||_2²)^{1/2} as the L² norm of (β_0, σ_0), where ||·||_2 is the usual Euclidean norm. For β_k ∈ R^k, define ||(β_k, σ_0)||_2 := (||β_k||_2²/k + ||σ_0||_2²)^{1/2}.

2. Model and Identification

We consider the standard linear conditional quantile model, in which the τth quantile of the dependent variable y* is a linear function of x:

    Q_{y*}(τ|x) = xβ(τ).

However, we are interested in the situation in which y* is not directly observed; we instead observe y, where

    y = y* + ε

and ε is a mean-zero, i.i.d. error term independent of y*, x, and τ. Unlike the linear regression case, in which EIV in the left-hand-side variable does not matter for consistency and asymptotic normality, EIV in the dependent variable can lead to severe bias in quantile regression. More specifically, with ρ_τ(z) denoting the check function (plotted in Figure 1)

    ρ_τ(z) = z(τ − 1(z < 0)),

Figure 1. Check Function ρ_τ(z). [Figure: the check function, with slope −(1 − τ) for z < 0 and slope τ for z > 0.]

the minimization problem in the usual quantile regression,

    β(τ) ∈ arg min_b E[ρ_τ(y − xb)],    (2.1)

is generally no longer minimized at the true β_0(τ) when EIV exists in the dependent variable. When there is no EIV in the left-hand-side variable, i.e., y* is observed, the FOC is

    E[x(τ − 1(y* < xβ(τ)))] = 0,    (2.2)

and the true β(τ) is the solution to this system of FOC conditions, as shown by Koenker and Bassett (1978). However, with left-hand-side EIV, the FOC condition determining β̂(τ) becomes

    E[x(τ − 1(y* + ε < xβ(τ)))] = 0.    (2.3)

For τ ≠ .5, the presence of ε results in the FOC being satisfied at a different estimate of β than in equation (2.2), even when ε is symmetrically distributed, because of the asymmetry of the check function. In other words, in the minimization problem, observations for which y* ≥ xβ(τ), which should therefore receive a weight of τ, may end up on the left-hand side of the check function and receive a weight of (1 − τ). Thus, equal-sized differences on either side of zero do not cancel each other out. For median regression, τ = .5 and ρ_{.5}(·) is symmetric around zero. This means that if ε is symmetrically distributed and β(τ) is symmetrically distributed around τ = .5 (as would be the case, for example, if β(τ) were linear in τ), the expectation in equation (2.3) holds at the true β_0(τ). However, for non-symmetric ε, equation (2.3) is not satisfied at the true β_0(τ).
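To see this mechanism numerically before the more systematic Monte Carlo example below, the following minimal sketch (our own illustration, not the paper's code) fits standard quantile regressions with and without additive error in the dependent variable. The DGP, sample size, and use of statsmodels' QuantReg are assumptions of this sketch.

import numpy as np
from statsmodels.regression.quantile_regression import QuantReg

rng = np.random.default_rng(0)
n = 20_000
x = np.abs(rng.normal(size=n)) + 1.0          # positive regressor (illustrative)
u = rng.uniform(size=n)                       # unobserved conditional quantile
beta = lambda tau: np.exp(tau)                # true slope function beta_1(tau)
y_star = x * beta(u)                          # valid linear quantile model in y*
eps = rng.normal(scale=2.0, size=n)           # classical LHS measurement error
y = y_star + eps

X = np.column_stack([np.ones(n), x])
for tau in (0.1, 0.5, 0.9):
    fit_star = QuantReg(y_star, X).fit(q=tau)   # y* observed: roughly unbiased
    fit_obs = QuantReg(y, X).fit(q=tau)         # y observed: biased toward the median slope
    print(f"tau={tau}: true={beta(tau):.3f}, "
          f"no-EIV={fit_star.params[1]:.3f}, with-EIV={fit_obs.params[1]:.3f}")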
Below we present a simple Monte Carlo example demonstrating the bias in quantile regression when there are random disturbances in the dependent variable y.

Example. The data-generating process for the Monte Carlo results is

    y_i = β_0(u_i) + x_{1i}β_1(u_i) + x_{2i}β_2(u_i) + ε_i,

with the measurement error ε_i distributed N(0, σ²) and the unobserved conditional quantile u_i of observation i following u_i ∼ U[0, 1]. The coefficient function β(τ) has components β_0(τ) = 0, β_1(τ) = exp(τ), and β_2(τ) = √τ. The variables x_1 and x_2 are drawn from independent lognormal distributions LN(0, 1). The number of observations is 1,000.

Table 1 presents Monte Carlo results for three cases: no measurement error, and Var(ε) equal to 4 and 16. The simulation results show that in the presence of measurement error the quantile regression estimator is severely biased.⁵ Comparing the mean bias as the variance of the measurement error increases from 4 to 16 shows that the bias is increasing in the variance of the measurement error. Intuitively, the information about the functional parameter β(·) decays as the variance of the EIV becomes larger.

⁵ A notable exception is median regression: we find that for symmetrically distributed EIV and uniformly distributed β(τ), the median regression results are unbiased.

Table 1. Monte Carlo Results: Mean Bias

                                          Quantile (τ)
  Parameter       EIV Distribution     0.1     0.25    0.5     0.75    0.9
  β_1(τ) = e^τ    ε = 0                0.006   0.003   0.002   0.000  -0.005
                  ε ∼ N(0, 4)          0.196   0.155   0.031  -0.154  -0.272
                  ε ∼ N(0, 16)         0.305   0.246   0.054  -0.219  -0.391
                  True parameter:      1.105   1.284   1.649   2.117   2.460
  β_2(τ) = √τ     ε = 0                0.000  -0.003  -0.005  -0.006  -0.006
                  ε ∼ N(0, 4)          0.161   0.068  -0.026  -0.088  -0.115
                  ε ∼ N(0, 16)         0.219   0.101  -0.031  -0.128  -0.174
                  True parameter:      0.316   0.500   0.707   0.866   0.949

Notes: Table reports mean bias (across 500 simulations) of slope coefficients estimated for each quantile τ from standard quantile regression of y on a constant, x_1, and x_2, where y = x_1β_1(τ) + x_2β_2(τ) + ε and ε is either zero (the no-measurement-error case, i.e., y* is observed) or distributed normally with variance 4 or 16. The covariates x_1 and x_2 are i.i.d. draws from LN(0, 1). N = 1,000.

2.1. Identification and Regularity Conditions. In the linear quantile model, it is assumed that xβ(τ) is increasing in τ for any x ∈ X. Suppose x_1, ..., x_{d_x} are d_x linearly independent vectors in int(X), so that Q_{y*}(τ|x_i) = x_iβ(τ) must be strictly increasing in τ. Consider the linear transformation of the model with matrix A = [x_1, ..., x_{d_x}]':

    Q_{y*}(τ|x) = xβ(τ) = (xA^{−1})(Aβ(τ)).    (2.4)

Let x̃ = xA^{−1} and β̃(τ) = Aβ(τ). The transformed model becomes

    Q_{y*}(τ|x̃) = x̃β̃(τ),    (2.5)

with every coefficient β̃_i(τ) weakly increasing in τ for 1 ≤ i ≤ d_x. Therefore, without loss of generality, we can assume that the coefficients β(τ) are increasing. Such β_i(τ), 1 ≤ i ≤ d_x, are called comonotonic functions. From now on, we assume that each β_i(τ) is an increasing function with the properties of β̃(τ). All of the convergence and asymptotic results for β̃_i(τ) carry over to the parameter β_i after the inverse transform A^{−1}.
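A small numerical sketch (our own, with hypothetical values) of the comonotonizing transformation in (2.4)-(2.5): even if a coordinate of β(τ) is not monotone, premultiplying by A built from points x_1, ..., x_{d_x} in int(X) yields coordinates x_iβ(τ) that are conditional quantile functions and hence increasing in τ.

import numpy as np

taus = np.linspace(0.05, 0.95, 19)
# Hypothetical coefficient functions: beta_1 is not monotone, beta_2 is
beta = np.vstack([np.sin(6 * taus), 30 * taus]).T        # shape (19, 2)
A = np.array([[1.0, 1.0],                                # rows: two points in int(X)
              [2.0, 0.5]])
beta_tilde = beta @ A.T                                  # beta_tilde(tau) = A beta(tau)

print("beta_1 increasing?                 ", np.all(np.diff(beta[:, 0]) > 0))
print("each coordinate of A*beta increasing?", np.all(np.diff(beta_tilde, axis=0) > 0))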
Condition C1 (Properties of β(·)). We assume the following properties of the coefficient vector β(τ):
(1) β(τ) is in the space M[B_1 × B_2 × ... × B_{d_x}], where the functional space M is defined as the collection of all functions f = (f_1, ..., f_{d_x}): [0, 1] → B_1 × ... × B_{d_x}, with B_i ⊂ R a closed interval for all i ∈ {1, ..., d_x}, such that each entry f_i: [0, 1] → B_i is monotonically increasing in τ.
(2) Let B_i = [l_i, u_i], so that l_i < β_{0i}(τ) < u_i for all i ∈ {1, ..., d_x} and τ ∈ [0, 1].
(3) β_0 is a vector of C¹ functions whose derivatives are bounded from below by a positive constant for all i ∈ {1, ..., d_x}.
(4) The domain of the parameter σ is a compact space Ω, and the true value σ_0 is in the interior of Ω.

Under Condition C1 it is easy to see that the parameter space D := M × Ω is compact.

Lemma 1. The space M[B_1 × B_2 × ... × B_{d_x}] is a compact and complete space under L^p, for any p ≥ 1.

Proof. See Appendix B.1. □

Monotonicity of the β_i(·) is important for identification: the likelihood f(y|x) = ∫_0^1 f(y − xβ(u)|σ) du is invariant when the distribution of the random vector β(u) is unchanged, so the function β(·) is not identified without further restrictions. Given the distribution of the random vector {β(u): u ∈ [0, 1]}, the vector of functions β: [0, 1] → B_1 × B_2 × ... × B_{d_x} is unique under rearrangement if the functions β_i, 1 ≤ i ≤ d_x, are comonotonic.

Condition C2 (Properties of x). We assume the following properties of the design matrix x:
(1) E[x'x] is non-singular.
(2) The domain of x, denoted X, is continuous in at least one dimension; i.e., there exists k ∈ {1, ..., d_x} such that for every feasible x_{−k} there is an open set X_k ⊂ R such that (X_k, x_{−k}) ⊂ X.
(3) Without loss of generality, β_k(0) ≥ 0.

Condition C3 (Properties of EIV). We assume the following properties of the measurement error ε:
(1) The probability density f(ε|σ) is differentiable in σ.
(2) For all σ ∈ Ω, there exists a uniform constant C > 0 such that E[|log f(ε|σ)|] < C.
(3) f(·) is non-zero on all of R and bounded from above.
(4) E[ε] = 0.
(5) Denote by φ(s|σ) := ∫_{−∞}^{∞} exp(isε) f(ε|σ) dε the characteristic function of ε.
(6) Assume that for any σ_1 and σ_2 in the domain Ω of σ, there exists a neighborhood of 0 on which φ(s|σ_1)/φ(s|σ_2) can be expanded as 1 + Σ_{k=2}^{∞} a_k(is)^k.

Denote θ := (β(·), σ) ∈ D. For any θ, define the expected log-likelihood function L(θ) as

    L(θ) = E[log g(y|x, θ)],    (2.6)

where the density function g(y|x, θ) is defined as

    g(y|x, θ) = ∫_0^1 f(y − xβ(u)|σ) du.    (2.7)

Define the empirical likelihood as

    L_n(θ) = E_n[log g(y|x, θ)].    (2.8)

The main identification results rely on the monotonicity of xβ(τ). The global identification condition is the following:

Condition C4 (Identification). There does not exist (β_1, σ_1) ≠ (β_0, σ_0) in the parameter space D such that g(y|x, β_1, σ_1) = g(y|x, β_0, σ_0) for all (x, y) with positive continuous density or positive mass.

Lemma 2 (Nonparametric Global Identification). Under Conditions C1-C3, for any β(·) and f(·) that generate the same density of y|x almost everywhere as the true β_0(·) and f_0(·), it must be that

    β(τ) = β_0(τ),  f(ε) = f_0(ε).

Proof. See Appendix B.1. □
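For intuition about the objects in (2.6)-(2.8), the following minimal sketch (our own notation and grid choices, not the paper's code) approximates the mixture density g(y|x, θ) and the empirical log-likelihood on a grid of quantile knots, assuming a normal EIV density for concreteness.

import numpy as np
from scipy.stats import norm

def g_density(y, X, beta_grid, sigma):
    """Approximate g(y_i|x_i) = int_0^1 f(y_i - x_i beta(u)|sigma) du by an
    average over K knots; beta_grid is a (K, d_x) array of beta values."""
    fitted = X @ beta_grid.T                 # (n, K): x_i beta(u_k)
    resid = y[:, None] - fitted              # quantile-specific residuals
    return norm.pdf(resid, scale=sigma).mean(axis=1)

def log_likelihood(y, X, beta_grid, sigma):
    """Empirical analogue of (2.8): E_n[log g(y|x, theta)]."""
    return np.log(g_density(y, X, beta_grid, sigma)).mean()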
We also summarize the local identification condition as follows:

Lemma 3 (Local Identification). Define p(·) and δ(·) as functions that measure the deviation of a given β or σ from the truth: p(τ) = β(τ) − β_0(τ) and δ(σ) = σ − σ_0. Then there does not exist a function p(τ) ∈ L²[0, 1] and δ ∈ R^{d_σ} such that, for almost all (x, y) with positive continuous density or positive mass,

    ∫_0^1 f_y(y − xβ(τ)|σ) x p(τ) dτ = δ' ∫_0^1 f_σ(y − xβ(τ)) dτ,

except p(τ) = 0 and δ = 0.

Proof. See Appendix B.1. □

Condition C5 (Stronger local identification for σ). For any δ ∈ S^{d_σ − 1},

    inf_{p(τ) ∈ L²[0,1]} E[ ( ∫_0^1 f_y(y − xβ(τ)|σ) x p(τ) dτ − δ' ∫_0^1 f_σ(y − xβ(τ)) dτ )² ] > 0.    (2.9)

3. Maximum Likelihood Estimator

3.1. Consistency. The ML estimator is defined as

    (β̂(·), σ̂) ∈ arg max_{(β(·),σ) ∈ D} E_n[log g(y|x, β(·), σ)].    (3.1)

The following lemma states the consistency of the ML estimator.

Lemma 4 (MLE Consistency). Under Conditions C1-C3, the random coefficients β(τ) and the parameter σ that determines the distribution of ε are identified in the parameter space D. The maximum-likelihood estimator

    (β̂(·), σ̂) ∈ arg max_{(β(·),σ) ∈ D} E_n[ log ∫_0^1 f(y − xβ(τ)|σ) dτ ]

exists and converges to the true parameter (β_0(·), σ_0) under the L^∞ norm in the functional space M and the Euclidean norm in Ω with probability approaching 1.

Proof. See Appendix B.2. □

This lemma is a special case of a general MLE consistency theorem (Van der Vaart, 2000). Two conditions play critical roles here: the comonotonicity of the β(·) function and the local continuity of at least one right-hand-side variable. If we do not restrict the estimator to the family of monotone functions, we lose compactness of the parameter space D and the consistency argument fails.

3.2. Ill-Posed Fredholm Integral Equation of the First Kind. In the usual MLE setting with a finite-dimensional parameter, the Fisher information matrix I is defined as

    I := E[ (∂f/∂θ)(∂f/∂θ)' ].

In our case, since the parameter is infinite-dimensional, the information kernel I(u, v) is defined as⁶

    I(u, v)[p(v), δ] = E[ ( f_{β(u)}/g , g_σ/g )' ( ∫_0^1 (f_{β(v)}/g) p(v) dv + (g_σ/g)' δ ) ].    (3.2)

⁶ For notational convenience, we abbreviate ∂f(y − xβ_0(τ)|σ_0)/∂β(τ) as f_{β(τ)}, g(y|x, β_0(τ), σ_0) as g, and ∂g/∂σ as g_σ.

If we assume that the true σ is known, i.e., δ = 0, then the information kernel becomes

    I_0(u, v) := E[ (f_{β(u)}/g)(f_{β(v)}/g)' ].

By Condition C5, the eigenvalues of I and I_0 are both non-zero. Since E[x'x] < ∞, I_0(u, v) is a compact (Hilbert-Schmidt) operator. By the Riesz representation theorem, I_0 has countably many eigenvalues λ_1 ≥ λ_2 ≥ ... > 0, with 0 the only limit point of the sequence, and can be written in the form

    I_0(·) = Σ_{i=1}^{∞} λ_i ⟨ψ_i, ·⟩ ψ_i,

where {ψ_i, i = 1, 2, ...} is a system of orthogonal basis functions. For any function s(·), the equation ∫_0^1 I_0(u, v) h(v) dv = s(u) is called a Fredholm integral equation of the first kind, and its inversion is ill-posed. Thus, the first-order condition of the ML estimator is ill-posed. In the next subsection we establish convergence-rate bounds for the ML estimator using a deconvolution method.⁷

⁷ Kuhn (1990) shows that if the integral kernel is positive definite with smoothness degree r, then the eigenvalue λ_n decays at rate O(n^{−r−1}). However, it is generally more difficult to obtain a lower bound for the eigenvalues. The decay rate of the eigenvalues of the information kernel is essential for obtaining a convergence rate, and Kuhn's result shows that this decay rate is linked to the degree of smoothness of the function f: the less smooth f is, the slower the decay. Later in the paper we show that under a discontinuity condition we can obtain a polynomial rate of convergence.

Although the estimation problem is ill-posed for the function β, it is not ill-posed for the finite-dimensional parameter σ, given that E[g_σ g_σ'] is non-singular. In the lemma below, we show that σ̂ converges to σ_0 at rate 1/√n.

Lemma 5 (Estimation of σ). If E[g_σ g_σ'] is positive definite and Conditions C1-C5 hold, the ML estimator σ̂ satisfies

    σ̂ − σ_0 = O_p(1/√n).    (3.3)

Proof. See Appendix B.2. □
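Returning to the ill-posedness discussed above, the following toy sketch (our own, not the paper's operator) discretizes a smooth positive-definite kernel on a quantile grid and inspects how quickly its eigenvalues decay; the Gaussian kernel is only a stand-in for I_0(u, v), chosen to make the point that rapid eigenvalue decay makes inversion noise-amplifying.

import numpy as np

K = 200
u = (np.arange(K) + 0.5) / K
# Smooth positive-definite stand-in kernel on [0,1]^2 (assumption of this sketch)
I0 = np.exp(-0.5 * ((u[:, None] - u[None, :]) / 0.2) ** 2) / K
eigvals = np.sort(np.linalg.eigvalsh(I0))[::-1]

print("largest eigenvalues:", np.round(eigvals[:4], 4))
print("lambda_20 / lambda_1:", eigvals[19] / eigvals[0])
# The sharp decay means that solving I0 h = s amplifies high-frequency noise,
# which is the sense in which the FOC for beta(.) is ill-posed.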
3.3. Bounds for the Maximum Likelihood Estimator. In this subsection, we use the deconvolution method to establish bounds for the maximum likelihood estimator. Recall that the maximum likelihood estimator is the solution

    (β̂, σ̂) = arg max_{(β,σ) ∈ D} E_n[log g(y|x, β, σ)].

Define χ(s_0) = sup_{|s| ≤ s_0} |1/φ_ε(s|σ_0)|. We use the following smoothness assumptions on the distribution of ε (see Evdokimov, 2010).

Condition C6 (Ordinary Smoothness (OS)). χ(s_0) ≤ C(1 + |s_0|^λ).

Condition C7 (Super Smoothness (SS)). χ(s_0) ≤ C_1(1 + |s_0|^{C_2}) exp(|s_0|^λ / C_3).

The Laplace distribution L(0, b) satisfies the ordinary smoothness (OS) condition with λ = 2. The chi-squared distribution χ²_k satisfies the OS condition with λ = k/2. The Gamma distribution Γ(k, θ) satisfies the OS condition with λ = k. The exponential distribution satisfies the OS condition with λ = 1. The Cauchy distribution satisfies the super smoothness (SS) condition with λ = 1. The normal distribution satisfies the SS condition with λ = 2.

Lemma 6 (MLE Convergence Rate). Under Conditions C1-C5 and the results in Lemma 4,
(1) if ordinary smoothness holds (C6), then the ML estimator β̂ satisfies, for all τ,

    |β̂(τ) − β_0(τ)| ≲ n^{−1/(2(1+λ))};

(2) if super smoothness holds (C7), then the ML estimator β̂ satisfies, for all τ,

    |β̂(τ) − β_0(τ)| ≲ (log n)^{−1/λ}.

Proof. See Appendix B.2. □

4. Sieve Estimation

In the last section we showed that the maximum likelihood estimator restricted to the parameter space D converges to the true parameter with probability approaching 1. However, the estimator still lives in a large space, with β(·) a d_x-dimensional vector of comonotone functions and σ a finite-dimensional parameter. Although such an estimator exists in theory, in practice it is computationally infeasible to search for the maximizer within this large space. Instead, we consider a series estimator of β(·) to mimic the comonotone functions β(·). There are many possible choices of series, for example, triangular series, (Chebyshev) power series, B-splines, etc. In this paper, we choose splines because of their computational advantages in calculating the sieve estimator used in the EM algorithm. For simplicity, we use the piecewise-constant sieve space. Below we give the formal definition of our sieve space.

Definition 1. Define D_k = Θ_k × Ω, where Θ_k stands for the increasing piecewise-constant functions from [0, 1] to R^{d_x} with knots at i/k, i = 0, 1, ..., k − 1. In other words, for any β ∈ Θ_k, β_j is a piecewise-constant function on the intervals [i/k, (i+1)/k) for 0 ≤ i ≤ k − 1 and 1 ≤ j ≤ d_x. Furthermore, let P_k denote the product space of all d_x-dimensional piecewise-constant functions and R^{d_σ}.

The L² distance of the space D_k to the true parameter θ_0 satisfies d_2(θ_0, D_k) ≤ C/k for some generic constant C.
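As a concrete picture of Definition 1, the following minimal sketch (our own construction) stores β(·) as a (k, d_x) matrix of values on the knots and imposes monotonicity by parametrizing increments as non-negative numbers; the parametrization is an assumption of this sketch rather than the paper's.

import numpy as np

def make_sieve_beta(beta_at_zero, increments):
    """Build an increasing step function: beta_at_zero is (d_x,), increments is
    a non-negative (k-1, d_x) array. Returns a (k, d_x) matrix of knot values."""
    return np.vstack([beta_at_zero,
                      beta_at_zero + np.cumsum(np.abs(increments), axis=0)])

def eval_beta(beta_grid, tau):
    """Evaluate the piecewise-constant beta(.) at quantile(s) tau in [0, 1)."""
    k = beta_grid.shape[0]
    idx = np.minimum((np.asarray(tau) * k).astype(int), k - 1)
    return beta_grid[idx]

# Example with k = 5 cells and d_x = 2 coefficients (hypothetical values)
beta_grid = make_sieve_beta(np.array([0.0, 1.0]), np.full((4, 2), 0.3))
print(eval_beta(beta_grid, [0.05, 0.55, 0.95]))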
The sieve estimator is defined as follows:

Definition 2 (Sieve Estimator).

    (β̂_k, σ̂_k) = arg max_{(β,σ) ∈ D_k} E_n[log g(y|x, β, σ)].    (4.1)

Let d(θ_1, θ_2)² := E[ ∫_R (g(y|x, θ_1) − g(y|x, θ_2))² / g(y|x, θ_0) dy ] be a pseudo-metric on the parameter space D. We know that d(θ, θ_0)² ≤ L(θ_0|θ_0) − L(θ|θ_0). Let ||·||_d be a norm such that ||θ_0 − θ||_d ≤ C d(θ_0, θ). Our metric ||·||_d is chosen to be

    ||θ||_d² := ⟨θ, I(u, v)[θ]⟩.    (4.2)

By Condition C4, ||·||_d is indeed a metric.

For the sieve space D_k and any θ_k ∈ D_k, define Ĩ as the following matrix:

    Ĩ := E[ ( ∫_0^{1/k} f_τ dτ / g, ∫_{1/k}^{2/k} f_τ dτ / g, ..., ∫_{(k−1)/k}^{1} f_τ dτ / g, g_σ/g )' ( ∫_0^{1/k} f_τ dτ / g, ∫_{1/k}^{2/k} f_τ dτ / g, ..., ∫_{(k−1)/k}^{1} f_τ dτ / g, g_σ/g ) ].

Again, by Condition C4, this matrix Ĩ is non-singular. Furthermore, if a certain discontinuity condition is assumed, the smallest eigenvalue of Ĩ can be shown to be bounded from below by a polynomial in 1/k.

Condition C8 (Discontinuity of f). Suppose there exists a positive integer λ such that f ∈ C^{λ−1}(R) and the λth-order derivative of f equals

    f^{(λ)}(x) = h(x) + δ(x − a),    (4.3)

where h(x) is a bounded function that is L1-Lipschitz except at a, and δ(x − a) is a Dirac δ-function at a.

Remark. The Laplace distribution satisfies the above assumption with λ = 1. There is an intrinsic link between this discontinuity condition and the tail behavior of the characteristic function stated in the (OS) condition. Because φ_{f'}(s) = is·φ_f(s), we know that φ_f(s) = φ_{f^{(λ)}}(s)/(is)^λ, while φ_{f^{(λ)}}(s) = O(1) under the above assumption. Therefore, the discontinuity assumption implies that χ_f(s) := sup |1/φ_f(s)| ≤ C(1 + s^λ). In general, these results hold as long as the number of Dirac functions in f^{(λ)} in (4.3) is finite.

The following lemma establishes the decay rate of the minimum eigenvalue of Ĩ.

Lemma 7. If the function f satisfies Condition C8 with degree λ > 0, then
(1) the minimum eigenvalue of Ĩ, denoted r(Ĩ), satisfies 1/k^λ ≲ r(Ĩ);
(2) 1/k^λ ≲ sup_{θ ∈ P_k} ||θ||_d / ||θ||.

Proof. See Appendix B.3. □

The following lemma establishes the consistency of the sieve estimator.

Lemma 8 (Sieve Estimator Consistency). If Conditions C1-C6 and C9 hold, the sieve estimator defined in (4.1) is consistent. Given the identification assumptions in the last section, if the problem is identified, then there exists a neighborhood N of θ_0 such that for any θ ∈ N, θ ≠ θ_0, we have

    L(θ_0|θ_0) − L(θ|θ_0) ≥ E_x[ ∫_Y (g(y|x, θ_0) − g(y|x, θ))² / g(y|x, θ_0) dy ] > 0.    (4.4)

Proof. See Appendix B.3. □

Unlike the usual sieve estimation problem, our problem is ill-posed, with eigenvalues decaying at rate k^{−λ}. However, the curse of dimensionality does not appear because of comonotonicity: all entries of the vector of functions β(·) are functions of the single variable τ. It is therefore possible to use sieve estimation to approximate the true functional parameter with k growing more slowly than √n. We summarize the tail property of f in the following condition:

Condition C9. For any θ_k ∈ Θ_k, assume there exists a generic constant C such that Var( (f_{β_k}/g_0, g_σ/g_0) ) < C for any β_k and σ in a fixed neighborhood of θ_0.

Remark. The above condition holds for the normal, Laplace, and beta distributions.

Condition C10 (Eigenvector of Ĩ). Suppose the smallest eigenvalue of Ĩ satisfies r(Ĩ) = c k^{−λ}. Suppose v is a normalized eigenvector associated with r(Ĩ), and that k^{−α} ≲ min_i |v_i| for some fixed α ≥ 1/2.
Theorem 1. Under Conditions C1-C5 and C8-C10, the following results hold for the sieve-ML estimator:
(1) If the number of knots k satisfies the growth conditions
    (a) k^{2λ+1}/√n → 0,
    (b) k^{λ+r}/√n → ∞,
    then ||θ̂ − θ_0|| = O_p(k^λ/√n).
(2) If Condition C10 holds and the growth conditions
    (a) k^{2λ+1}/√n → 0,
    (b) k^{λ+r−α}/√n → ∞
    hold, then
    (a) for every i = 1, ..., k and j = 1, ..., d_x, there exists a sequence μ_{ijk} with μ_{ijk}/k^r → 0 and k^{λ−α}/μ_{ijk} = O(1) such that

        μ_{ijk}(β̂_{k,j}(τ_i) − β_{0,j}(τ_i)) →_d N(0, 1);

    (b) for the parameter σ, there exists a positive definite matrix V of dimension d_σ × d_σ such that the sieve estimator satisfies

        √n(σ̂_k − σ_0) →_d N(0, V).

Proof. See Appendix B.3. □

Once we fix the number of interior knots, we can perform MLE over the sieve space to compute the sieve estimator. We discuss how to compute the sieve-ML estimator in the next section.

4.1. Inference via Bootstrap. In the last section we proved asymptotic normality of the sieve-ML estimator θ̂ = (β̂(τ), σ̂). However, computing the convergence rate μ_{ijk} for β̂_{k,j}(τ_i) by explicit formula can be difficult in general. To conduct inference, we recommend the nonparametric bootstrap. Define (x_i^B, y_i^B) as a resampling of the data (x_i, y_i) with replacement, and define the estimator

    θ̂_k^B = arg max_{θ_k ∈ Θ_k} E_n^B[log g(y_i^B|x_i^B, θ_k)],    (4.5)

where E_n^B denotes the empirical average over the resampled data (x_i^B, y_i^B). Chen and Pouzo (2013) establish the validity of the nonparametric bootstrap in semiparametric models for a general functional of a parameter θ. The lemma below is an implementation of Theorem 5.1 of Chen and Pouzo (2013).

Lemma 9 (Validity of the Bootstrap). Under Conditions C1-C6 and C9, choosing the number of knots k according to the conditions stated in Theorem 1, the bootstrap estimator defined in equation (4.5) satisfies

    ( β̂_{k,j}(τ)^B − β̂_{k,j}(τ) ) / μ_{ijk} →*_d N(0, 1).    (4.6)

Proof. See Appendix B.3. □

5. Algorithm for Computing the Sieve MLE

5.1. Global Optimization and Flexible Distribution Families. In the maximization problem (4.1), the dimension of the parameter is potentially large. As the number of knots k grows, the dimensionality of the parameter space M becomes large compared to typical econometric applications. In addition, the parameter space Ω can be complex if we consider more flexible distribution families for the EIV, e.g., mixtures of normals.

More specifically, assume the number of mixture components is w. Then the parameter space Ω is a (3w − 2)-dimensional space. Consider (p, μ, σ) ∈ Ω. Here p ∈ [0, 1]^{w−1} is a (w − 1)-dimensional vector containing the weights of the first w − 1 components; the weight of the last component is 1 − Σ_{i=1}^{w−1} p_i. The vector μ ∈ R^{w−1} contains the means of the first w − 1 components; the mean of the last component is −Σ_{i=1}^{w−1} p_i μ_i / (1 − Σ_{i=1}^{w−1} p_i), so that the mixture has mean zero. Finally, σ ∈ R^w is the vector of standard deviations of the components. In general, if the number of knots is k and the number of mixture components is w, the total number of parameters is d_x k + 3w − 2.

Since the number of parameters is potentially large and the maximization problem (4.1) is typically highly nonlinear, standard (local) optimization algorithms generally do not work well and can become trapped at a local optimum. We recommend using global optimization algorithms, such as the genetic algorithm or the efficient global optimization (EGO) algorithm, to compute the proposed sieve estimator.
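For illustration, the following minimal sketch (our own, not the paper's code) computes the sieve MLE in (4.1) with a global optimizer. scipy's differential_evolution is used here only as a stand-in for the genetic algorithm recommended in the text, and the parametrization (non-negative increments for a monotone β, a single normal EIV scale) is an assumption of this sketch.

import numpy as np
from scipy.optimize import differential_evolution
from scipy.stats import norm

def neg_loglik(params, y, X, k):
    d = X.shape[1]
    base = params[:d]                                    # beta at the first knot
    incr = params[d:d + (k - 1) * d].reshape(k - 1, d)   # non-negative increments
    sigma = np.exp(params[-1])                           # EIV scale (normal case)
    beta_grid = np.vstack([base, base + np.cumsum(incr, axis=0)])   # (k, d)
    resid = y[:, None] - X @ beta_grid.T
    g = norm.pdf(resid, scale=sigma).mean(axis=1)        # mixture density (2.7)
    return -np.mean(np.log(g + 1e-300))

def sieve_mle_global(y, X, k, seed=0):
    d = X.shape[1]
    bounds = ([(-5, 5)] * d                   # beta at the first knot
              + [(0, 2)] * ((k - 1) * d)      # increments kept non-negative
              + [(-3, 2)])                    # log sigma
    result = differential_evolution(neg_loglik, bounds, args=(y, X, k),
                                    seed=seed, maxiter=200)
    return result.x, -result.fun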
While such global algorithms have good precision, they can be slow. We therefore provide, in the next subsection, an EM algorithm that is locally computationally tractable.

5.2. EM Algorithm for Finding Local Optima of the Likelihood Function. In this section, we propose a special EM algorithm that can be used in applications to generate a consistent estimator.

Definition 3. Define

    κ(τ|x, y, θ) := f(y − xβ(τ)|σ) / ∫_0^1 f(y − xβ(u)|σ) du

as the Bayesian probability of τ given (x, y, θ). Define

    H(θ'|θ) = E_{x,τ}[ log κ(τ|x, y, θ') | x, y, θ ]
            = E_{x,τ}[ ∫_{−∞}^{∞} ( f(y − xβ(τ)|σ) / ∫_0^1 f(y − xβ(u)|σ) du ) log( f(y − xβ'(τ)|σ') / ∫_0^1 f(y − xβ'(u)|σ') du ) dy ]

as the expectation of the log Bayesian likelihood over τ ∈ [0, 1], conditional on the observation (x, y) and the parameter θ', given a prior θ. Define

    Q(θ'|θ) = E_{x,τ}[ κ(τ|x, y, θ) log f(y − xβ'(τ)|σ') ] = L(θ') + H(θ'|θ).

In the EM setting, each maximization step weakly increases the likelihood L, and the algorithm therefore leads to a maximizer of the likelihood function L.

Lemma 10 (Local Optimality of the EM Algorithm). The EM algorithm solves the problem

    θ_{p+1} ∈ arg max_θ Q(θ|θ_p),    (5.1)

producing a sequence {θ_p} from an initial value θ_0. If the sequence has the following properties:
(1) the sequence L(θ_p) is bounded, and
(2) L(θ_{p+1}|θ_p) − L(θ_p|θ_p) ≥ 0,
then the sequence θ_p converges to some θ* in D, and by property (2), θ* is a local optimum of the log-likelihood function L.

Proof. See Appendix B.3. □

This lemma says that instead of maximizing the log-likelihood function directly, we can run iterations conditional on the previous estimate of θ to obtain a sequence of estimates θ_p converging to the maximum likelihood estimator. The computational steps are much more attractive than direct maximization of the log-likelihood L(θ), since in each step we only need to perform a partial maximization taking the prior estimates as known parameters. The next section presents a simpler version of the algorithm as weighted least squares when the distribution function f(·|σ) is in the normal family.

Maximization of the Q function based on the previous step is simpler than the global optimization algorithm in the following sense: unlike direct maximization of L(θ) over the space D, maximization of Q(·|θ_p) amounts to solving a separate maximization problem for each τ, each involving many fewer parameters. More specifically, the maximization problem in equation (5.1) is equivalent to the following maximization problem for each τ, given fixed σ':

    (β'(τ), σ') ∈ arg max E_x[ ∫ ( f(y − xβ(τ)|σ) / ∫_0^1 f(y − xβ(u)|σ) du ) log f(y − xβ'(τ)|σ') dy ].    (5.2)

The EM algorithm is much easier to implement than searching the functional space M for the global maximizer of the log-likelihood L(θ). The EM algorithm is useful for obtaining a local optimum, but not necessarily a global optimum. In practice, however, if we start from a solution of (4.1) obtained with the global optimization algorithm and then apply the EM algorithm, we can potentially improve computational speed relative to using the genetic algorithm alone.
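The E-step object in Definition 3 can be computed in a few lines. The sketch below (our own notation, normal EIV assumed for concreteness) returns the posterior weight κ(τ_k|x_i, y_i, θ) of each quantile knot for each observation; these weights are what the M-step uses to reweight each observation's contribution to β(τ_k).

import numpy as np
from scipy.stats import norm

def posterior_weights(y, X, beta_grid, sigma):
    """Return an (n, K) matrix of kappa(tau_k | x_i, y_i); each row sums to one."""
    resid = y[:, None] - X @ beta_grid.T           # quantile-specific residuals
    dens = norm.pdf(resid, scale=sigma)            # f(y - x beta(tau_k) | sigma)
    return dens / dens.sum(axis=1, keepdims=True)  # normalize over the K knots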
5.3. Weighted Least Squares. Under the normality assumption on the EIV term ε, the maximization of Q(·|θ) becomes extremely simple: it reduces to a weighted least squares problem. Suppose the disturbance ε ∼ N(0, σ²). Then the M-step maximization (5.1) becomes the following, with parameter θ' = [β'(·), σ']:

    arg max_{θ'} Q(θ'|θ) := E[ log f(y − xβ'(τ)|σ') κ(τ|x, y, θ) ]
        = E[ ∫_{τ=0}^{1} ( f(y − xβ(τ)|σ) / ∫_0^1 f(y − xβ(u)|σ) du ) ( −(1/2) log(2πσ'²) − (y − xβ'(τ))²/(2σ'²) ) dτ ].    (5.3)

It is easy to see from the above equation that the maximization over β'(·) given θ is a weighted least squares problem. As in standard normal MLE, the FOC for β'(·) does not depend on σ'²; σ'² is solved for after all the β'(τ) have been obtained from equation (5.3). Therefore, the EM algorithm becomes a simple weighted least squares algorithm, which is both computationally tractable and easy to implement in practice. Given an initial estimate of a weighting matrix W, the weighted least squares estimates of β and σ are

    β̂(τ_k) = (X' W_k X)^{−1} X' W_k y,
    σ̂² = (1/(NK)) Σ_k Σ_i w_{ik} ε̂²_{ik},

where W_k is the diagonal matrix formed from the kth column of W, which has elements w_{ik}. Given estimates ε̂_k = y − Xβ̂(τ_k) and σ̂, the weights w_{ik} for observation i in the estimation of β(τ_k) are

    w_{ik} = φ(ε̂_{ik}/σ̂) / ( (1/K) Σ_k φ(ε̂_{ik}/σ̂) ),    (5.4)

where K is the number of τ's in the sieve (e.g., K = 9 if the quantile grid is {τ_k} = {0.1, 0.2, ..., 0.9}) and φ(·) is the pdf of a standard normal distribution.

6. Monte Carlo Simulations

We examine the properties of our estimator in Monte Carlo simulations. Let the data-generating process be

    y_i = β_0(u_i) + x_{1i}β_1(u_i) + x_{2i}β_2(u_i) + ε_i,

where N = 1,000, the conditional quantile u_i of each individual is u_i ∼ U[0, 1], and the measurement error ε_i is drawn from the mixed normal distribution

    ε_i ∼ N(−3, 1) with probability 0.5, N(2, 1) with probability 0.25, and N(4, 1) with probability 0.25.

The covariates are functions of random normals,

    x_{1i} = log|z_{1i}| + 1,  x_{2i} = log|z_{2i}| + 1,

where (z_1, z_2) ∼ N(0, I_2). Finally, the coefficient vector is a function of the conditional quantile u_i of individual i:

    β_0(u) = 1 + 2u − u²,  β_1(u) = (1/2) exp(u),  β_2(u) = u + 1.

Simulation Results. In Figures 2 and 3, we plot the mean bias of quantile regression of y on a constant, x_1, and x_2 and contrast it with the mean bias of our estimator. Quantile regression is badly biased, with lower quantiles biased upward toward the median-regression coefficients and upper quantiles biased downward toward the median-regression coefficients. We also plot bootstrapped 95% confidence intervals (dashed green lines) to show the precision of our estimator. Especially in the case of nonlinear β_1(·) in Figure 2, we reject the quantile-regression coefficients, which fall outside the 95% confidence interval of our estimates.

Figure 2. Monte Carlo Simulation Results: Mean Bias of β_1(τ). [Figure: mean bias by quantile for standard quantile regression and the bias-corrected estimator, with bias-corrected 95% confidence intervals.]

Figure 3. Monte Carlo Simulation Results: Mean Bias of β_2(τ). [Figure: mean bias by quantile for standard quantile regression and the bias-corrected estimator, with bias-corrected 95% confidence intervals.]

Figure 4 shows the true mixed-normal distribution of the measurement error ε as defined above (solid blue line) contrasted with the estimated distribution of the measurement error (dotted red line) and the 95% confidence interval of the estimated density (dashed green lines). Despite the bimodal nature of the true measurement-error distribution, our algorithm successfully estimates its distribution with very little bias and relatively tight confidence bounds.

Figure 4. Monte Carlo Simulation Results: Distribution of Measurement Error. [Figure: true density, estimated density, and 95% confidence interval of the measurement-error distribution.]
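For concreteness, here is a minimal sketch (our own implementation, under the normal-EIV assumption of Section 5.3) of one pass of the weighted-least-squares EM update in (5.3)-(5.4): an E-step computing the weights w_{ik} and an M-step running one weighted regression per quantile knot before updating σ. Monotonicity of β(·) across knots is not imposed in this sketch.

import numpy as np
from scipy.stats import norm

def wls_em_step(y, X, beta_grid, sigma):
    """One EM update; beta_grid is (K, d). Returns updated (beta_grid, sigma)."""
    n, K = len(y), beta_grid.shape[0]
    resid = y[:, None] - X @ beta_grid.T                  # eps_hat_{ik}
    w = norm.pdf(resid / sigma)
    w /= w.mean(axis=1, keepdims=True)                    # weights as in (5.4)

    new_beta = np.empty_like(beta_grid)
    for k in range(K):                                    # (X'W_k X)^{-1} X'W_k y
        XtW = X.T * w[:, k]
        new_beta[k] = np.linalg.solve(XtW @ X, XtW @ y)
    new_resid = y[:, None] - X @ new_beta.T
    new_sigma = np.sqrt(np.sum(w * new_resid ** 2) / (n * K))   # sigma update
    return new_beta, new_sigma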
7. Empirical Application

Angrist et al. (2006) estimate that the heterogeneity in the returns to education across the conditional wage distribution has increased over time, as seen in Figure 5, reproduced from Angrist et al. Estimating the quantile-regression analog of a simple Mincer regression, they estimate models of the form

    q_{y|X}(τ) = β_0(τ) + β_1(τ)·education_i + β_2(τ)·experience_i + β_3(τ)·experience²_i + β_4(τ)·black_i,    (7.1)

where q_{y|X}(τ) is the τth quantile of the conditional (on the covariates X) log-wage distribution, the education and experience variables are measured in years, and black is an indicator variable.

Figure 5. Angrist, Chernozhukov, and Fernandez-Val (2006), Figure 2. [Figure reproduced from Angrist et al. (2006).]

They find that in 1980, an additional year of education was associated with a 7% increase in wages across all quantiles, nearly identical to OLS estimates. In 1990, they again find that most of the wage distribution has a similar education-wage gradient, but higher conditional (wage) quantiles saw a slightly stronger association between education and wages. By 2000, the education coefficient was almost seven percentage points higher for the 95th percentile than for the 5th percentile.

We observe a different pattern when we correct for measurement-error bias in the self-reported wages in the Angrist et al. census data and compare results from our estimation procedure with standard quantile regression.⁸ We first use our genetic algorithm to maximize the sieve likelihood with an EIV density f that is a mean-zero mixture of three normal distributions with unknown variances. The genetic algorithm excels at optimizing over non-smooth surfaces where other algorithms would be sensitive to starting values. However, running the genetic algorithm on our entire dataset is computationally costly, since the data contain over 65,000 observations for each year and our parameter space is large. To speed up convergence, we average coefficients estimated on 100 subsamples of 2,000 observations each. This procedure gives consistent parameter estimates and nonparametric confidence intervals provided the ratio of the subsample size to the original number of observations goes to zero in the limit. The 95% nonparametric confidence intervals are calculated pointwise as the .025 and .975 quantiles of the empirical distribution of the subsampling parameter estimates.

⁸ We are able to replicate the Angrist et al. quantile regression results (shown in the dashed blue line) almost exactly. We estimate β(τ) over a τ grid of 99 knots stretching from 0.01 to 0.99. However, we do not report results for quantiles τ < .1 or τ > .9 because these results are unstable and noisy.

Figure 6 plots the estimated distribution of the measurement error (solid blue line) in the 2000 data, and the dashed red lines report the 95% confidence bounds on this estimate, obtained via subsampling. The estimated EIV distributions for the 1980 and 1990 data look similar; we find that the distribution of the measurement error in self-reported log wages is approximately Gaussian.

Figure 6. Estimated Distribution of Wage Measurement Error. [Figure: estimated density and 95% confidence intervals.] Notes: Graph plots the estimated distribution of the measurement error.
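The subsampling scheme described above is straightforward to implement. The following minimal sketch (our own, with a hypothetical estimator callback) draws subsamples without replacement, averages the coefficient estimates, and forms pointwise percentile confidence bands from the subsample draws.

import numpy as np

def subsample_estimates(y, X, fit_fn, n_sub=100, m=2000, seed=0):
    """fit_fn(y_s, X_s) -> parameter vector; returns an (n_sub, p) array."""
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_sub):
        idx = rng.choice(len(y), size=m, replace=False)   # subsample, no replacement
        draws.append(fit_fn(y[idx], X[idx]))
    return np.asarray(draws)

def combine(draws):
    point = draws.mean(axis=0)                            # averaged coefficients
    lo, hi = np.quantile(draws, [0.025, 0.975], axis=0)   # pointwise 95% bands
    return point, lo, hi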
Figure 7 plots the education coefficient β̂_1(τ) from estimating equation (7.1) by weighted least squares.⁹ While not causal estimates of the effect of education on wages, the results suggest that in 1980, education was more valuable at the middle and top of the conditional income distribution. This pattern contrasts with the standard quantile-regression results, which show constant returns to education across quantiles of the conditional wage distribution (as discussed above).

⁹ Because preliminary testing has shown that the weighted-least-squares approach improves upon the optimum reached by the genetic algorithm, we continue to refine our genetic algorithm. The coefficient results we report below are from weighted least squares, which assumes normality of the measurement error. As Figure 6 shows, this assumption appears to be consistent with the data.

Figure 7. Returns to Education Correcting for LHS Measurement Error. [Figure: three panels (1980, 1990, 2000) plotting the education coefficient by quantile for standard quantile regression and bias-corrected quantile regression.] Notes: Graphs plot education coefficients estimated using quantile regression and our bias-correcting algorithm. The dependent variable is log wage for prime-age males, and other controls include a quadratic in years of experience and a dummy for blacks. The data come from the indicated decennial Census and consist of 40-49 year-old white and black men born in America. The number of observations in each sample is 65,023, 86,785, and 97,397 in 1980, 1990, and 2000, respectively.

For 1990, the pattern of increasing returns to education for higher quantiles is again visible, with the very highest quantiles seeing a large increase in the education-wage gradient. The estimated return to education for the .9 quantile is over twice as large as estimates obtained from traditional quantile regression.

In the 2000 decennial census, the returns to education appear nearly constant across most of the distribution, with two notable exceptions. First, the lowest income quantiles have a noticeably lower wage gain from a year of education than middle-income quantiles. Second, the increases in the education-wage premium at the top of the distribution that were present in the 1990 results are even larger in 2000: the 90th percentile of the conditional wage distribution in 2000 increases by over 30 log points for every year of education. Over time, especially between 1980 and 1990, we see an overall increase in the returns to education. While this trend is consistent with the observations of Angrist et al. (2006), the distributional story that emerges from correcting for measurement error suggests that the returns to education have become disproportionately concentrated in high-wage professions.

8. Conclusion

In this paper, we develop a methodology for estimating the functional parameter β(·) in quantile regression models when there is measurement error in the dependent variable. Assuming that the measurement error follows a distribution that is known up to a finite-dimensional parameter, we establish general convergence-rate results for the MLE-based approach. Under a discontinuity assumption (C8), we establish the convergence rate of the sieve-ML estimator. We use genetic algorithms to maximize the sieve likelihood and show that when the distribution of the EIV is normal, the EM algorithm becomes an iterative weighted least squares algorithm that is easy to compute. We demonstrate the validity of bootstrapping based on the asymptotic normality of our estimator and suggest a nonparametric bootstrap procedure for inference.

Finally, we revisit the Angrist et al. (2006) question of whether the returns to education across the wage distribution have been changing over time.
We find a somewhat different pattern than prior work, highlighting the importance of correcting for errors in the dependent variable of conditional quantile models. When we correct for likely measurement error in the self-reported wage data, the wages of low-wage men had much less to do with their education levels than they did for those at the top of the conditional-wage distribution. By 2000, the wage-education gradient was nearly constant across most of the wage distribution, with somewhat lower returns for low-wage earners and significant growth in the predictive power of education for top wages.

References

[1] Angrist, J., Chernozhukov, V. and Fernandez-Val, I. (2006), "Quantile Regression under Misspecification, with an Application to the U.S. Wage Structure," Econometrica, 74, 539-563.
[2] Burda, M., Harding, M. and Hausman, J. (2010), "A Poisson Mixture Model of Discrete Choice," Journal of Econometrics, 166(2), 184-203.
[3] Carroll, R. J. and Wei, Y. (2009), "Quantile Regression With Measurement Error," Journal of the American Statistical Association, 104(487), 1129-1143.
[4] Chen, S. (2002), "Rank Estimation of Transformation Models," Econometrica, 70(4), 1683-1697.
[5] Chen, X. (2009), "Large Sample Sieve Estimation of Semiparametric Models," Cowles Foundation Paper No. 1262.
[6] Chen, X. and Pouzo, D. (2013), "Sieve Quasi Likelihood Ratio Inference on Semi/Nonparametric Conditional Moment Models," Cowles Foundation Paper No. 1897.
[7] Chernozhukov, V., Fernandez-Val, I. and Galichon, A., "Improving Point and Interval Estimates of Monotone Functions by Rearrangement."
[8] Chernozhukov, V. and Hansen, C. (2005), "An IV Model of Quantile Treatment Effects," Econometrica, 73(1), 245-261.
[9] Chernozhukov, V. and Hansen, C. (2008), "Instrumental Variable Quantile Regression: A Robust Inference Approach," Journal of Econometrics, 142(1), 379-398.
[10] Cosslett, S. R. (2004), "Efficient Semiparametric Estimation of Censored and Truncated Regressions via a Smoothed Self-Consistency Equation," Econometrica, 72, 1277-1293.
[11] Cosslett, S. R. (2007), "Efficient Estimation of Semiparametric Models by Smoothed Maximum Likelihood," International Economic Review, 48, 1245-1272.
[12] Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977), "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, Series B, 39(1), 1-38.
[13] Evdokimov, K., "Identification and Estimation of a Nonparametric Panel Data Model with Unobserved Heterogeneity," mimeo.
[14] Hausman, J. A., Lo, A. W. and MacKinlay, A. C. (1992), "An Ordered Probit Analysis of Transaction Stock Prices," Journal of Financial Economics, 31(3), 319-379.
[15] Hausman, J. A., Abrevaya, J. and Scott-Morton, F. (1998), "Misclassification of a Dependent Variable in a Qualitative Response Model," Journal of Econometrics, 87, 239-269.
[16] Hausman, J. A. (2001), "Mismeasured Variables in Econometric Analysis: Problems from the Right and Problems from the Left," Journal of Economic Perspectives, 15, 57-67.
[17] Horowitz, J. L. (2011), "Applied Nonparametric Instrumental Variables Estimation," Econometrica, 79, 347-394.
[18] Kuhn, T. (1990), "Eigenvalues of Integral Operators with Smooth Positive Definite Kernels," Archiv der Mathematik, 49(6), 525-534.
[19] Shen, X. (1997), "On Methods of Sieves and Penalization," The Annals of Statistics, 25.
[20] Schennach, S. M. (2008), "Quantile Regression With Mismeasured Covariates," Econometric Theory, 24(4), 1010-1043.

Appendix A. Deconvolution Method

Although the direct MLE method is used to obtain identification and some asymptotic properties of the estimator, the main asymptotic property of the maximum likelihood estimator depends on the discontinuity assumption (C8). Evdokimov (2011) establishes a deconvolution method that solves a similar problem in a panel-data setting. He adopts a nonparametric setting and therefore needs kernel estimation methods. In the special case of the quantile regression setting, we can instead exploit the linear structure without using nonparametric methods. Deconvolution offers a straightforward view of the estimation procedure.

The CDF of a random variable w can be computed from its characteristic function φ_w(s):

    F(w) = 1/2 − lim_{q→∞} ∫_{−q}^{q} ( e^{−iws} / (2πis) ) φ_w(s) ds.    (A.1)

β_0(τ) satisfies the conditional moment condition

    E[F(xβ_0(τ)|x)] = τ.    (A.2)

Since we only observe y = xβ(τ) + ε, we have φ_{xβ}(s) = E[exp(isy)]/φ_ε(s). Therefore β_0(τ) satisfies

    E[ 1/2 − τ − lim_{q→∞} ∫_{−q}^{q} E[e^{−i(y−xβ_0(τ))s}|x] / (2πis φ_ε(s)) ds | x ] = 0.    (A.3)

In the above expression, the limit in front of the integral cannot be exchanged with the conditional expectation that follows; in practice, however, they can be exchanged if the integral is truncated. A simple way to exploit the information in the above conditional moment equation is to consider the following moment conditions:

    E[ x ( 1/2 − τ + Im( lim_{q→∞} ∫_{−q}^{q} E[e^{−i(y−xβ_0(τ))s}|x] / (2πs φ_ε(s)) ds ) ) ] = 0.    (A.4)

Let F(xβ, y, τ, q) = 1/2 − τ − Im( ∫_{−q}^{q} e^{−i(y−xβ)s} / (2πs φ_ε(s)) ds ). Given a fixed truncation threshold q, the optimal estimator of the kind E_n[f(x)F(xβ, y, τ, q)] is

    E_n[ ( E[∂F/∂β | x] / σ(x)² ) F(xβ, y, τ, q) ] = 0,    (A.5)

where σ(x)² := E[F(xβ, y, τ)²|x]. Another convenient estimator is

    arg min_β E_n[F(xβ, y, τ, q)²].    (A.6)

These estimators behave similarly in terms of convergence rate, so we discuss only the optimal estimator. Although it cannot be computed directly, because the weight (∂F/∂β)/σ(x)² depends on β, it can be computed by an iterative method.

Theorem 2. (a) Under condition OS with λ ≥ 1, the estimator β̂ satisfying equation (A.5) with truncation threshold q_n = C N^{1/(2λ)} satisfies

    β̂ − β_0 = O_p(N^{−1/(2λ)}).    (A.7)

(b) Under condition SS, the estimator β̂ satisfying equation (A.5) with truncation threshold q_n = C (log n)^{1/λ} satisfies

    β̂ − β_0 = O_p((log n)^{−1/λ}).    (A.8)

Appendix B. Proofs of Lemmas and Theorems

B.1. Lemmas and Theorems in Section 2. In this section we prove the lemmas and theorems of Section 2.

Proof of Lemma 1.

Proof.
For any sequence of monotone functions f1 , f2 , ..., fn , ... with each one mapping [a, b] into some closed interval [c, d]. For bounded monotonic functions, point convergence means uniform convergence, therefore this space is compact. Hence the product space B1 × B2 × ... × Bk is compact. It is complete since L2 functional space is complete and limit of monotone functions is still monotone. � Proof of Lemma 2. Proof. WLOG, under condition C.1-C.3, we can assume the variable x1 is continuous. If there exists β(·) and f (·) which generates the same density g(y|x, β(·), f ) as the true paramter β0 (·) and f0 (·), then by applying Fourier transformation, φ(s) ˆ 1 0 Denote m(s) = m(s) ˆ 0 φ(s) φ0 (s) =1+ 1 exp(isx−1 β−1 (τ )) exp(isβ(τ ))dτ = φ0 (s) �∞ 1 exp(isβ0 (τ ))dτ. 0 k=2 ak (is) k around a neighborhood of 0. Therefore, ∞ � (is)k xk β1 (τ )k i=0 ˆ 1 k! dτ = ˆ 0 1 exp(isx−1 β0,−1 (τ )) ∞ � (is)k xk β0,1 (τ )k 1 i=0 k! dτ. Since x1 is continuous, then it must be that the corresponding polynomials of x1 are the same for both sides. Namely, ERRORS IN THE DEPENDENT VARIABLE OF QUANTILE REGRESSION MODELS (is)k m(s) k! ˆ 1 (is)k exp(isx−1 β−1 (τ ))β1 (τ ) dτ = k! k 0 1 ˆ 0 25 exp(isx−1 β0.−1 (τ ))β0,1 (τ )k dτ. Divide both sides of the above equation by (is)k /k!, as let s approaching 0, we get: ˆ 1 k β1 (τ ) dτ = 0 ˆ 1 β0,1 (τ )k dτ. 0 By assumption C.2, β1 (·) and β0,1 (·) are both strictly monotone, differentiable and greater than or equal to 0. So β1 (τ ) = β0,1 (τ ) for all τ ∈ [0, 1]. Now consider the same equation considered above. divide both sides by (is)k /k!, we get ´1 ´1 m(s) 0 exp(isx−1 β−1 (τ ))β0,1 (τ )k dτ = 0 exp(isx−1 β0.−1 (τ ))β0,1 (τ )k dτ,for all k ≥ 0. Since β0,1 (τ )k , k ≥ 1 is a functional basis of L2 [0, 1], therefore m(s) exp(isx−1 β−1 (τ )) = exp(isx−1 β0,−1 (τ )) for all s in a neighborhood of 0 and all τ ∈ [0, 1]. If we differentiate both sides with respect to s and evaluate at 0 (notice that m� (0) = 0), we get: x−1 β−1 (τ ) = x−1 β0,−1 (τ ), for all τ ∈ [0, 1]. By assumption C.2, E[x� x] is non-singular. Therefore E[x�−1 x−1 ] is also non-singular. Hence, the above equation suggests β−1 (τ ) = β0,−1 (τ ), for all τ ∈ [0, 1]. Therefore, β(τ ) = β0 (τ ).This implies that φ(s) = φ0 (s). Hence f (ε) = f0 (ε). � Proof of Lemma 3. Proof. Proof: (a) If the equation stated in (a) holds a.e., then we can apply Fourier transformation on both sides, i.e., conditional on x, ˆ e isy ˆ 1 fy (y − xβ(τ ))xt(τ )dτ dy = 0 ˆ e isy ˆ 0 1ˆ 1 0 fσ (y − xβ(τ ))dτ dy. (B.1) Let q(τ ) = xβ(τ ), which is a strictly increasing function. We can use change of variables to change τ to q. We get: ˆ 1 ˆ ˆ 1 ˆ isxβ(τ ) is(y−xβ(τ )) � isxβ(τ ) e xt(τ ) e f (y − xβ(τ ))dydτ = e eis(y−xβ(τ )) fσ (y − xβ(τ ))dydτ. 0 0 (B.2) or, equivalently, ˆ 1 e isxβ(τ ) xt(τ )dτ (−is)φε (s) = 0 ´ � ´ given that f (y)eisy dy = −is f (y)eisy dy. ˆ 0 1 eisxβ(τ ) dτ ∂φε (s) , ∂σ (B.3) 26 HAUSMAN, LUO, AND PALMER Denote ´1 0 eisxβ(τ ) xt(τ )dτ = q(x, s), and ´1 0 eisxβ(τ ) dτ = r(x, s). Therefore, ∂φε (s) . ∂σ ε (s) First we consider normal distribution. φε (s) = exp(−σ 2 s2 /2). So ∂φ∂σ = −σs2 φε (s). Therefore, q(s, x) = isσr(s, x), for all s and x with positive probability. Since xβ(τ ) is strictly increasing, let z = xβ(τ ), and I(y) = 1(xβ(0) ≤ y ≤ xβ(1)). ´∞ ´ ∞ isw xt(z −1 (w)) 1 So q(s, x) = −∞ eisw { xz { xz � (z −1 I(w)}dw. � (z −1 (w)) I(w)}dw, and r(s, x) = −∞ e (w)) q(s, x) = isσr(s, x) means that there is a relationship between the integrands. 
More specifically, it must be that −isq(s, x)φ(s) = r(s, x) xt(z −1 (w)) 1 I(w) = d � −1 I(w)/dw. (B.4) � −1 xz (z (w)) xz (z (w)) Obviously they don’t equal to each other for a continuous interval of xi , because the latter function has a δ function component. ε (s) ε (s) For more general distribution function f , −isq(s, x)φε (s) = r(s, x) ∂φ∂σ , with ∂φ∂σ = � ∞ 2 j φε (s)s m(s), for some m(s) := j=0 aj (is) . Therefore, −ism(s)r(s, x) = q(s, x). Consider local expansion of s in both sides, we get: −is( ∞ � j=0 aj (is)j )( ∞ ˆ � k=0 1 (xβ(τ ))k dτ 0 ∞ � (is)k )= k! j=0 ˆ 1 (xβ(τ ))k xt(τ )dτ 0 (is)k . k! Compare the k th polynomial of s, we get: ´1 (xβ(τ ))k xt(τ )dτ = 0, if k = 0. ´01 ´1 j−1 � k j j−1 dτ (is) 1≤j<k aj (is) ( 0 xβ(τ )) 0 (xβ(τ )) xt(τ )dτ = − (j−1)! . Let x1 be the continuous variable, and x−1 be the variables left. Let β1 be the coefficient corresponds to x1 and let β−1 be the coefficient corresponds to x−1 . Let t1 be the component in t that corresponds to x1 , and let t−1 be the component in t that corresponds to x−1 . Then, consider the (k + 1)th order polynomial of x1 , we get: ˆ 1 β1 (τ )k t1 (τ )dτ = 0. 0 since β1 (τ )k , k ≥ 0 form a functional basis in L2 [0, 1], t1 (τ ) must equal to 0. Consider the k th order polynomial of x1 , we get: ˆ 0 1 β1 (τ )k (x−1 t−1 (τ ))dτ = 0. ERRORS IN THE DEPENDENT VARIABLE OF QUANTILE REGRESSION MODELS 27 So x−1 t−1 (τ ) = 0 for all τ ∈ [0, 1]. Since E[x−1 x�−1 ] has full rank, it must be that t−1 (τ ) = 0. Hence local identification condition holds. � B.2. Lemmas and Theorems in Section 3. In this subsection, we prove the Lemmas and Theorems in section 3. Lemma 11 (Donskerness of D). The parameter space D is Donsker. Proof. By theorem 2.7.5 of Van der Vaart and Wellner (2000), the space of bounded monotone function, denoted as F, satisfies: r. 1 logN[] (ε, F, Lr (Q)) ≤ K , ε for every probability measure Q and every r ≥ 1, and a constant K which depends only on Since Dis a product space of bounded monotone functions M and a finite dimensional bounded compact set Ω, therefore the braketing number of D given any measure Q is also bounded by: 1 logN[] (ε, D, Lr (Q)) ≤ Kdx , ε Therefore, ´δ� log N[] (ε, D, Q) < ∞, i.e., D is Donsker. 0 � Proof of Lemma 5. Proof. Ln (θ) ≥ Ln (θ0 ) by definition, and we know that L(θ0 ) ≥ L(θ) by information inequality. By Lemma 10, since D is Donsker, then the following stochastic equicontinuity condition holds: √ n(Ln (θ) − Ln (θ1 ) − (L(θ) − L(θ1 ))) →p rn , for some generic term rn = op (1), if |θ − θ1 | ≤ ηn for small enough ηn . Therefore uniform convergence of Ln (θ) − L(θ) holds in D. In addition, by Lemma 2, the parameter θ is globally identified. Also, it is obvious that L is continuous in θ. Therefore, the usual consistency result of MLE estimator applies here. � Proof of Lemma 6. Proof. By definition, En [log(g(y|x, β, σ))] ≥ En [log(g(y|x, β0 , σ0 ))]. (B.5) And we know that by information inequality, E[log(g(y|x, β, σ))] ≤ E[log(g(y|x, β0 , σ0 ))]. Therefore, E[log(g(y|x, β, σ))]−E[log(g(y|x, β0 , σ0 ))] ≥ √1n Gn [log(g(y|x, β, σ))−g(y|x, β0 , σ0 ))]. By Donskerness of parameter space D, the following equicontinuity condition holds: 28 HAUSMAN, LUO, AND PALMER Hence, 1 0 ≤ E[log(g(y|x, β0 , σ0 ))] − E[log(g(y|x, β, σ))] �p √ . (B.6) n √ Because σ is converging to the true parameter in n speed, we will simply use σ0 here in the argument because it won’t affect our analysis later. In the notation, I will abbreviate σ0 . Let z(y, x) = g(y|β, x) − g(y|β0 , x). 
By (B.6) (see also equation (3.19)), the Kullback–Leibler discrepancy between $g(y|x,\beta_0)$ and $g(y|x,\beta)$, denoted $D(g(\cdot|\beta_0)\,\|\,g(\cdot|\beta))$, is bounded by $C/\sqrt{n}$ with probability approaching one. By Pinsker's inequality,
\[
E_x\big[\|z(\cdot|x)\|_1\big]^2 \le E_x\big[\|z(\cdot|x)\|_1^2\big] \le 2\,E_x\big[D(g(\cdot|x,\beta_0)\,\|\,g(\cdot|x,\beta))\big]
= 2\big(E[\log g(y|x,\beta_0,\sigma_0)] - E[\log g(y|x,\beta,\sigma)]\big) \lesssim_p \frac{1}{\sqrt{n}},
\]
where $\|z(\cdot|x)\|_1 = \int_{-\infty}^{\infty}|z(y|x)|\,dy$.

Now consider the deconvolved characteristic functions conditional on $x$:
\[
\phi_{x\beta}(s) = \frac{\int_{-\infty}^{\infty} g(y|\beta,x)e^{isy}\,dy}{\phi_\varepsilon(s)},
\qquad
\phi_{x\beta_0}(s) = \frac{\int_{-\infty}^{\infty} g(y|\beta_0,x)e^{isy}\,dy}{\phi_\varepsilon(s)}.
\]
Therefore
\[
\phi_{x\beta}(s) - \phi_{x\beta_0}(s) = \frac{\int_{-\infty}^{\infty} z(y|x)e^{isy}\,dy}{\phi_\varepsilon(s)},
\qquad
|\phi_{x\beta}(s) - \phi_{x\beta_0}(s)| \le \frac{\|z(\cdot|x)\|_1}{|\phi_\varepsilon(s)|}. \tag{B.7}
\]
Let
\[
F_\eta(w) = \frac12 - \lim_{q\to\infty}\int_{-q}^{q}\frac{e^{-iws}}{2\pi i s}\,\phi_\eta(s)\,ds.
\]
Then $F_\eta(w)$ is the CDF of the random variable $\eta$. Let $F_\eta(w,q) = \frac12 - \int_{-q}^{q}\frac{e^{-iws}}{2\pi i s}\phi_\eta(s)\,ds$ denote its truncated version. If the p.d.f. of $\eta$ is a $C^1$ function and $\eta$ has a finite second moment, then $F_\eta(w,q) = F_\eta(w) + O(1/q^2)$. Since $x$ and $\beta$ are bounded, taking $\eta = x\beta(U)$ with $U \sim U[0,1]$, there exists a common $C$ such that $|F_{x\beta}(w,q) - F_{x\beta}(w)| \le C/q^2$ for all $x \in X$ and all $w \in \mathbb{R}$, for $q$ large enough. Therefore
\[
|F_{x\beta}(w,q) - F_{x\beta_0}(w,q)| = \Big|\int_{-q}^{q}\frac{e^{-iws}}{2\pi i s}\big(\phi_{x\beta}(s) - \phi_{x\beta_0}(s)\big)\,ds\Big|
\le \int_{-q}^{q}\frac{\|z(\cdot|x)\|_1}{2\pi|s|\,|\phi_\varepsilon(s)|}\,ds =: \|z(\cdot|x)\|_1\,\chi(q).
\]
Consider the following moment condition:
\[
E\big[x\big(F_{x\beta_0}(x\beta_0(\tau)) - \tau\big)\big] = 0. \tag{B.8}
\]
Let $F_0$ denote $F_{x\beta_0}$ and $F$ denote $F_{x\beta}$. We replace the moment condition by its truncated version based on $F(w,q)$. Since $F(x\beta(\tau)) = \tau = F_0(x\beta_0(\tau))$, for any fixed $\delta \in \mathbb{R}^{d_x}$ the truncation property gives
\[
\delta' E\big[x\big(F_0(x\beta) - F_0(x\beta_0)\big)\big]
= \delta' E\big[x\big(F_0(x\beta) - F(x\beta)\big)\big]
= \delta' E\big[x\big(F_0(x\beta,q) - F(x\beta,q)\big)\big] + O\Big(\frac{1}{q^2}\Big). \tag{B.9}
\]
The first term on the right-hand side is bounded by $E\big[|\delta' x|\,\chi(q)\,\|z(\cdot|x)\|_1\big] \lesssim E\big[\|z(\cdot|x)\|_1\big]\chi(q)$. Moreover,
\[
\delta' E\big[x\big(F_0(x\beta) - F_0(x\beta_0)\big)\big] = \delta' E\big[x x'\,f_{x\beta_0}(x\beta_0(\tau))\big]\big(\beta(\tau) - \beta_0(\tau)\big)(1 + o(1)),
\]
where $f_{x\beta_0}(x\beta_0(\tau))$ is the p.d.f. of the random variable $x\beta_0(U)$ evaluated at $x\beta_0(\tau)$. Hence, since $E[\|z(\cdot|x)\|_1] \lesssim_p n^{-1/4}$,
\[
\beta(\tau) - \beta_0(\tau) = O_p\Big(\frac{\chi(q)}{n^{1/4}}\Big) + O\Big(\frac{1}{q^2}\Big).
\]
Under the OS condition, $\chi(q) \lesssim q^{\lambda}$, so choosing $q = C n^{1/(4(\lambda+2))}$ gives $\|\beta(\tau) - \beta_0(\tau)\|_2 \lesssim_p n^{-1/(2(\lambda+2))}$. Under the SS condition, $\chi(q) \lesssim C_1(1+|q|^{C_2})\exp(|q|^{\lambda}/C_3)$; choosing $q = C(\log n)^{1/(4\lambda)}$ gives $\|\beta(\tau) - \beta_0(\tau)\|_2 \lesssim_p (\log n)^{-1/(2\lambda)}$. □
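The truncated inversion formula $F_\eta(w,q)$ used in the preceding proof can be checked numerically. The following sketch is purely illustrative: it assumes a standard normal $\eta$, so that $\phi_\eta(s) = e^{-s^2/2}$ and the exact CDF is known, evaluates the equivalent one-sided form $F_\eta(w,q) = \tfrac12 - \tfrac1\pi\int_0^q \mathrm{Im}[e^{-isw}\phi_\eta(s)]/s\,ds$ by quadrature, and compares it with the exact CDF as the truncation point $q$ grows. The function names are ours and are not part of the paper's estimator.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import norm

def truncated_cdf(w, q, phi, n_grid=20000):
    # F_eta(w, q) = 1/2 - int_{-q}^{q} e^{-iws} phi(s) / (2*pi*i*s) ds,
    # rewritten as 1/2 - (1/pi) int_0^q Im[e^{-isw} phi(s)] / s ds.
    s = np.linspace(1e-8, q, n_grid)           # the integrand has a finite limit at s = 0
    integrand = np.imag(np.exp(-1j * s * w) * phi(s)) / s
    return 0.5 - trapezoid(integrand, s) / np.pi

phi_normal = lambda s: np.exp(-0.5 * s ** 2)   # characteristic function of N(0, 1)

w = 1.0
for q in (2.0, 5.0, 10.0, 20.0):
    approx = truncated_cdf(w, q, phi_normal)
    print(f"q = {q:5.1f}   F(w, q) = {approx:.6f}   exact = {norm.cdf(w):.6f}")
```

Because the normal characteristic function decays rapidly, the truncation error here vanishes quickly in $q$; the rate bounds in the proof above concern the general case.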
B.3. Lemmas and Theorems in Sections 4 and 5. In this subsection we prove the lemmas and theorems in Section 4.

Proof of Lemma 8.

Proof. Suppose $f$ satisfies the discontinuity condition C8 with degree $\lambda > 0$. Denote $l_k = (f_{\tau_1}, f_{\tau_2}, \ldots, f_{\tau_k}, g_\sigma)$. For any $\theta = (p_k,\delta) \in P_k$,
\[
\theta I_k \theta' = E\Big[\int_{\mathbb{R}} \frac{(l_k\theta')^2}{g}\,dy\Big] \gtrsim E\Big[\int_{\mathbb{R}} (l_k\theta')^2\,dy\Big],
\]
since $g$ is bounded from above. Assume $c := \inf_{x\in X,\,\tau\in[0,1]} x\beta'(\tau) > 0$, and consider an interval $[y,\,y + \tfrac{1}{2\lambda kc}]$. Let $S(\lambda) := \sum_{i=0}^{\lambda}(C_\lambda^i)^2$, where $C_a^b$ denotes the number of ways of choosing $b$ elements from a set of size $a$. By translation invariance of the integral,
\[
S(\lambda)\,E\Big[\int_{\mathbb{R}}(l_k p_k')^2\,dy\Big]
= E\Big[\sum_{j=0}^{\lambda}(C_\lambda^j)^2\int_{\mathbb{R}}\big(l_k(y + j/(2\lambda ukc))\,p_k'\big)^2\,dy\Big],
\]
and by the Cauchy–Schwarz inequality,
\[
E\Big[\sum_{j=0}^{\lambda}(C_\lambda^j)^2\int_{\mathbb{R}}\big(l_k(y + j/(2\lambda ukc))\,p_k'\big)^2\,dy\Big]
\ge E\Big[\int_{\mathbb{R}}\Big(\sum_{j=0}^{\lambda}(-1)^j C_\lambda^j\, l_k(y + j/(2\lambda ukc))\,p_k'\Big)^2\,dy\Big].
\]
Let $J_{ki} := \big[a + x\beta(\tfrac{i+1/2}{k}) - \tfrac{1}{2\lambda uk},\ a + x\beta(\tfrac{i+1/2}{k}) + \tfrac{1}{2\lambda uk}\big]$, so that
\[
E\Big[\sum_{j=0}^{\lambda}(C_\lambda^j)^2\int_{\mathbb{R}}\big(l_k(y + j/(2\lambda ukc))\,p_k'\big)^2\,dy\Big]
\ge E\Big[\int_{J_{ki}}\Big(\sum_{j=0}^{\lambda}(-1)^j C_\lambda^j\, l_k(y + j/(2\lambda ukc))\,p_k'\Big)^2\,dy\Big].
\]
By the discontinuity assumption, for $y \in J_{ki}$,
\[
\sum_{j=0}^{\lambda}(-1)^j C_\lambda^j\, l_k(y + j/(2\lambda ukc))\,p_k'
= \frac{1}{(uk)^{\lambda-1}}\Big(\frac{c}{(\lambda-1)!}\,x_i p_i + \frac{1}{uk}\sum_{j=1,\,j\neq i}^{k} c_j x_j p_j\Big),
\]
with the $c_j$ uniformly bounded from above, since $f^{(\lambda)}$ is $L_1$-Lipschitz except at $a$.

Notice that the intervals $J_{ki}$ do not intersect one another, so
\[
S(\lambda)\,E\Big[\int_{\mathbb{R}}(l_k p_k')^2\,dy\Big]
\ge S(\lambda)\sum_{i=1}^{k} E\Big[\int_{J_{ki}}(l_k p_k')^2\,dy\Big]
\ge E\Big[\sum_{i=1}^{k}\frac{1}{2uk}\Big\{\frac{1}{(uk)^{\lambda-1}}\Big(\frac{c}{(\lambda-1)!}\,x_i p_i + \frac{c'}{uk}\sum_{j=1,\,j\neq i}^{k} c_{jk}x_j p_j\Big)\Big\}^2\Big].
\]
For $u$ large enough (depending only on $\lambda$, $\sup c_{jk}$ and $c$),
\[
E\Big[\sum_{i=1}^{k}\Big\{\frac{1}{(uk)^{\lambda-1}}\Big(\frac{c}{(\lambda-1)!}\,x_i p_i + \frac{1}{uk}\sum_{j=1,\,j\neq i}^{k} c_{jk}x_j p_j\Big)\Big\}^2\Big]
\ge c(\lambda)\,\frac{1}{k^{2\lambda-1}}\sum_{j=1}^{k}E[x_j p_j]^2 \gtrsim \frac{1}{k^{2\lambda}}\|p\|_2^2,
\]
with a constant $c(\lambda)$ depending only on $\lambda$ and $u$. In addition, from condition C5 we know that $\theta I_k\theta' \gtrsim \|\delta\|_2^2$. Therefore $\theta I_k\theta' \gtrsim \|\delta\|_2^2 + \frac{1}{k^{2\lambda}}\|p\|_2^2 \gtrsim \frac{1}{k^{2\lambda}}\|\theta\|_2^2$. Hence the smallest eigenvalue $r(I_k)$ of $I_k$ is bounded from below by $c/k^{\lambda}$, with a generic constant $c$ depending on $\lambda$, $x$, and the $L_1$-Lipschitz coefficient of $f^{(\lambda)}$ on $\mathbb{R} - [a-\eta, a+\eta]$. □

Proof of Lemma 9.

Proof. The proof of this lemma is based on Chen's (2008) results on the consistency of sieve extremum estimators. Below we recall the five conditions listed in Theorem 3.1 of Chen (2008), which shows that the sieve estimator is consistent as long as these conditions are satisfied.

Condition C11 (Identification). (1) $L(\theta_0) < \infty$, and if $L(\theta_0) = -\infty$ then $L(\theta) > -\infty$ for all $\theta \in \Theta_k \setminus \{\theta_0\}$ and all $k \ge 1$. (2) There are a nonincreasing positive function $\delta(\cdot)$ and a positive function $m(\cdot)$ such that for all $\varepsilon > 0$ and $k \ge 1$,
\[
L(\theta_0) - \sup_{\theta\in\Theta_k:\ d(\theta,\theta_0)\ge\varepsilon} L(\theta) \ge \delta(k)\,m(\varepsilon) > 0.
\]

Condition C12 (Sieve space). $\Theta_k \subset \Theta_{k+1} \subset \Theta$ for all $k \ge 1$, and there exists a sequence $\pi_k\theta_0 \in \Theta_k$ such that $d(\theta_0, \pi_k\theta_0) \to 0$ as $k \to \infty$.

Condition C13 (Continuity). (1) $L(\theta)$ is upper semicontinuous on $\Theta_k$ under the metric $d(\cdot,\cdot)$. (2) $|L(\theta_0) - L(\pi_{k(n)}\theta_0)| = o(\delta(k(n)))$.

Condition C14 (Compact sieve space). $\Theta_k$ is compact under $d(\cdot,\cdot)$.

Condition C15 (Uniform convergence). (1) For all $k \ge 1$, $\operatorname{plim}\sup_{\theta\in\Theta_k}|L_n(\theta) - L(\theta)| = 0$. (2) $\hat c(k(n)) = o_p(\delta(k(n)))$, where $\hat c(k(n)) := \sup_{\theta\in\Theta_k}|L_n(\theta) - L(\theta)|$. (3) $\eta_{k(n)} = o(\delta(k(n)))$.

These conditions are verified as follows. Identification: let $d$ be the metric induced by the $L^2$ norm $\|\cdot\|_2$ defined on $D$. By assumption C4 and the compactness of $D$, $L(\theta_0) - \sup_{\theta\in\Theta:\ d(\theta,\theta_0)\ge\varepsilon} L(\theta) > 0$ for any $\varepsilon > 0$, so letting $\delta(k) = 1$ the identification condition holds. Sieve space: our sieve space satisfies $\Theta_k \subset \Theta_{2k} \subset \Theta$; in general we can consider the sequence $\Theta_{2k(1)}, \Theta_{2k(2)}, \ldots, \Theta_{2k(n)}, \ldots$ instead of $\Theta_1, \Theta_2, \Theta_3, \ldots$, with $k(n)$ an increasing function and $\lim_{n\to\infty}k(n) = \infty$. Continuity: since $f$ is assumed continuous and $L_1$-Lipschitz, (1) is satisfied, and (2) is satisfied by the construction of our sieve space. The compact-sieve-space condition is trivial. The uniform-convergence condition is also easy to verify, since the entropy of the space $\Theta_k$ is finite and uniformly bounded by the entropy of $\Theta$; therefore stochastic equicontinuity holds, i.e., $\sup_{\theta\in\Theta_k}|L_n(\theta) - L(\theta)| = O_p(1/\sqrt{n})$. In (3), $\eta_k$ is the error allowed in the maximization procedure; since $\delta$ is constant, (3) is verified as soon as our maximizer $\hat\theta_k$ satisfies $L_n(\hat\theta_k) - \sup_{\theta\in\Theta_k}L_n(\theta) \to 0$. Hence the sieve MLE is consistent. □

Proof of Lemma 10.

Proof. This result follows directly from Lemma 8. If the neighborhood $N_k$ of $\theta_0$ is chosen as $N_k := \{\theta\in\Theta_k : |\theta - \theta_0|_2 \le \mu_k/k^{\lambda}\}$, with $\mu_k$ a sequence converging to $0$, then the nonlinear term in $E_x\big[\int_Y \frac{(g(y|x,\theta_0) - g(y|x,\theta))^2}{g(y|x,\theta_0)}\,dy\big]$ is dominated by the first-order (quadratic) term $(\theta-\theta_0)\,I_k\,(\theta-\theta_0)'$. □
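Before turning to the proof of Theorem 1, the following minimal numerical sketch illustrates how the grid information matrix $I_k$ appearing in Lemmas 8 and 10 can be formed. It assumes, purely for illustration, a one-dimensional design with $x \equiv 1$, the linear quantile function $\beta(\tau) = \tau$, a Laplace error density (whose first derivative is discontinuous at the origin), and no $\sigma$ block; the model, grid, and function names are our own simplifications, and the sketch only reports the smallest eigenvalue of $I_k$ as the quantile grid is refined, without attempting to verify the rate stated in Lemma 8.

```python
import numpy as np

def laplace_pdf(u, b=1.0):
    return np.exp(-np.abs(u) / b) / (2.0 * b)

def laplace_pdf_prime(u, b=1.0):
    # derivative of the Laplace density; discontinuous at u = 0
    return -np.sign(u) * np.exp(-np.abs(u) / b) / (2.0 * b ** 2)

def grid_information_matrix(k, b=1.0, y_lo=-12.0, y_hi=13.0, n_y=5001):
    """I_k[i, j] = int (dg/dbeta_i)(dg/dbeta_j) / g dy for the grid mixture
    g(y) = (1/k) sum_i f(y - beta(tau_i)), beta(tau) = tau (illustrative model only)."""
    tau = (np.arange(k) + 0.5) / k                       # interior quantile grid
    y = np.linspace(y_lo, y_hi, n_y)
    f_vals = laplace_pdf(y[None, :] - tau[:, None], b)   # k x n_y
    fp_vals = laplace_pdf_prime(y[None, :] - tau[:, None], b)
    g = f_vals.mean(axis=0)                              # mixture density on the grid
    scores = -fp_vals / k                                # dg / dbeta(tau_i)
    w = scores / np.sqrt(g)[None, :]
    dy = y[1] - y[0]
    return (w @ w.T) * dy                                # Riemann approximation of the integrals

for k in (5, 10, 20, 40):
    Ik = grid_information_matrix(k)
    print(f"k = {k:3d}   smallest eigenvalue of I_k = {np.linalg.eigvalsh(Ik).min():.3e}")
```

In this simplified setting $I_k$ is a Gram matrix of the normalized scores, so it is positive definite for every $k$; refining the grid makes neighboring scores increasingly similar, which is the near-collinearity that drives the lower eigenvalue bound in Lemma 8.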
Proof of Theorem 1.

Proof. Suppose $\theta = (\beta(\cdot),\sigma) \in \Theta_k$ is the sieve estimator. By the consistency of the sieve estimator (Lemma 9), $\|\theta - \theta_0\|_2 \to_p 0$. Denote $\tau_i = \frac{i-1}{k}$, $i = 1,2,\ldots,k$. The first-order conditions give
\[
E_n\big[-x f'(y - x\beta(\tau_i)\,|\,\sigma)\big] = 0 \tag{B.10}
\]
and
\[
E_n\big[g_\sigma(y|x,\beta,\sigma)\big] = 0. \tag{B.11}
\]
Assume that we choose $k = k(n)$ such that (1) $k^{\lambda+1}\|\theta - \theta_0\|_2 \to_p 0$ and (2) $k^{r}\|\theta - \theta_0\|_2 \to \infty$ with $p \to 1$, and assume that $r > \lambda + 1$. Such a sequence $k(n)$ exists because Lemma 7 provides an upper bound for $\|\theta - \theta_0\|_2$, namely $\|\theta_0 - \theta\|_2 \lesssim_p n^{-1/(2(2+\lambda))}$.

From the first-order conditions we know that
\[
E_n\Big[\Big(\frac{f_{\tau_1}(y|x,\theta)}{g(y|x,\theta)}, \frac{f_{\tau_2}(y|x,\theta)}{g(y|x,\theta)}, \ldots, \frac{f_{\tau_k}(y|x,\theta)}{g(y|x,\theta)}, \frac{g_\sigma(y|x,\theta)}{g(y|x,\theta)}\Big)\Big] = 0.
\]
Denote $f_{\theta_k} = (f_{\tau_1}(y|x,\theta), f_{\tau_2}(y|x,\theta), \ldots, f_{\tau_k}(y|x,\theta))$. Therefore
\[
0 = E_n\Big[\Big(\frac{f_{\theta_k}}{g}, \frac{g_\sigma}{g}\Big)\Big]
= \Big(E_n\Big[\Big(\frac{f_{\theta_k}}{g}, \frac{g_\sigma}{g}\Big)\Big] - E_n\Big[\Big(\frac{f_{\theta_k,0}}{g_0}, \frac{g_{\sigma,0}}{g_0}\Big)\Big]\Big) + E_n\Big[\Big(\frac{f_{\theta_k,0}}{g_0}, \frac{g_{\sigma,0}}{g_0}\Big)\Big].
\]
The first term can be decomposed as
\[
E_n\Big[\Big(\frac{f_{\theta_k}}{g}, \frac{g_\sigma}{g}\Big)\Big] - E_n\Big[\Big(\frac{f_{\theta_k}}{g_0}, \frac{g_\sigma}{g_0}\Big)\Big]
+ E_n\Big[\Big(\frac{f_{\theta_k}}{g_0}, \frac{g_\sigma}{g_0}\Big) - \Big(\frac{f_{\theta_k,0}}{g_0}, \frac{g_{\sigma,0}}{g_0}\Big)\Big].
\]
By C10, a pointwise CLT holds in the space $\Theta_k$ for $E_n[(\frac{f_{\theta_k}}{g_0}, \frac{g_\sigma}{g_0})]$. Define $\Sigma := U[0,1]\times D$ and consider the mapping $H:\Sigma\mapsto\mathbb{R}$, $H((u,\theta)) = \sqrt{n}\big(E_n[\frac{f_u}{g}] - E[\frac{f_u}{g}]\big)$. By the Donskerness of $D$ and $U[0,1]$, $\Sigma$ is Donsker, and by C10 we have a pointwise CLT for $H((u,\theta))$. By condition C2, the Lipschitz condition guarantees that $E\big[|\frac{f_u}{g}(\theta) - \frac{f_u}{g}(\theta')|^2\big] \lesssim \|\theta - \theta'\|_2^2$. Therefore stochastic equicontinuity holds: for $\gamma$ small enough,
\[
\Pr\Big(\sup_{\{|u-u'|\le\gamma,\ \|\theta-\theta'\|_2\le\gamma,\ \theta,\theta'\in\Theta\}}\big|H((u,\theta)) - H((u',\theta'))\big| \ge \gamma\Big) \to 0.
\]
Similarly, we have stochastic equicontinuity for $\sqrt{n}E_n[\frac{g_\sigma}{g}]$.

Notice that $E[(\frac{f_{\theta_k,0}}{g_0}, \frac{g_{\sigma,0}}{g_0})] = 0$, so by the stochastic equicontinuity above,
\[
E_n\Big[\Big(\frac{f_{\theta_k}}{g}, \frac{g_\sigma}{g}\Big)\Big] - E_n\Big[\Big(\frac{f_{\theta_k,0}}{g_0}, \frac{g_{\sigma,0}}{g_0}\Big)\Big]
= E\Big[\Big(\frac{f_{\theta_k}}{g}, \frac{g_\sigma}{g}\Big) - \Big(\frac{f_{\theta_k,0}}{g_0}, \frac{g_{\sigma,0}}{g_0}\Big)\Big] + o_p\Big(\frac{1}{\sqrt{n}}\Big).
\]
Similarly, for the second term,
\[
E_n\Big[\Big(\frac{f_{\theta_k,0}}{g_0}, \frac{g_{\sigma,0}}{g_0}\Big)\Big] = \frac{1}{\sqrt{n}}\,\mathbb{G}_n(\tau_1,\tau_2,\ldots,\tau_k) + o_p\Big(\frac{1}{\sqrt{n}}\Big) \in \mathbb{R}^{krd_x + d_p},
\]
where $\mathbb{G}_n(\cdot) := \sqrt{n}\,E_n[(\frac{f_{\tau,0}}{g_0}\big|_{0\le\tau\le1}, \frac{g_{\sigma,0}}{g_0})]$, and a functional CLT establishes that this process converges weakly to a tight Gaussian process. Therefore,
\[
E_n\Big[\Big(\frac{f_{\theta_k}}{g}, \frac{g_\sigma}{g}\Big)\Big] - E_n\Big[\Big(\frac{f_{\theta_k,0}}{g_0}, \frac{g_{\sigma,0}}{g_0}\Big)\Big]
= -\frac{1}{\sqrt{n}}\,\mathbb{G}_n(\tau_1,\tau_2,\ldots,\tau_k)(1 + o_p(1)).
\]
By stochastic equicontinuity again, we can replace $f_{\theta_k}$ with $f_{\theta_0}$ and $g_\sigma$ with $g_{\sigma_0}$ on the left-hand side of the above equation, and we get
\[
E\Big[(f_{\theta_0}, g_{\sigma_0})\Big(\frac{g - g_0}{g_0^2}\Big)\Big] = \frac{1}{\sqrt{n}}\,\mathbb{G}_n(\tau_1,\tau_2,\ldots,\tau_k)(1 + o_p(1)), \tag{B.12}
\]
where $g - g_0 = g(y|x,\theta) - g(y|x,\theta_0) := \big(g(y|x,\theta) - g(y|x,\pi_k\theta_0)\big) + \big(g(y|x,\pi_k\theta_0) - g(y|x,\theta_0)\big)$. By the construction of the sieve, $g(y|x,\pi_k\theta_0) - g(y|x,\theta_0) = O(\frac{1}{k^{r+1}})$, which depends only on the order of the spline employed in the estimation. For the first term, $g(y|x,\theta) - g(y|x,\pi_k\theta_0) = (\theta - \pi_k\theta_0)(f_{\tilde\theta}, g_{\tilde\sigma})'$, where $\tilde\theta$ and $\tilde\sigma$ are values between $\theta$ and $\pi_k\theta_0$.

Denote $\tilde I_k = E[(f_{\theta_0}, g_{\sigma_0})'(f_{\tilde\theta}, g_{\tilde\sigma})]$. The eigenvalue norm satisfies $\|\tilde I_k - I_k\|_2 = \|E[(f_{\theta_0}, g_{\sigma_0})'(f_{\tilde\theta} - f_{\theta_0}, g_{\tilde\sigma} - g_{\sigma_0})]\|_2$, while $\|(f_{\tilde\theta} - f_{\theta_0}, g_{\tilde\sigma} - g_{\sigma_0})\|_2 \lesssim \|\theta - \pi_k\theta_0\|_2 + \|\pi_k\theta_0 - \theta_0\|_2 \lesssim \|\theta - \pi_k\theta_0\|_2$. (By the choice of $k$, $\|\theta - \pi_k\theta_0\|_2 \ge \|\theta - \theta_0\|_2 - \|\pi_k\theta_0 - \theta_0\|_2$, and the first term dominates the second because $\|\pi_k\theta_0 - \theta_0\|_2 = O(\frac{1}{k^r})$ and $k^r\|\theta - \theta_0\|_2 \to_p \infty$.) Since $\sup_{x,y}\|(f_{\theta_0}, g_{\sigma_0})\|_2 \lesssim k$, we have $\|\tilde I_k - I_k\|_2 \lesssim k\|\theta - \pi_k\theta_0\|_2$. Denote $H := \tilde I_k - I_k$, so
\[
\|H\|_2 \lesssim k\|\theta - \pi_k\theta_0\|_2 \lesssim k\|\theta - \theta_0\|_2 = o_p\Big(\frac{1}{k^{\lambda}}\Big), \tag{B.13}
\]
and $\tilde I_k^{-1} = (I_k + H)^{-1} = (I + I_k^{-1}H)^{-1}I_k^{-1}$.
By (B.13), $\|I_k^{-1}H\|_2 \le k^{\lambda}\,o_p(1/k^{\lambda}) = o_p(1)$. Therefore $I + I_k^{-1}H$ is invertible and has eigenvalues bounded from above and below uniformly in $k$. Hence the largest eigenvalue of $\tilde I_k^{-1}$ is bounded by $Ck^{\lambda}$ for some generic constant $C$ with $p \to 1$; furthermore, $\tilde I_k^{-1} = I_k^{-1}(1 + o_p(1))$.

Applying (B.12), we get $\theta - \pi_k\theta_0 = I_k^{-1}\frac{1}{\sqrt{n}}\mathbb{G}_n(1 + o_p(1))$, so $\|\theta - \theta_0\|_2 = O_p(\frac{k^{\lambda}}{\sqrt{n}})$. Moreover, under condition C.11, $\|\theta - \theta_0\|_2 \lesssim_p \frac{k^{\lambda-\alpha}}{\sqrt{n}}$.

Recall that at the beginning we assumed (1) $k^{\lambda+1}\|\theta - \theta_0\|_2 \to_p 0$ and (2) $k^{r}\|\theta - \theta_0\|_2 \to \infty$ with $p \to 1$, and that $\|\theta_0 - \pi_k\theta_0\|_2 = O_p(\frac{1}{k^r})$. To verify that such a choice of $k$ is compatible with $\|\theta - \theta_0\|_2 = O_p(\frac{k^{\lambda}}{\sqrt{n}})$, the regularity conditions on $k$ are (a) $\frac{k^{2\lambda+1}}{\sqrt{n}} \to 0$ and (b) $\frac{k^{\lambda+r-\alpha}}{\sqrt{n}} \to \infty$. If $k$ satisfies these growth conditions, the sieve estimator $\theta_k$ satisfies
\[
\theta_k - \theta_0 = (\theta_k - \pi_k\theta_0) + (\pi_k\theta_0 - \theta_0) = O_p\Big(\frac{k^{\lambda}}{\sqrt{n}}\Big) + O\Big(\frac{1}{k^r}\Big) = O_p\Big(\frac{k^{\lambda}}{\sqrt{n}}\Big).
\]
Under condition C.11,
\[
\theta - \pi_k\theta_0 = I_k^{-1}\frac{1}{\sqrt{n}}\,\mathbb{G}_n(1 + o_p(1)) \lesssim_p \frac{k^{\lambda-\alpha}}{\sqrt{n}},
\]
while $\frac{1}{k^r} = o\big(\frac{k^{\lambda-\alpha}}{\sqrt{n}}\big)$ by (b). Therefore
\[
\theta - \theta_0 = I_k^{-1}\frac{1}{\sqrt{n}}\,\mathbb{G}_n(1 + o_p(1)),
\]
and pointwise asymptotic normality holds.

For the parameter $\sigma$, we recall Chen's (2008) result; in fact we use a result slightly stronger than that in Chen (2008), but the reasoning is similar. For $k$ satisfying growth condition (1), we know that $L(\theta_0) - L(\theta) \asymp \|\theta - \theta_0\|_k^2$ for $\theta \in D_k\cap N_k$, where $\|\cdot\|_k$ is the norm induced by $\langle\cdot, I_k\,\cdot\rangle$. For a functional of interest $h:\Theta\mapsto\mathbb{R}$, define the directional derivative
\[
\frac{\partial h}{\partial\theta}[\theta - \theta_0] := \lim_{t\to0}\frac{h(\theta_0 + t[\theta-\theta_0]) - h(\theta_0)}{t}.
\]
Let $\bar V$ be the space spanned by $\Theta - \theta_0$. By the Riesz representation theorem, there exists $v^*\in\bar V$ such that $\frac{\partial h}{\partial\theta}[\theta - \theta_0] = \langle\theta-\theta_0, v^*\rangle$, where $\langle\cdot,\cdot\rangle$ is the inner product induced by $\|\cdot\|$. Below are the five conditions needed for the asymptotic normality of $h(\theta)$.

Condition C16. (i) There is $\omega>0$ such that $|h(\theta) - h(\theta_0) - \frac{\partial h(\theta_0)}{\partial\theta}[\theta-\theta_0]| = O(\|\theta-\theta_0\|^{\omega})$ uniformly in $\theta\in\Theta_n$ with $\|\theta-\theta_0\| = o(1)$; (ii) $\|\frac{\partial h(\theta_0)}{\partial\theta}\| < \infty$; (iii) there is $\pi_n v^*\in\Theta_k$ such that $\|\pi_n v^* - v^*\|\times\|\tilde\theta_n - \theta_0\| = o_p(n^{-1/2})$.

This assumption is easy to verify because our functional $h$ is linear in $\sigma$: $h(\theta) = \delta'\sigma$. Therefore the first and third conditions are automatically satisfied, and the second holds because of assumption C.6. So $v^* = (0, E[g_{\sigma,0}g_{\sigma,0}']^{-1}\delta)$; denote $I_\sigma := E[g_{\sigma,0}g_{\sigma,0}']$.

Let $\varepsilon_n$ denote any sequence satisfying $\varepsilon_n = o(n^{-1/2})$, suppose the sieve estimator converges to $\theta_0$ faster than $\delta_n$, and define the empirical-process operator $\mu_n = E_n - E$.

Condition C17.
\[
\sup_{\theta\in\Theta_{k(n)}:\ \|\theta-\theta_0\|\le\delta_n}\mu_n\Big(l(\theta,Z) - l(\theta\pm\varepsilon_n\pi_n v^*, Z) - \frac{\partial l(\theta_0,Z)}{\partial\theta}[\pm\varepsilon_n\pi_n v^*]\Big) = O_p(\varepsilon_n^2).
\]
This is in fact a stochastic equicontinuity assumption; it can be verified by the P-Donskerness of $\Theta$ and the $L_1$-Lipschitz property assumed on $f$. Denote $K(\theta_1,\theta_2) = L(\theta_1) - L(\theta_2)$.

Condition C18.
\[
E\Big[\frac{\partial l(\tilde\theta_n, Z)}{\partial\theta}[\pi_n v^*]\Big] = \langle\tilde\theta_n - \theta_0, \pi_n v^*\rangle + o(n^{-1/2}).
\]
We know that $E[\frac{\partial l(\tilde\theta_n,Z)}{\partial\theta}[\pi_n v^*]] = E[\frac{g_\sigma(y|x,\tilde\theta_n)}{g}I_\sigma^{-1}\delta] = E[\frac{g_\sigma(y|x,\tilde\theta_n)}{g}]I_\sigma^{-1}\delta$. We also know that $|\tilde\theta_n - \theta_0|_2 = O_p(\frac{k^{\lambda}}{\sqrt{n}})$ and $\frac{k^{2\lambda+1}}{\sqrt{n}}\to0$, so $|\tilde\theta_n - \theta_0|_2^2 = o_p(n^{-(\lambda+1)/(2\lambda+1)}) = o(n^{-1/2})$. Moreover, $E[\frac{g_\sigma(y|x,\tilde\theta_n)}{g_0}] = 0$, so
\[
E\Big[\frac{g_\sigma(y|x,\tilde\theta_n)}{g}\Big] = E\Big[g_\sigma(y|x,\tilde\theta_n)\Big(\frac1g - \frac1{g_0}\Big)\Big].
\]
By the $L_1$-Lipschitz condition, $|g - g_0| \le C_1(x,y)|\theta-\theta_0|_2$ and $|g_\sigma(\cdot\,|\,\tilde\theta_n) - g_\sigma(\cdot\,|\,\theta_0)| \le C_2(x,y)|\theta-\theta_0|_2$, and these coefficients have suitable tail properties. Therefore,
\[
E\Big[g_\sigma(y|x,\tilde\theta_n)\Big(\frac1g - \frac1{g_0}\Big)\Big]
= E\Big[g_\sigma(y|x,\theta_0)\,\frac{\partial g/\partial\theta}{g^2}\,(\tilde\theta_n - \theta_0)\Big] + O\big(|\tilde\theta_n - \theta_0|_2^2\big).
\]
Then
\[
E\Big[\frac{\partial l(\tilde\theta_n, Z)}{\partial\theta}[\pi_n v^*]\Big]
= E\Big[(\tilde\theta_n - \theta_0)'\,\frac{\partial g/\partial\theta}{g^2}\,g_\sigma(y|x,\theta_0)'\,\delta\Big] + o_p(n^{-1/2}),
\]
and the first term on the right-hand side is $\langle\tilde\theta_n - \theta_0, \pi_n v^*\rangle$.

Condition C19. (i) $\mu_n\big(\frac{\partial l(\theta_0,Z)}{\partial\theta}[\pi_n v^* - v^*]\big) = o_p(n^{-1/2})$. (ii) $E\big[\frac{\partial l(\theta_0,Z)}{\partial\theta}[\pi_n v^*]\big] = o(n^{-1/2})$.

(i) is immediate since $v^* = \pi_n v^*$, and (ii) holds since the expectation equals $0$.

Condition C20.
\[
\mathbb{G}_n\Big\{\frac{\partial l(\theta_0,Z)}{\partial\theta}[v^*]\Big\} \to_d N(0, \sigma_v^2), \quad \text{with } \sigma_v^2 > 0.
\]
This is also not hard to verify, because $\frac{\partial l(\theta_0,Z)}{\partial\theta}[v^*]$ is a random variable with variance bounded from above, and it is not degenerate because $E[g_\sigma g_\sigma']$ is positive definite.

Then by Theorem 4.3 in Chen (2008), for any $\delta$, our sieve estimator satisfies
\[
\sqrt{n}\,\delta'(\sigma_n - \sigma_0) \to_d N(0, \sigma_v^2).
\]
Since this conclusion holds for arbitrary $\delta$, $\sqrt{n}(\sigma_n - \sigma_0)$ must be jointly asymptotically normal, i.e., there exists a positive definite matrix $V$ such that $\sqrt{n}(\sigma_n - \sigma_0) \to_d N(0, V)$. □

B.4. Lemmas and Theorems in Appendix I. In this subsection we prove the lemmas and theorems stated in Section 5.

Proof of Theorem 2.

Proof. For simplicity, we assume that $\phi_\varepsilon(s)$ is a real function; the proof is similar when it is complex. Then
\[
F(x\beta, y, \tau, q) = \frac12 - \tau + \int_0^q \frac{\sin((y - x\beta)s)}{\pi s\,\phi_\varepsilon(s)}\,ds.
\]
Under condition OS, $|F(x\beta, y, \tau, q)| \le C(q^{\lambda-1} + 1)$ if $\lambda \ge 1$, for some fixed $C$ and $q$ large enough; under condition SS, $|F(x\beta, y, \tau, q)| \le C_1\exp(q^{\lambda}/C_2)$ for some fixed $C_1$ and $C_2$. Let
\[
h(x,y) := E\Big[\int_0^q \frac{\cos((y - x\beta)s)}{\pi\,\phi_\varepsilon(s)}\,ds\Big].
\]
By changing the order of integration, $E\big[\int_0^q \frac{\cos((y-x\beta)s)}{\pi\phi_\varepsilon(s)}\,ds\big] \le C$. Let $\sigma(x)^2 = E[F(x\beta, y, \tau, q)^2\,|\,x]$. Under condition OS, a similar computation gives $C_1 q^{2(\lambda-1)} \le \sigma(x)^2 \le C_2 q^{2(\lambda-1)}$, and under condition SS, $C_1\exp(q^{\lambda}/C_3) \le \sigma(x)^2 \le C_2\exp(q^{\lambda}/C_4)$, for some fixed $C_1, C_2, C_3, C_4$. For simplicity, let $m(\lambda) = q^{\lambda-1}$ if the OS condition holds, and $m(\lambda) = \exp(q^{\lambda}/C)$ if the SS condition holds with constant $C$.

The estimator $\beta$ solves
\[
E_n\Big[x\,\frac{h(x,y)}{\sigma(x)^2}\,F(x\beta, y, \tau, q)\Big] = 0. \tag{B.14}
\]
Adopting the usual local expansion,
\[
E_n\Big[x\,\frac{h(x,y)}{\sigma(x)^2}\big(F(x\beta, y, \tau, q) - F(x\beta_0, y, \tau, q)\big)\Big]
= -\Big\{E_n\Big[x\,\frac{h(x,y)}{\sigma(x)^2}F(x\beta_0, y, \tau, q)\Big] - E\Big[x\,\frac{h(x,y)}{\sigma(x)^2}F(x\beta_0, y, \tau, q)\Big]\Big\}
- E\Big[x\,\frac{h(x,y)}{\sigma(x)^2}F(x\beta_0, y, \tau, q)\Big]. \tag{B.15}
\]
The left-hand side equals
\[
(\beta - \beta_0)\,E\Big[x'x\,\frac{h(x,y)^2}{\sigma(x)^2}\Big](1 + o_p(1)) = \frac{q^2}{m(\lambda)^2}\,E[l(x,q)\,x'x]\,(\beta - \beta_0)(1 + o_p(1)),
\]
where $l(x,q)$ is a bounded function converging uniformly to $l(x,\infty)$. The first term of the right-hand side is
\[
-\frac{1}{\sqrt{n}}\,\mathbb{G}_n\Big[x\,\frac{h(x,y)}{\sigma(x)}\,\frac{F(x\beta_0, y, \tau, q)}{\sigma(x)}\Big] = O_p\Big(\frac{1}{\sqrt{n}\,m(\lambda)}\Big),
\]
and the second term of the right-hand side is
\[
E\Big[x\,\frac{h(x,y)}{\sigma(x)^2}F(x\beta_0, y, \tau, q)\Big] = O\Big(\frac{q}{m(\lambda)^2}\Big).
\]
Therefore, $\beta - \beta_0 = O_p\big(\frac{m(\lambda)}{\sqrt{n}}\big) + O\big(\frac1q\big)$. Under SS, the optimal $q$ is of order $(\log n)^{1/\lambda}$, and the optimal convergence rate is therefore $(\log n)^{-1/\lambda}$; under OS, the optimal $q = n^{1/(2\lambda)}$, and the optimal convergence rate is $n^{-1/(2\lambda)}$. □
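As a purely illustrative complement to the proof of Theorem 2, the following sketch evaluates the truncated estimating function $F(x\beta, y, \tau, q)$ and the sample moment $E_n[x\,F(x\beta(\tau), y, \tau, q)]$ on simulated data. It assumes a Laplace measurement error with scale $b$, whose characteristic function $\phi_\varepsilon(s) = 1/(1 + b^2s^2)$ is real and ordinary smooth with $\lambda = 2$; the scale, truncation point $q$, data-generating process, and function names are all hypothetical choices for the illustration. In particular, $q$ is not set to the optimal level derived above, and the weighting by $h(x,y)/\sigma(x)^2$ in (B.14) is omitted.

```python
import numpy as np
from scipy.integrate import quad

def F_trunc(xb, y, tau, q, b=0.5):
    """F(xb, y, tau, q) = 1/2 - tau + int_0^q sin((y - xb) s) / (pi * s * phi_eps(s)) ds,
    with phi_eps(s) = 1 / (1 + (b*s)^2) (Laplace measurement error, OS with lambda = 2)."""
    def integrand(s):
        if s == 0.0:
            return (y - xb) / np.pi          # finite limit of the integrand at s = 0
        return np.sin((y - xb) * s) * (1.0 + (b * s) ** 2) / (np.pi * s)
    val, _ = quad(integrand, 0.0, q, limit=500)
    return 0.5 - tau + val

# Simulated data from y = x * beta0(u) + eps, u ~ U[0, 1]  (hypothetical DGP).
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0.5, 1.5, n)
u = rng.uniform(size=n)
eps = rng.laplace(scale=0.5, size=n)
y = x * (1.0 + u) + eps                      # beta0(tau) = 1 + tau, so beta0(0.5) = 1.5

tau, q = 0.5, 8.0
for b_trial in (1.3, 1.5, 1.7):              # trial values of beta(tau); the truth here is 1.5
    moment = np.mean([xi * F_trunc(xi * b_trial, yi, tau, q) for xi, yi in zip(x, y)])
    print(f"beta(0.5) = {b_trial:.1f}   E_n[x F] = {moment:+.4f}")
```

Replacing the Laplace characteristic function with a supersmooth one (e.g., Gaussian) would call for the logarithmic truncation levels discussed in the proof.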