ERRORS IN THE DEPENDENT VARIABLE
OF QUANTILE REGRESSION MODELS
JERRY HAUSMAN, YE LUO, AND CHRISTOPHER PALMER
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
APRIL 2014
*PRELIMINARY AND INCOMPLETE*
*PLEASE DO NOT CITE OR CIRCULATE WITHOUT PERMISSION*
Abstract. The usual quantile regression estimator of Koenker and Bassett (1978) is biased
if there is an additive error term in the dependent variable. We analyze this problem as
an errors-in-variables problem where the dependent variable suffers from classical measurement error and develop a sieve maximum-likelihood approach that is robust to left-hand side
measurement error. After describing sufficient conditions for identification, we employ global
optimization algorithms (specifically, a genetic algorithm) to solve the MLE optimization
problem. We also propose a computationally tractable EM-type algorithm that accelerates
the computational speed in the neighborhood of the optimum. When the number of knots
in the quantile grid is chosen to grow at an adequate speed, the sieve maximum-likelihood
estimator is asymptotically normal. We verify our theoretical results with Monte Carlo simulations and illustrate our estimator with an application to the returns to education highlighting
important changes over time in the returns to education that have been obscured in previous
work by measurement-error bias.
Keywords: Measurement Error, Quantile Regression, Functional Analysis, EM Algorithm
Hausman: Professor of Economics, MIT Economics; jhausman@mit.edu. Luo: PhD Candidate, MIT Economics;
lkurt@mit.edu. Palmer: PhD Candidate, MIT Economics; cjpalmer@mit.edu.
We thank Victor Chernozhukov and Kirill Evdokimov for helpful discussions, as well as workshop participants
at MIT. Cuicui Chen provided research assistance.
1. Introduction
In this paper, we study the consequences of measurement error in the dependent variable
of conditional quantile models and propose an adaptive EM algorithm to consistently estimate
the distributional effects of covariates in such a setting. The errors-in-variables (EIV) problem
has received significant attention in the linear model, including the well-known results that
classical measurement error causes attenuation bias if present in the regressors and has no effect
on unbiasedness if present in the dependent variable. See Hausman (2001) for an overview.
In general, the linear model results do not hold in nonlinear models.1 We are particularly
interested in the linear quantile regression setting.2 Hausman (2001) observes that EIV in the
dependent variable in quantile regression models generally leads to significant bias, a result
very different from the linear model intuition.
In general, EIV in the dependent variable can be viewed as a mixture model.3 One way to
estimate a mixture distribution is to use the expectation-maximization (EM) algorithm.4 We
show that under certain discontinuity assumptions, by choosing the growth rate of the number
of knots in the quantile grid appropriately, our estimator converges at a fractional polynomial rate in n
and is asymptotically normal. We suggest using the bootstrap for inference. However, in practice
we find that the EM algorithm is only feasible in a local neighborhood of the true parameter.
We therefore recommend global optimization algorithms, such as the genetic algorithm, to
compute the sieve ML estimator.
Intuitively, the estimated quantile regression line x_i β̂(τ) for quantile τ may be far from the
observed y_i because of LHS measurement error or because the unobserved conditional quantile
u_i of observation i is far from τ. Our MLE framework, implemented via an EM algorithm,
effectively estimates the likelihood that a given ε_{ik} is large because of measurement error rather
than an unobserved conditional quantile that is far away from τ_k. The estimate of the joint
distribution of the conditional quantile and the measurement error allows us to weight the log-
1 Schennach (2008) establishes identification and a consistent nonparametric estimator when EIV exists in an
explanatory variable. Studies focusing on nonlinear models in which the left-hand side variable is measured
with error include Hausman et al. (1998) and Cosslett (2004), who study probit and tobit models, respectively.
2 Carroll and Wei (2009) propose an iterative estimator for quantile regression when one of the regressors
has EIV.
3 A common feature of mixture models under a semiparametric or nonparametric framework is the ill-posed
inverse problem; see Evdokimov (2010). We face the ill-posed problem here, and our model specifications are
linked to the Fredholm integral equation of the first kind. The inverse of such integral equations is usually
ill-posed even if the integral kernel is positive definite. The key symptom of these model specifications is that
the high-frequency signal of the object we are interested in is wiped out, or at least shrunk, by the unknown
noise if its distribution is smooth. Recovering these signals is difficult, and all feasible estimators have a slower
speed of convergence compared to the usual √n case. The convergence speed of our estimator relies on the decay
speed of the eigenvalues of the integral operator. We explain this technical problem in more detail in the related
section of this paper.
4 Dempster, Laird and Rubin (1977) show that the EM algorithm performs well in finite mixture models when
the EIV follows some (unknown) parametric family of distributions. This algorithm can be extended to infinite
mixture models.
likelihood contribution of observation i more heavily in the estimation of β(τ_k) when observation i is likely
to have u_i ≈ τ_k. In the case of Gaussian errors in variables, this estimator reduces to weighted
least squares, with weights equal to the probability of observing the quantile-specific residual
for a given observation as a fraction of the total probability of the same observation's residuals
across all quantiles.
An empirical example (extending Angrist et al., 2006) studies the heterogeneity of returns
to education across conditional quantiles of the wage distribution. We find that when we
correct for likely measurement error in the self-reported wage data, we estimate considerably
more heterogeneity across the wage distribution in the returns to education. In particular, the
education coefficient for the bottom of the wage distribution is lower than previously estimated,
and the returns to education for latently high-wage individuals have been increasing over time
and is much higher than previously estimated. By 2000, the returns to education for the top of
the conditional wage distribution are over three times larger than returns for any other segment
of the distribution.
The rest of the paper proceeds as follows. In Section 2, we introduce the model specification and
identification conditions. In Section 3, we consider the MLE estimation method and analyze its
properties. In Sections 4 and 5, we discuss sieve estimation and the EM algorithm. We present
Monte Carlo simulation results in Section 6, and Section 7 contains our empirical application.
Section 8 concludes. The Appendix contains an extension using the deconvolution method and
additional proofs.
Notation: Define the domain of x as 𝒳 and the space of y as 𝒴. Denote by a ∧ b the
minimum of a and b, and by a ∨ b the maximum of a and b. Let →_d denote weak convergence
(convergence in distribution), →_p convergence in probability, and →_{d*} weak convergence in
outer probability. Let f(ε|σ) be the p.d.f. of the EIV ε parametrized by σ. Assume the true
parameters are β0(·) and σ0 for the coefficient function of the quantile model and the parameter
of the density of the EIV. Let d_x be the dimension of x. Let Ω be the domain of σ and d_σ the
dimension of the parameter σ. Define ||(β0, σ0)|| := (||β0||_2^2 + ||σ0||_2^2)^{1/2} as the L2 norm
of (β0, σ0), where ||·||_2 is the usual Euclidean norm. For β_k ∈ R^k, define
||(β_k, σ0)||_2 := (||β_k||_2^2 / k + ||σ0||_2^2)^{1/2}.
2. Model and Identification
We consider the standard linear conditional quantile model, where the τ th quantile of the
dependent variable y ∗ is a linear function of x
Qy∗ (τ |x) = xβ(τ ).
However, we are interested in the situation where y ∗ is not directly observed, and we instead
observe y where
y = y∗ + ε
Figure 1. Check Function ρ_τ(z). [Figure: the check function, with slopes labeled τ and 1 − τ on either side of zero.]
and ε is a mean-zero, i.i.d. error term independent of y*, x, and τ.
Unlike the linear regression case where EIV in the left hand side variable does not matter
for consistency and asymptotic normality, EIV in the dependent variable can lead to severe
bias in quantile regression. More specifically, with ρτ (z) denoting the check function (plotted
in Figure 1)
ρ_τ(z) = z(τ − 1(z < 0)),

the minimization problem in the usual quantile regression,

    β(τ) ∈ arg min_b E[ρ_τ(y − xb)],    (2.1)

is generally no longer minimized at the true β0(τ) when EIV exists in the dependent variable.
When there is no EIV in the left-hand side variable, i.e., y* is observed, the FOC is

    E[x(τ − 1(y* < xβ(τ)))] = 0,    (2.2)

where the true β(τ) is the solution to this system of FOCs, as shown by Koenker
and Bassett (1978). However, with left-hand side EIV, the FOC determining β̂(τ)
becomes

    E[x(τ − 1(y* + ε < xβ(τ)))] = 0.    (2.3)
For τ �= .5, the presence of ε will result in the FOC being satisfied at a different estimate of
β than in equation (2.2) even in the case where ε is symmetrically distributed because of the
asymmetry of the check function. In other words, in the minimization problem, observations for
4
HAUSMAN, LUO, AND PALMER
which y ∗ ≥ xβ(τ ) and should therefore get a weight of τ may end up on the left-hand side of the
check function, receiving a weight of (1−τ ) . Thus, equal-sized differences on either side of zero
do not cancel each other out. For median regression, τ = .5 and ρ_{.5}(·) is symmetric around
zero. This means that if ε is symmetrically distributed and β(τ) is symmetrically distributed
around τ = .5 (as would be the case, for example, if β(τ) were linear in τ), the expectation in
equation (2.3) holds for the true β0 (τ ). However, for non-symmetric ε, equation (2.3) is not
satisfied at the true β0 (τ ).
Below we present a simple Monte Carlo example to demonstrate the bias in quantile
regression when there are random disturbances in the dependent variable y.
Example. The Data-Generating Process for the Monte-Carlo results is
yi = β0 (ui ) + x1i β1 (ui ) + x2i β2 (ui ) + εi
with the measurement error εi distributed as N (0, σ 2 ) and the unobserved conditional quantile
u_i of observation i following u_i ∼ U[0, 1]. The coefficient function β(τ) has components β0(τ) =
0, β1(τ) = exp(τ), and β2(τ) = √τ. The variables x1 and x2 are drawn from independent lognormal
distributions LN(0, 1). The number of observations is 1,000.
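As an illustration, the following is a minimal Python sketch (not the authors' code) that draws data from this DGP with Var(ε) = 16 and runs standard quantile regression by directly minimizing the check loss; comparing the fits on the noisy y and the clean y* reproduces the bias pattern reported in Table 1.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 1_000
x1, x2 = np.exp(rng.normal(size=(2, n)))           # lognormal LN(0, 1) covariates
u = rng.uniform(size=n)                            # unobserved conditional quantile u_i
y_star = 0.0 + x1 * np.exp(u) + x2 * np.sqrt(u)    # beta0 = 0, beta1 = exp(u), beta2 = sqrt(u)
eps = rng.normal(scale=4.0, size=n)                # LHS measurement error, Var = 16
y = y_star + eps
X = np.column_stack([np.ones(n), x1, x2])

def check_loss(b, y, X, tau):
    """Koenker-Bassett objective E_n[rho_tau(y - x b)]."""
    z = y - X @ b
    return np.mean(z * (tau - (z < 0)))

for tau in (0.1, 0.5, 0.9):
    b_noisy = minimize(check_loss, np.zeros(3), args=(y, X, tau),
                       method="Nelder-Mead", options={"maxiter": 5000}).x
    b_clean = minimize(check_loss, np.zeros(3), args=(y_star, X, tau),
                       method="Nelder-Mead", options={"maxiter": 5000}).x
    print(tau, b_noisy.round(3), b_clean.round(3))  # slope bias appears away from tau = 0.5
```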
Table 1 presents Monte Carlo results for three cases: no measurement error, and measurement error
with the variance of ε equal to 4 and to 16. The simulation results show that in the presence
of measurement error, the quantile regression estimator is severely biased.5 Comparing the
mean bias as the variance of the measurement error increases from 4 to 16 shows that the
bias is increasing in the variance of the measurement error. Intuitively, the information about the
functional parameter β(·) decays as the variance of the EIV becomes larger.
2.1. Identification and Regularity Conditions. In the linear quantile model, it is assumed
that for any x ∈ 𝒳, xβ(τ) is increasing in τ. Suppose x1, ..., x_{d_x} are d_x linearly independent
vectors in int(𝒳), so that Q_{y*}(τ|x_i) = x_i β(τ) must be strictly increasing in τ. Consider
the linear transformation of the model with matrix A = [x1, ..., x_{d_x}]′:

    Q_{y*}(τ|x) = xβ(τ) = (xA⁻¹)(Aβ(τ)).    (2.4)

Let x̃ = xA⁻¹ and β̃(τ) = Aβ(τ). The transformed model becomes

    Q_{y*}(τ|x̃) = x̃ β̃(τ),    (2.5)

with every coefficient β̃_i(τ) weakly increasing in τ for 1 ≤ i ≤ d_x. Therefore, WLOG,
we can assume that the coefficients β(τ) are increasing. Such β_i(τ), 1 ≤ i ≤ d_x, are called co-
monotonic functions. From now on, we assume that each β_i(τ) is an increasing function which
has the properties of β̃_i(τ). All of the convergence and asymptotic results for β̃_i(τ) hold for the
parameter β_i after the inverse transform A⁻¹.
5 A notable exception is median regression. We find that for symmetrically distributed EIV
and uniformly distributed β(τ), the median regression results are unbiased.
Table 1. Monte-Carlo Results: Mean Bias

                                            Quantile (τ)
Parameter        EIV Distribution      0.1     0.25    0.5     0.75    0.9
β1(τ) = e^τ      ε = 0                 0.006   0.003   0.002   0.000  -0.005
                 ε ~ N(0, 4)           0.196   0.155   0.031  -0.154  -0.272
                 ε ~ N(0, 16)          0.305   0.246   0.054  -0.219  -0.391
                 True parameter:       1.105   1.284   1.649   2.117   2.460
β2(τ) = √τ       ε = 0                 0.000  -0.003  -0.005  -0.006  -0.006
                 ε ~ N(0, 4)           0.161   0.068  -0.026  -0.088  -0.115
                 ε ~ N(0, 16)          0.219   0.101  -0.031  -0.128  -0.174
                 True parameter:       0.316   0.500   0.707   0.866   0.949

Notes: Table reports mean bias (across 500 simulations) of slope coefficients estimated for each
quantile τ from standard quantile regression of y on a constant, x1, and x2, where
y = x1β1(τ) + x2β2(τ) + ε and ε is either zero (no measurement error, i.e., y* is observed) or
normally distributed with variance 4 or 16. The covariates x1 and x2 are i.i.d. draws from LN(0, 1).
N = 1,000.
Condition C 1 (Properties of β(·)). We assume the following properties on the coefficient
vectors β(τ ):
(1) β(τ ) is in the space M [B1 × B2 × B3 ... × Bdx ] where the functional space M is defined as
the collection of all functions f = (f1 , ..., fdx ) : [0, 1] → [B1 ×...×Bdx ] with Bi ⊂ R being
a closed interval ∀ i ∈ {1, ..., dx } such that each entry fi : [0, 1] → Bi is monotonically
increasing in τ .
(2) Let Bi = [li , ui ] so that li < β0i (τ ) < ui ∀ i ∈ {1, ..., dx } and τ ∈ [0, 1].
(3) β0 is a vector of C 1 functions with derivative bounded from below by a positive constant
∀ i ∈ {1, ..., dx }.
(4) The domain of the parameter σ is a compact space Ω and the true value σ0 is in the
interior of Ω.
Under Assumption C1, it is easy to see that the parameter space D := M × Ω is compact.

Lemma 1. The space M[B1 × B2 × ... × B_{d_x}] is a compact and complete space under L^p, for
any p ≥ 1.

Proof. See Appendix B.1. □
Monotonicity of β_i(·) is important for identification because the density entering the log-likelihood,
f(y|x) = ∫_0^1 f(y − xβ(u)|σ) du, is invariant when the distribution of the random variable β(u) is
invariant. The function β(·) is therefore unidentified if we do not impose further restrictions.
Given the distribution of the random vector {β(u) : u ∈ [0, 1]} in B1 × ... × B_{d_x}, the vector of functions
β : [0, 1] → B1 × B2 × ... × B_{d_x} is unique under rearrangement if the functions β_i,
1 ≤ i ≤ d_x, are co-monotonic.
Condition C2 (Properties of x). We assume the following properties of the design matrix x:
(1) E[x� x] is non-singular.
(2) The domain of x, denoted 𝒳, is continuous in at least one dimension, i.e., there
exists k ∈ {1, ..., d_x} such that for every feasible x_{−k}, there is an open set X_k ⊂ R such
that (X_k, x_{−k}) ⊂ 𝒳.
(3) Without loss of generality, βk (0) ≥ 0.
Condition C3 (Properties of EIV). We assume the following properties of the measurement
error ε:
(1) The probability density f(ε|σ) is differentiable in σ.
(2) For all σ ∈ Ω, there exists a uniform constant C > 0 such that E[|log f(ε|σ)|] < C.
(3) f(·) is non-zero over the whole space R, and bounded from above.
(4) E[ε] = 0.
(5) Denote φ(s|σ) := ∫_{−∞}^{∞} exp(isε) f(ε|σ) dε as the characteristic function of ε.
(6) Assume that for any σ1 and σ2 in the domain Ω of σ, there exists a neighborhood
of 0 in which φ(s|σ1)/φ(s|σ2) can be expanded as 1 + Σ_{k=2}^{∞} a_k (is)^k.
Denote θ := (β(·), σ) ∈ D. For any θ, define the expected log-likelihood function L(θ) as
follows:

    L(θ) = E[log g(y|x, θ)],    (2.6)

where the density function g(y|x, θ) is defined as

    g(y|x, θ) = ∫_0^1 f(y − xβ(u)|σ) du.    (2.7)

And define the empirical likelihood as

    L_n(θ) = E_n[log g(y|x, θ)].    (2.8)
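For concreteness, a minimal sketch of this objective is given below, assuming Gaussian EIV f(·|σ) = N(0, σ²) and a piecewise-constant β(·) on K equally spaced knots (the sieve introduced in Section 4), so that the τ-integral in (2.7) becomes an average over the quantile grid; the function names are ours, not the paper's.

```python
import numpy as np
from scipy.stats import norm

def empirical_log_likelihood(beta_knots, sigma, y, X):
    """L_n(theta) = E_n[log g(y|x, theta)] with g approximated by
    (1/K) * sum_k f(y - x beta(tau_k) | sigma); beta_knots has shape (K, d_x)."""
    resid = y[:, None] - X @ beta_knots.T           # (n, K): y_i - x_i beta(tau_k)
    g = norm.pdf(resid, scale=sigma).mean(axis=1)   # average over the quantile grid
    return np.mean(np.log(np.clip(g, 1e-300, None)))
```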
The main identification results rely on the monotonicity of xβ(τ ). The global identification
condition is the following:
Condition C4 (Identification). There does not exist (β1, σ1) ≠ (β0, σ0) in the parameter space D
such that g(y|x, β1, σ1) = g(y|x, β0, σ0) for all (x, y) with positive continuous density or positive
mass.
Lemma 2 (Nonparametric Global Identification). Under conditions C1-C3, for any β(·) and
f(·) that generate the same density of y|x almost everywhere as the true β0(·) and
f0(·), it must be that

    β(τ) = β0(τ)  and  f(ε) = f0(ε).

Proof. See Appendix B.1. □
We also summarize the local identification condition as follows:

Lemma 3 (Local Identification). Define p(·) and δ as deviations of a given β or σ from the
truth: p(τ) = β(τ) − β0(τ) and δ = σ − σ0. Then there does not exist a function p(τ) ∈ L2[0, 1]
and δ ∈ R^{d_σ} such that for almost all (x, y) with positive continuous density or positive mass

    ∫_0^1 f_y(y − xβ(τ)|σ) x p(τ) dτ = ∫_0^1 δ′ f_σ(y − xβ(τ)) dτ,

except p(τ) = 0 and δ = 0.

Proof. See Appendix B.1. □
Condition C5 (Stronger local identification for σ). For any δ ∈ S^{d_σ−1},

    inf_{p(τ) ∈ L2[0,1]} ( ∫_0^1 f_y(y − xβ(τ)|σ) x p(τ) dτ − ∫_0^1 δ′ f_σ(y − xβ(τ)) dτ )^2 > 0.    (2.9)
3. Maximum Likelihood Estimator

3.1. Consistency. The ML estimator is defined as:

    (β̂(·), σ̂) ∈ arg max_{(β(·),σ)∈D} E_n[log g(y|x, β(·), σ)].    (3.1)

The following theorem states the consistency property of the ML estimator.

Lemma 4 (MLE Consistency). Under conditions C1-C3, the random coefficients β(τ) and the
parameter σ that determines the distribution of ε are identified in the parameter space D. The
maximum-likelihood estimator

    (β̂(·), σ̂) ∈ arg max_{(β(·),σ)∈D} E_n[ log ∫_0^1 f(y − xβ(τ)|σ) dτ ]

exists and converges to the true parameter (β0(·), σ0) under the L^∞ norm in the functional
space M and the Euclidean norm in Ω with probability approaching 1.

Proof. See Appendix B.2. □
This consistency result is a special case of a general MLE consistency theorem (Van der
Vaart, 2000). Two conditions play critical roles here: the co-monotonicity of the β(·) function
and the local continuity of at least one right-hand side variable. If we do not restrict the
estimator to the family of monotone functions, then we lose compactness of the parameter
space D and the consistency argument fails.
3.2. Ill-posed Fredholm Integration of the First Kind. In the usual MLE setting with
a finite-dimensional parameter, the Fisher information matrix I is defined as

    I := E[ (∂f/∂θ)(∂f/∂θ)′ ].

In our case, since the parameter is continuous, the information kernel I(u, v) is defined as6

    I(u, v)[p(v), δ] = E[ ( f_{β(u)}/g , g_σ/g )′ ( ∫_0^1 (f_{β(v)}/g) p(v) dv + (g_σ′/g) δ ) ].    (3.2)

If we assume that we know the true σ, i.e., δ = 0, then the information kernel becomes

    I0(u, v) := E[ (f_{β(u)}/g)(f_{β(v)}/g)′ ].

By condition C5, the eigenvalues of I and I0 are both non-zero.
Since E[X′X] < ∞, I0(u, v) is a compact (Hilbert-Schmidt) operator. From the Riesz
Representation Theorem, I0(u, v) has countable eigenvalues λ1 ≥ λ2 ≥ ... > 0 with 0 being the
only limit point of this sequence. It can be written in the following form:

    I0(·) = Σ_{i=1}^{∞} λ_i ⟨ψ_i, ·⟩ ψ_i,

where {ψ_i}, i = 1, 2, ..., is a system of orthogonal basis functions. For any function s(·),
∫_0^1 I0(u, v) h(v) dv = s(u) is called a Fredholm integral equation of the first kind. Thus, the
first-order condition of the ML estimator is ill-posed. In the next subsection we establish
convergence rate results for the ML estimator using a deconvolution method.7
Although the estimation problem is ill-posed for the function β, it is
not ill-posed for the finite-dimensional parameter σ, given that E[g_σ g_σ′] is non-singular. In the
Lemma below, we show that σ̂ converges to σ0 at rate 1/√n.

Lemma 5 (Estimation of σ). If E[g_σ g_σ′] is positive definite and conditions C1-C5 hold, the ML
estimator σ̂ has the following property:

    σ̂ − σ0 = O_p(1/√n).    (3.3)

Proof. See Appendix B.2. □
3.3. Bounds for the Maximum Likelihood Estimator. In this subsection, we use the
deconvolution method to establish bounds for the maximum likelihood estimator.

6 For notational convenience, we abbreviate ∂f(y − xβ0(τ)|σ0)/∂β(τ) as f_{β(τ)}, g(y|x, β0(τ), σ0) as g, and ∂g/∂σ as g_σ.
7 Kuhn (1990) shows that if the integral kernel is positive definite with smoothness degree r, then the eigenvalue
λ_i decays at speed O(i^{−r−1}). However, it is generally more difficult to obtain a lower bound for the
eigenvalues. The decay speed of the eigenvalues of the information kernel is essential for obtaining the convergence
rate, and from the Kuhn (1990) result we see that the decay speed is linked to the degree of smoothness of
the function f: the less smooth f is, the slower the decay. Later in the paper we show that by assuming some
discontinuity conditions, we can obtain a polynomial rate of convergence.

Recall that
the maximum likelihood estimator is the solution

    (β̂, σ̂) = arg max_{(β,σ)∈D} E_n[log(g(y|x, β, σ))].

Define χ(s0) = sup_{|s|≤s0} |1/φ_ε(s|σ0)|. We use the following smoothness assumptions on the
distribution of ε (see Evdokimov, 2010).
Condition C6 (Ordinary Smoothness (OS)). χ(s0) ≤ C(1 + |s0|^λ).

Condition C7 (Super Smoothness (SS)). χ(s0) ≤ C1(1 + |s0|^{C2}) exp(|s0|^λ / C3).

The Laplace distribution L(0, b) satisfies the Ordinary Smoothness (OS) condition with λ = 2.
The chi-squared distribution χ²_k satisfies the OS condition with λ = k/2. The Gamma distribution
Γ(k, θ) satisfies the OS condition with λ = k. The Exponential distribution satisfies condition
OS with λ = 1. The Cauchy distribution satisfies the Super Smoothness (SS) condition with
λ = 1. The Normal distribution satisfies the SS condition with λ = 2.
Lemma 6 (MLE Convergence Speed). Under assumptions C1-C5 and the results in Lemma 4,
(1) if Ordinary Smoothness holds (C6), then the ML estimator β̂ satisfies, for all τ,

    |β̂(τ) − β0(τ)| ≲ n^{−1/(2(1+λ))};

(2) if Super Smoothness holds (C7), then the ML estimator β̂ satisfies, for all τ,

    |β̂(τ) − β0(τ)| ≲ log(n)^{−1/λ}.

Proof. See Appendix B.2. □
4. Sieve estimation
In the last section we demonstrated that the maximum likelihood estimator restricted to the
parameter space D converges to the true parameter with probability approaching 1. However,
the estimator still lives in a large space, with β(·) a d_x-dimensional vector of co-monotone functions
and σ a finite-dimensional parameter. Although such an estimator exists in theory, in practice
it is computationally infeasible to search for the maximizer within this large
space. Instead, we consider a series estimator to mimic the co-monotone functions β(·).
There are many possible choices of series, for example, trigonometric series, (Chebyshev) power series,
B-splines, etc. In this paper, we choose splines because of their computational advantages
in calculating the Sieve estimator used in the EM algorithm. For simplicity, we use the piecewise-
constant Sieve space. Below we introduce the formal definition of our Sieve space.
Definition 1. Define D_k = Θ_k × Ω, where Θ_k stands for the set of increasing piecewise-constant
functions from [0, 1] to R^{d_x} with knots at i/k, i = 0, 1, ..., k − 1. In other words, for any β ∈ Θ_k,
β_j is a piecewise-constant function on the intervals [i/k, (i+1)/k) for 0 ≤ i ≤ k − 1 and 1 ≤ j ≤ d_x.
Furthermore, let P_k denote the product space of all d_x-dimensional piecewise-constant functions and R^{d_σ}.
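One convenient way to impose the monotonicity in this definition in computation is to parametrize each coordinate of β by a free level plus nonnegative increments, as in the sketch below; this particular parametrization is our illustration, not a construction from the paper.

```python
import numpy as np

def knots_from_free_params(theta_free, k, d_x):
    """Map an unconstrained vector of length k*d_x to a (k, d_x) array of
    increasing knot values beta_j(tau_i), tau_i = i/k (co-monotone in tau)."""
    raw = theta_free.reshape(k, d_x)
    levels = raw[0]                        # beta_j at the first knot
    increments = np.exp(raw[1:])           # positive jumps between consecutive knots
    return np.vstack([levels, levels + np.cumsum(increments, axis=0)])
```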
We know that the L2 distance of the space D_k to the true parameter θ0 satisfies d_2(θ0, D_k) ≤
C/k for some generic constant C. The Sieve estimator is defined as follows:
Definition 2 (Sieve Estimator).

    (β_k, σ_k) = arg max_{(β,σ)∈D_k} E_n[log g(y|x, β, σ)].    (4.1)
Let d(θ1, θ2)^2 := E[ ∫_R (g(y|x, θ1) − g(y|x, θ2))^2 / g(y|x, θ0) dy ] be a pseudo-metric on the
parameter space D. We know that d(θ, θ0) ≤ L(θ0|θ0) − L(θ|θ0).
Let ||·||_d be a norm such that ||θ0 − θ||_d ≤ C d(θ0, θ). Our metric ||·||_d here is chosen to be:

    ||θ||_d^2 := ⟨θ, I(u, v)[θ]⟩.    (4.2)

By Condition C4, ||·||_d is indeed a metric.
For the Sieve space D_k and any θ_k ∈ D_k, define Ĩ as the following matrix:

    Ĩ := E[ w w′ ],  where  w := ( ∫_0^{1/k} f_τ dτ, ∫_{1/k}^{2/k} f_τ dτ, ..., ∫_{(k−1)/k}^{1} f_τ dτ, g_σ )′ / g.

Again, by Condition C4, this matrix Ĩ is non-singular. Furthermore, if a certain discontinuity
condition is assumed, the smallest eigenvalue of Ĩ can be proved to be bounded from below by
some polynomial of 1/k.
Condition C8 (Discontinuity of f). Suppose there exists a positive integer λ such that f ∈
C^{λ−1}(R), and the λth order derivative of f equals

    f^{(λ)}(x) = h(x) + δ(x − a),    (4.3)

with h(x) a bounded function that is L1-Lipschitz except at a, and δ(x − a) a Dirac
δ-function at a.

Remark. The Laplace distribution satisfies the above assumption with λ = 1. There is an
intrinsic link between the above discontinuity condition and the tail property of the characteristic
function stated in the (OS) condition. Because φ_{f′}(s) = is φ_f(s), we know that
φ_f(s) = φ_{f^{(λ)}}(s)/(is)^λ, while φ_{f^{(λ)}}(s) = O(1) under the above assumption. Therefore, assumption
C8 implies that χ_f(s) := sup |1/φ_f(s)| ≤ C(1 + s^λ). In general, these results hold as
long as the number of Dirac functions in f^{(λ)} in (4.3) is finite.
The following Lemma establishes the decay speed of the minimum eigenvalue of Ĩ.

Lemma 7. If the function f satisfies condition C8 with degree λ > 0, then
(1) the minimum eigenvalue of Ĩ, denoted r(Ĩ), has the following property:

    1/k^λ ≲ r(Ĩ);

(2)

    1/k^λ ≲ sup_{θ ∈ P_k} ||θ||_d / ||θ||.

Proof. See Appendix B.3. □
The following Lemma establishes the consistency of the Sieve estimator.

Lemma 8 (Sieve Estimator Consistency). If conditions C1-C6 and C9 hold, the Sieve estimator
defined in (4.1) is consistent.

Given the identification assumptions in the last section, if the problem is identified, then
there exists a neighborhood N of θ0 such that for any θ ∈ N, θ ≠ θ0, we have:

    L(θ0|θ0) − L(θ|θ0) ≥ E_x[ ∫_𝒴 (g(y|x, θ0) − g(y|x, θ))^2 / g(y|x, θ0) dy ] > 0.    (4.4)

Proof. See Appendix B.3. □
Unlike the usual Sieve estimation problem, our problem is ill-posed, with eigenvalues decaying at
speed k^{−λ}. However, the curse of dimensionality does not arise because of co-monotonicity:
all entries of the vector of functions β(·) are functions of a single variable τ. Therefore it is
possible to use Sieve estimation to approximate the true functional parameter with k
growing slower than √n.
We summarize the tail property of f in the following condition:

Condition C9. For any θ_k ∈ Θ_k, assume there exists a generic constant C such that
Var((f_{β_k}/g_0, g_σ/g_0)) < C for any β_k and σ in a fixed neighborhood of θ0.

Remark. The above condition holds for the Normal, Laplace, and Beta distributions.
Condition C10 (Eigenvector of Ĩ). Suppose the smallest eigenvalue of Ĩ is r(Ĩ) = c_k k^{−λ}. Suppose
v is a normalized eigenvector associated with r(Ĩ), and that k^{−α} ≲ min_i |v_i| for some fixed α ≥ 1/2.

Theorem 1. Under conditions C1-C5 and C8-C10, the following results hold for the Sieve-ML
estimator:
(1) If the number of knots k satisfies the following growth conditions:
    (a) k^{2λ+1}/√n → 0,
    (b) k^{λ+r}/√n → ∞,
then ||θ − θ0|| = O_p(k^λ/√n).
(2) If condition C10 holds and the following growth conditions hold:
    (a) k^{2λ+1}/√n → 0,
    (b) k^{λ+r−α}/√n → ∞,
then (a) for every i = 1, ..., k, there exists a number µ_{ijk} such that µ_{ijk}/k^r → 0,
k^{λ−α}/µ_{ijk} = O(1), and

    µ_{ijk}(β_{k,j}(τ_i) − β_{0,j}(τ_i)) →_d N(0, 1);

and (b) for the parameter σ, there exists a positive definite matrix V of dimension d_σ × d_σ
such that the sieve estimator satisfies:

    √n(σ_k − σ0) →_d N(0, V).

Proof. See Appendix B.3. □
Once we fix the number of interior points, we can perform MLE to estimate the Sieve
estimator. We discuss how to compute the Sieve-ML estimator in the next section.
4.1. Inference via Bootstrap. In the last section we proved asymptotic normality for the
Sieve-ML estimator θ = (β(τ ), σ). However, computing the convergence speed µijk for βk,j (τi )
by explicit formula can be difficult in general. To conduct inference, we recommend using the
nonparametric bootstrap.
Define (x_i^B, y_i^B) as a resampling of the data (x_i, y_i) with replacement. Define the estimator

    θ_k^B = arg max_{θ_k ∈ Θ_k} E_n^B[log g(y_i^B | x_i^B, θ_k)],    (4.5)

where E_n^B denotes the empirical average over the resampled data (x_i^B, y_i^B).

Chen and Pouzo (2013) establish the validity of the nonparametric bootstrap in
semiparametric models for a general functional of a parameter θ. The Lemma below is an
implementation of Theorem 5.1 of Chen and Pouzo (2013).

Lemma 9 (Validity of the Bootstrap). Under conditions C1-C6 and C9, choosing the number
of knots k according to the conditions stated in Theorem 1, the bootstrap estimator defined
in equation (4.5) has the following property:

    (β_{k,j}(τ)^B − β_{k,j}(τ)) / µ_{ijk} →_{d*} N(0, 1).    (4.6)

Proof. See Appendix B.3. □
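A minimal sketch of this resampling scheme is given below; `sieve_ml` is an illustrative placeholder for the estimator in (4.1), not the authors' code.

```python
import numpy as np

def bootstrap_sieve(y, X, sieve_ml, n_boot=200, seed=0):
    """Nonparametric bootstrap as in (4.5): resample (x_i, y_i) with replacement
    and re-solve the sieve problem on each bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)              # resampling with replacement
        draws.append(sieve_ml(y[idx], X[idx]))        # theta_k^B on the bootstrap sample
    return np.asarray(draws)                          # use quantiles of draws for CIs
```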
5. Algorithm for Computing Sieve MLE
5.1. Global Optimization and Flexible Distribution Function. In the maximization
problem (4.1), the dimension of the parameter is possibly large. As the number of knots k
grows, the dimensionality of the parameter space M becomes large compared to usual practical
econometric applications. In addition, the parameter space Ω can be complex if we consider
more flexible distribution families of the EIV, e.g., mixture of normals.
More specifically, assume the number of mixture components is w. Then the parameter space
Ω is a (3w − 2)-dimensional space. Consider (p, µ, σ) ∈ Ω. Here p ∈ [0, 1]^{w−1} is a (w − 1)-dimensional
vector of the weights of the first w − 1 components; the weight of the last component is
1 − Σ_{i=1}^{w−1} p_i. µ ∈ R^{w−1} is the vector of means of the first w − 1 components; the mean of
the last component is −Σ_{i=1}^{w−1} p_i µ_i / (1 − Σ_{i=1}^{w−1} p_i), so that the mixture has mean zero.
σ ∈ R^w is the vector of standard deviations of the components.
In general, if the number of knots is k, and the number of mixture components is w, then
the total number of parameters is dx k + 3w − 2.
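A sketch of this parametrization is given below; the choice to pin down the last component's mean so that E[ε] = 0 (Condition C3) reflects our reading of the construction above, and the function is illustrative rather than the authors' implementation.

```python
import numpy as np
from scipy.stats import norm

def mixture_pdf(eps, p_head, mu_head, sig):
    """EIV density for w mixture components: p_head and mu_head have length w-1,
    sig has length w; the last weight and mean are implied by the constraints."""
    p_last = 1.0 - p_head.sum()
    mu_last = -(p_head @ mu_head) / p_last            # enforces E[eps] = 0
    p = np.append(p_head, p_last)
    mu = np.append(mu_head, mu_last)
    return (p * norm.pdf(eps[:, None], loc=mu, scale=sig)).sum(axis=1)
```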
Since the number of parameters is potentially large and the maximization problem (4.1) is
typically highly nonlinear, the usual (local) optimization algorithms do not generally work well and
can become trapped at a local optimum. We recommend using global optimization algorithms such
as the genetic algorithm or the efficient global optimization (EGO) algorithm to compute the
suggested sieve estimator. While such global algorithms have good precision, they can be slow.
Thus, in the next subsection we provide an EM algorithm that is computationally tractable locally.
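Before turning to that EM algorithm, the sketch below illustrates the global step using SciPy's differential evolution, a population-based optimizer in the same spirit as the genetic algorithm recommended here (not the specific algorithm used by the authors); it reuses the illustrative helpers from the earlier sketches, and the box constraints are placeholders.

```python
import numpy as np
from scipy.optimize import differential_evolution

def neg_sieve_loglik(theta_free, y, X, k, d_x):
    """Negative sieve log-likelihood over the free parametrization sketched above."""
    beta_knots = knots_from_free_params(theta_free[:k * d_x], k, d_x)
    sigma = np.exp(theta_free[-1])                    # keep sigma > 0
    return -empirical_log_likelihood(beta_knots, sigma, y, X)

def global_fit(y, X, k):
    d_x = X.shape[1]
    bounds = [(-5.0, 5.0)] * (k * d_x) + [(-3.0, 3.0)]    # illustrative box constraints
    result = differential_evolution(neg_sieve_loglik, bounds, args=(y, X, k, d_x),
                                    maxiter=200, seed=0, polish=True)
    return result.x, -result.fun
```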
5.2. EM Algorithm for Finding Local Optima of the Likelihood Function. In this
section, we propose a special EM algorithm that can be used in applications to generate a
consistent estimator.

Definition 3. Define κ(τ|x, y, θ) := f(y − xβ(τ)|σ) / ∫_0^1 f(y − xβ(u)|σ) du as the Bayesian
probability of τ given (x, y, θ).

Define

    H(θ′|θ) = E_{x,τ}[log κ(τ|x, y, θ′) | x, y, θ]
            = E_{x,τ}[ ∫_{−∞}^{∞} ( f(y − xβ(τ)|σ) / ∫_0^1 f(y − xβ(u)|σ) du ) log( f(y − xβ′(τ)|σ′) / ∫_0^1 f(y − xβ′(u)|σ′) du ) dy ]

as the expectation of the log Bayesian likelihood over τ ∈ [0, 1] conditional on observation (x, y)
and parameter θ′, given a prior θ.

Define

    Q(θ′|θ) = E_{x,τ}[κ(τ|x, y, θ) log f(y − xβ′(τ)|σ′)] = L(θ′) + H(θ′|θ).
In the EM algorithm setting, each maximization step weakly increases the likelihood L,
and thus the algorithm moves toward a maximizer of the likelihood function L.

Lemma 10 (Local Optimality of the EM Algorithm). The EM algorithm solves

    θ_{p+1} ∈ arg max_θ Q(θ|θ_p)    (5.1)

and produces a sequence θ_p given an initial value θ_0. If the sequence has the following properties:
(1) the sequence L(θ_p) is bounded;
(2) L(θ_{p+1}|θ_p) − L(θ_p|θ_p) ≥ 0;
then the sequence θ_p converges to some θ* in D, and by property (2), θ* is a local optimum
of the log-likelihood function L.

Proof. See Appendix B.3. □
This Lemma says that, instead of maximizing the log-likelihood function directly, we can run iterations
conditional on the previous estimate of θ to obtain a sequence of estimates θ_p converging to
a maximizer of the likelihood. The computational steps are much more attractive than the
direct maximization of the log-likelihood function L(θ), since in each step we only need to do
a partial maximization taking the prior estimates as known parameters. The next section
presents a simpler version of the algorithm as weighted least squares when the distribution
function f(·|σ) is in the normal family.
The maximization of the Q function based on the previous step is simpler than the global
optimization algorithm in the following sense: unlike the direct maximization of L(θ) over the space
D, the maximization of Q(·|θ_p) amounts to solving a separate maximization problem for
each τ, each of which involves many fewer parameters. More specifically, the maximization problem in
equation (5.1) is equivalent to the following maximization problem for each τ, given fixed σ′:

    (β′(τ), σ′) ∈ arg max E_x[ ∫ ( f(y − xβ(τ)|σ) / ∫_0^1 f(y − xβ(u)|σ) du ) log f(y − xβ′(τ)|σ′) dy ].    (5.2)

The EM algorithm is much easier to implement than searching the functional space M for
the global maximizer of the log-likelihood function L(θ). The EM algorithm is useful for
obtaining a local optimum, but not necessarily a global optimum. However, in practice, if we
start from a solution of (4.1) obtained with the global optimization algorithm and then apply the
EM algorithm, we can potentially improve the speed of computation compared to directly
using the genetic algorithm.
5.3. Weighted Least Squares. Under the normality assumption on the EIV term ε, the
maximization of Q(·|θ) is extremely simple: it reduces to a weighted least squares problem.
Suppose the disturbance ε ∼ N(0, σ^2). Then the M-step maximization becomes the following,
with parameter θ = [β(·), σ]:

    arg max_{θ′} Q(θ′|θ) := E[ log(f(y − xβ′(τ))|θ′) κ(x, y, θ) | θ ]
        = E[ ∫_{τ=0}^{1} ( f(y − xβ(τ)|σ) / ∫_0^1 f(y − xβ(u)|σ) du ) ( −(1/2) log(2πσ′^2) − (y − xβ′(τ))^2 / (2σ′^2) ) dτ ].    (5.3)

It is easy to see from the above equation that the maximization over β′(·) given θ amounts to
minimizing a weighted sum of squares. As in standard normal MLE, the FOC for β′(·)
does not depend on σ′^2; σ′^2 is solved for after all the β′(τ) are obtained from equation (5.3).
Therefore, the EM algorithm becomes a simple weighted least squares algorithm, which is both
computationally tractable and easy to implement in practice.
Given an initial estimate of a weighting matrix W, the weighted least squares estimates of
β and σ are

    β̂(τ_k) = (X′ W_k X)^{−1} X′ W_k y,
    σ̂^2 = (1/(NK)) Σ_k Σ_i w_{ik} ε̂_{ik}^2,

where W_k is the diagonal matrix formed from the kth column of W, which has elements w_{ik}.
Given estimates ε̂_k = y − X β̂(τ_k) and σ̂, the weights w_{ik} for observation i in the estimation
of β(τ_k) are

    w_{ik} = φ(ε̂_{ik}/σ̂) / ( (1/K) Σ_k φ(ε̂_{ik}/σ̂) ),    (5.4)

where K is the number of τ's in the sieve, e.g., K = 9 if the quantile grid is {τ_k} =
{0.1, 0.2, ..., 0.9}, and φ(·) is the pdf of the standard normal distribution.
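The following sketch implements one pass of this weighted least squares iteration under the Gaussian EIV assumption, with weights computed as in (5.4); it is a minimal illustration rather than the authors' code, and it does not impose monotonicity of β̂(τ) across the grid.

```python
import numpy as np
from scipy.stats import norm

def wls_em_step(beta_knots, sigma, y, X):
    """One EM iteration: beta_knots is (K, d_x) over the grid {tau_k}; returns updates."""
    K = beta_knots.shape[0]
    resid = y[:, None] - X @ beta_knots.T              # eps_hat_{ik}
    W = norm.pdf(resid / sigma)
    W /= W.mean(axis=1, keepdims=True)                 # w_ik as in (5.4)
    new_beta = np.empty_like(beta_knots)
    for k in range(K):                                 # weighted LS for each tau_k
        Wk = W[:, k]
        XtW = X.T * Wk
        new_beta[k] = np.linalg.solve(XtW @ X, XtW @ y)
    new_resid = y[:, None] - X @ new_beta.T
    new_sigma = np.sqrt((W * new_resid ** 2).mean())   # sigma^2 update averaged over i and k
    return new_beta, new_sigma
```

In practice, as discussed in Section 5.2, the iteration can be initialized at the solution returned by the global optimizer and repeated until the parameters stabilize.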
6. Monte-Carlo Simulations
We examine the properties of our estimator empirically in Monte-Carlo simulations. Let the
data-generating process be
yi = β0 (ui ) + x1i β1 (ui ) + x2i β2 (ui ) + εi
where N = 1,000, the conditional quantile u_i of each individual follows u_i ∼ U[0, 1], and the
measurement error ε_i is drawn from a mixed normal distribution:

    ε_i ∼ N(−3, 1) with probability 0.5,  N(2, 1) with probability 0.25,  N(4, 1) with probability 0.25.

The covariates are functions of random normals,

    x_{1i} = log|z_{1i}| + 1,  x_{2i} = log|z_{2i}| + 1,

where z_1, z_2 ∼ N(0, I_2). Finally, the coefficient vector is a function of the conditional quantile
u_i of individual i:

    (β_0(u), β_1(u), β_2(u))′ = (1 + 2u − u^2, (1/2) exp(u), u + 1)′.
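A minimal sketch that draws one sample from this DGP (our code, for illustration only) is given below.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
comp = rng.choice(3, size=n, p=[0.5, 0.25, 0.25])                     # mixture component
eps = rng.normal(loc=np.array([-3.0, 2.0, 4.0])[comp], scale=1.0)     # mean-zero mixed normal
z1, z2 = rng.normal(size=(2, n))
x1, x2 = np.log(np.abs(z1)) + 1, np.log(np.abs(z2)) + 1
u = rng.uniform(size=n)
beta = np.column_stack([1 + 2 * u - u ** 2, 0.5 * np.exp(u), u + 1])  # beta0, beta1, beta2
y = beta[:, 0] + x1 * beta[:, 1] + x2 * beta[:, 2] + eps
```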
Figure 2. Monte Carlo Simulation Results: Mean Bias of β_1(τ). [Figure: mean bias plotted against quantile; series: Quantile Regression, Bias-Corrected, and Bias-Corrected 95% Confidence Intervals.]
Simulation Results. In Figures 2 and 3, we plot the mean bias of quantile regression of y
on a constant, x_1, and x_2 and contrast it with the mean bias of our estimator. Quantile
regression is badly biased, with lower quantiles biased upward and upper quantiles biased
downward, both toward the median-regression coefficients.
We also plot bootstrapped 95% confidence intervals (dashed green lines) to show the precision
of our estimator. Especially in the case of nonlinear β1 (·) in Figure 2, we reject the quantile
coefficients, which fall outside the 95% confidence interval of our estimates.
Figure 4 shows the true mixed-normal distribution of the measurement error ε as defined
above (solid blue line) contrasted with the estimated distribution of the measurement error
(dotted red line) and the 95% confidence interval of the estimated density (dashed green line).
Despite the bimodal nature of the true measurement error distribution, our algorithm successfully estimates its distribution with very little bias and relatively tight confidence bounds.
7. Empirical Application
Angrist et al. (2006) estimate that the heterogeneity in the returns to education across the
conditional wage distribution has increased over time, as seen below in Figure 5, reproduced
Figure 3. Monte Carlo Simulation Results: Mean Bias of β_2(τ). [Figure: mean bias plotted against quantile; series: Quantile Regression, Bias-Corrected, and Bias-Corrected 95% Confidence Intervals.]
from Angrist et al. Estimating the quantile-regression analog of a simple Mincer regression,
they estimate models of the form
qy|X (τ ) = β0 (τ ) + β1 (τ )educationi + β2 (τ )experiencei + β3 (τ )experience2i + β4 (τ )blacki (7.1)
where qy|X (τ ) is the τ th quantile of the conditional (on the covariates X) log-wage distribution,
the education and experience variables are measured in years, and black is an indicator variable.
They find that in 1980, an additional year of education was associated with a 7% increase in
wages across all quantiles, nearly identical to OLS estimates. In 1990, they again find that
most of the wage distribution has a similar education-wage gradient but that higher conditional
(wage) quantiles saw a slightly stronger association between education and wages. By 2000,
the education coefficient was almost seven percentage points higher for the 95th percentile than
for the 5th percentile.
We observe a different pattern when we correct for measurement-error bias in the self-reported
wages in the Angrist et al. census data and compare results from our estimation
procedure with standard quantile regression.8 We first use our genetic algorithm to maximize

8 We are able to replicate the Angrist et al. quantile regression results (shown in the dashed blue line) almost
exactly. We estimate β(τ) over a τ grid of 99 knots that stretches from 0.01 to 0.99. However, we do not report
results for quantiles τ < .1 or τ > .9 because these results are unstable and noisy.
Figure 4. Monte Carlo Simulation Results: Distribution of Measurement Error. [Figure: density plotted against measurement error; series: True Density, Estimated Density, and 95% CI.]
the Sieve likelihood with an EIV density f that is a mean-zero mixture of three normal distributions with unknown variances. The genetic algorithm excels at optimizing over non-smooth
surfaces where other algorithms would be sensitive to start values. However, running the genetic algorithm on our entire dataset is computationally costly since the data contain over
65,000 observations for each year and our parameter space is large. To speed up convergence,
we average coefficients estimated on 100 subsamples of 2,000 observations each. This procedure
gives consistent parameter estimates and nonparametric confidence intervals provided the ratio
of the subsample size and the original number of observations goes to zero in the limit. The
95% nonparametric confidence intervals are calculated pointwise as the .025 and .975 quantiles of the empirical distribution of the subsampling parameter estimates. Figure 6 plots the
estimated distribution of the measurement error (solid blue line) in the 2000 data, and the
dashed red lines report the 95% confidence bounds on this estimate, obtained via subsampling.
The estimated EIV distributions for the 1980 and 1990 data look similar—we find that the
distribution of the measurement error in self-reported log wages is approximately Gaussian.
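A sketch of this subsample-and-average procedure is given below; `estimate` is a placeholder for one run of the sieve ML estimator on a subsample and is not the authors' implementation.

```python
import numpy as np

def subsample_average(y, X, estimate, n_subsamples=100, subsample_size=2_000, seed=0):
    """Average coefficients over subsamples and form pointwise 95% bands from
    the .025 and .975 quantiles of the subsample estimates."""
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_subsamples):
        idx = rng.choice(len(y), size=subsample_size, replace=False)
        draws.append(estimate(y[idx], X[idx]))        # coefficient vector per subsample
    draws = np.asarray(draws)
    point = draws.mean(axis=0)                        # averaged coefficients
    ci = np.quantile(draws, [0.025, 0.975], axis=0)   # pointwise 95% bands
    return point, ci
```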
Figure 7 plots the education coefficient β̂_1(τ) from estimating equation (7.1) by weighted
least squares.9 While not causal estimates of the effect of education on wages, the results

9 Because preliminary testing has shown that the weighted least squares approach improves upon the optimum
reached by the genetic algorithm, we continue to refine our genetic algorithm. The coefficient results we report
Figure 5. Angrist, Chernozhukov, and Fernandez-Val (2006) Figure 2
suggest that in 1980, education was more valuable at the middle and top of the conditional
income distribution. This pattern contrasts with the standard quantile-regression results, which
show constant returns to education across quantiles of the conditional wage distribution (as
discussed above). For 1990, the pattern of increasing returns to education for higher quantiles
is again visible with the very highest quantiles seeing a large increase in the education-wage
gradient. The estimated return to education for the .9 quantile is over twice as large as estimates
obtained from traditional quantile regression. In the 2000 decennial census, the returns to
education appear nearly constant across most of the distribution, with two notable exceptions.
First, the lowest income quantiles have a noticeably lower wage gain from a year of education than
middle-income quantiles. Second, the top-of-the-distribution increases in the education-wage premium that
were present in the 1990 results are even larger in 2000.
distribution in 2000 increases by over 30 log points for every year of education. Over time—
especially between 1980 and 1990—we see an overall increase in the returns to education. While
this trend is consistent with the observations of Angrist et al. (2006), the distributional story
below are from weighted-least squares, which assumes normality of the measurement error. As Figure 6 shows,
this assumption seems to be consistent with the data.
Figure 6. Estimated Distribution of Wage Measurement Error. [Figure: density plotted against measurement error; series: Estimated Density and 95% Confidence Intervals.]
Notes: Graph plots the estimated distribution of the measurement error.
that emerges from correcting for measurement error suggests that the returns to education have
become disproportionately concentrated in high-wage professions.
8. Conclusion
In this paper, we develop a methodology for estimating the functional parameter β(·) in
quantile regression models when there is measurement error in the dependent variable. Assuming that the measurement error follows a distribution that is known up to a finite-dimensional
parameter, we establish general convergence speed results for the MLE-based approach. Under
a discontinuity assumption (C8), we establish the convergence speed of the Sieve-ML estimator.
We use genetic algorithms to maximize the Sieve likelihood, and show that when the distribution of the EIV is normal, the EM algorithm becomes an iterative weighted least squares
algorithm that is easy to compute. We demonstrate the validity of bootstrapping based on
asymptotic normality of our estimator and suggest using a nonparametric bootstrap procedure
for inference.
Finally, we revisited the Angrist et al. (2006) question of whether the returns to education
across the wage distribution have been changing over time. We find a somewhat different
Figure 7. Returns to Education Correcting for LHS Measurement Error. [Figure: three panels (1980, 1990, 2000) plotting the education coefficient against quantile; series: Quantile Regression and Bias-Corrected Quantile Regression.]
Notes: Graphs plot education coefficients estimated using quantile regression and our bias-correcting
algorithm. The dependent variable is log wage for prime age males, and other controls include a
quadratic in years of experience and a dummy for blacks. The data comes from the indicated
decennial Census and consists of 40-49 year old white and black men born in America. The number of
observations in each sample is 65,023, 86,785, and 97,397 in 1980, 1990, and 2000, respectively.
pattern than prior work, highlighting the importance of correcting for errors in the dependent
variable of conditional quantile models. When we correct for likely measurement error in the
self-reported wage data, the wages of low-wage men had much less to do with their education
levels than they did for those at the top of the conditional-wage distribution. By 2000, the wage-education gradient was nearly constant across most of the wage distribution, with somewhat
lower returns for low-wage earners and significant growth in the predictive power of education for
top wages.
References
[1] Angrist, J., Chernozhukov, V., and Fernandez-Val, I. (2006), "Quantile Regression under Misspecification, with an Application to the U.S. Wage Structure," Econometrica, 74, 539-563.
[2] Burda, M., Harding, M., and Hausman, J. (2010), "A Poisson Mixture Model of Discrete Choice," Journal of Econometrics, 166(2), 184-203.
[3] Carroll, R. J., and Wei, Y. (2009), "Quantile Regression With Measurement Error," Journal of the American Statistical Association, 104(487), 1129-1143.
[4] Chen, S. (2002), "Rank Estimation of Transformation Models," Econometrica, 70(4), 1683-1697.
[5] Chen, X. (2009), "Large Sample Sieve Estimation of Semiparametric Models," Cowles Foundation Paper No. 1262.
[6] Chen, X., and Pouzo, D. (2013), "Sieve Quasi Likelihood Ratio Inference on Semi/Nonparametric Conditional Moment Models," Cowles Foundation Paper No. 1897.
[7] Chernozhukov, V., Fernandez-Val, I., and Galichon, A., "Improving Point and Interval Estimates of Monotone Functions by Rearrangement."
[8] Chernozhukov, V., and Hansen, C. (2005), "An IV Model of Quantile Treatment Effects," Econometrica, 73(1), 245-261.
[9] Chernozhukov, V., and Hansen, C. (2008), "Instrumental Variable Quantile Regression: A Robust Inference Approach," Journal of Econometrics, 142(1), 379-398.
[10] Cosslett, S. R. (2004), "Efficient Semiparametric Estimation of Censored and Truncated Regressions via a Smoothed Self-Consistency Equation," Econometrica, 72, 1277-1293.
[11] Cosslett, S. R. (2007), "Efficient Estimation of Semiparametric Models by Smoothed Maximum Likelihood," International Economic Review, 48, 1245-1272.
[12] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, Series B, 39(1), 1-38.
[13] Evdokimov, K., "Identification and Estimation of a Nonparametric Panel Data Model with Unobserved Heterogeneity," mimeo.
[14] Hausman, J. A., Lo, A. W., and MacKinlay, A. C. (1992), "An Ordered Probit Analysis of Transaction Stock Prices," Journal of Financial Economics, 31(3), 319-379.
[15] Hausman, J. A., Abrevaya, J., and Scott-Morton, F. (1998), "Misclassification of a Dependent Variable in a Qualitative Response Model," Journal of Econometrics, 87, 239-269.
[16] Hausman, J. A. (2001), "Mismeasured Variables in Econometric Analysis: Problems from the Right and Problems from the Left," Journal of Economic Perspectives, 15, 57-67.
[17] Horowitz, J. L. (2011), "Applied Nonparametric Instrumental Variables Estimation," Econometrica, 79, 347-394.
[18] Kuhn, T. (1990), "Eigenvalues of Integral Operators with Smooth Positive Definite Kernels," Archiv der Mathematik, 49(6), 525-534.
[19] Shen, X. (1997), "On Methods of Sieves and Penalization," The Annals of Statistics, 25.
[20] Schennach, S. M. (2008), "Quantile Regression with Mismeasured Covariates," Econometric Theory, 24(4), 1010-1043.
Appendix A. Deconvolution Method
Although the direct MLE method is used to obtain identification and some asymptotic
properties of the estimator, the main asymptotic property of the maximum likelihood estimator
depends on the discontinuity assumption C8. Evdokimov (2010) establishes a method using
deconvolution to solve a similar problem in a panel data setting. He adopts a nonparametric
setting and therefore needs kernel estimation methods.
Instead, in the special case of the quantile regression setting, we can further exploit the linear
structure without using nonparametric methods. Deconvolution offers a straightforward
view of the estimation procedure.
The CDF of a random variable w can be computed from its characteristic function φ_w(s):

    F(w) = 1/2 − lim_{q→∞} ∫_{−q}^{q} (e^{−iws} / (2πis)) φ_w(s) ds.    (A.1)

β0(τ) satisfies the following conditional moment condition:

    E[F(xβ0(τ)|x)] = τ.    (A.2)
Since we only observe y = xβ(τ) + ε, we have φ_{xβ}(s) = E[exp(isy)] / φ_ε(s). Therefore β0(τ) satisfies:

    E[ 1/2 − τ − lim_{q→∞} ∫_{−q}^{q} E[e^{−i(y−xβ0(τ))s} | x] / (2πis φ_ε(s)) ds | x ] = 0.    (A.3)

In the above expression, the limit symbol before the integral cannot be exchanged with the
conditional expectation symbol that follows it. But in practice, they can be exchanged if the
integral is truncated.
A simple way to exploit the information in the above conditional moment equation is to consider
the following moment conditions:

    E[ x ( 1/2 − τ + Im( lim_{q→∞} ∫_{−q}^{q} E[e^{−i(y−xβ0(τ))s} | x] / (2πs φ_ε(s)) ds ) ) ] = 0.    (A.4)

Let F(xβ, y, τ, q) = 1/2 − τ − Im( ∫_{−q}^{q} e^{−i(y−xβ)s} / (2πs φ_ε(s)) ds ).
Given a fixed truncation threshold q, the optimal estimator of the kind E_n[f(x)F(xβ, y, τ, q)] is:

    E_n[ E[∂F/∂β | x] / σ(x)^2 · F(xβ, y, τ, q) ] = 0,    (A.5)

where σ(x)^2 := E[F(xβ, y, τ)^2 | x].
Another convenient estimator is:
    arg min_β E_n[F(xβ, y, τ, q)^2].    (A.6)
These estimators have similar behavior in terms of convergence speed, so we discuss only
the optimal estimator. Although it cannot be computed directly, because the weight
E[∂F/∂β|x]/σ(x)^2 depends on β, it can be computed by an iterative method.
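The sketch below evaluates the truncated moment F(xβ, y, τ, q) numerically and minimizes E_n[F²] as in (A.6), assuming a known Gaussian φ_ε; it is an illustration under that assumption, not the authors' implementation, and it exploits the evenness of the integrand to integrate over positive s only.

```python
import numpy as np
from scipy.optimize import minimize

def F_trunc(xb, y, tau, q, sigma_eps, n_grid=2_000):
    """Truncated deconvolution moment 1/2 - tau - Im(integral), Gaussian phi_eps."""
    s = np.linspace(1e-3, q, n_grid)                    # positive half; integrand is even in s
    phi_eps = np.exp(-0.5 * sigma_eps ** 2 * s ** 2)    # Gaussian characteristic function
    v = (y - xb)[:, None]
    integrand = np.imag(np.exp(-1j * v * s)) / (2 * np.pi * s * phi_eps)
    ds = s[1] - s[0]
    integral = 2 * integrand.sum(axis=1) * ds           # Riemann sum over [-q, q]
    return 0.5 - tau - integral

def deconvolution_beta(y, X, tau, q, sigma_eps, b0):
    """Minimize E_n[F(x beta, y, tau, q)^2] as in (A.6)."""
    obj = lambda b: np.mean(F_trunc(X @ b, y, tau, q, sigma_eps) ** 2)
    return minimize(obj, b0, method="Nelder-Mead").x
```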
Theorem 2. (a) Under condition OS with λ ≥ 1, the estimator β̂ satisfying equation (A.5)
with truncation threshold q_n = C N^{1/(2λ)} satisfies:

    β̂ − β0 = O(N^{−1/(2λ)});    (A.7)

(b) under condition SS, the estimator β̂ satisfying equation (A.5) with truncation threshold
q_n = C log(n)^{1/λ} satisfies:

    β̂ − β0 = O(log(n)^{−1/λ}).    (A.8)
Appendix B. Proofs of Lemmas and Theorems
B.1. Lemmas and Theorems in Section 2. In this section we prove the Lemmas and
Theorems in section 2.
Proof of Lemma 1.

Proof. Consider any sequence of monotone functions f1, f2, ..., fn, ..., each mapping [a, b] into
some closed interval [c, d]. For bounded monotone functions, pointwise convergence implies
uniform convergence, and therefore this space is compact. Hence the product space
B1 × B2 × ... × B_{d_x} is compact. It is complete since the L2 functional space is complete and
the limit of monotone functions is still monotone. □
Proof of Lemma 2.

Proof. WLOG, under conditions C1-C3, we can assume the variable x1 is continuous. If there
exist β(·) and f(·) which generate the same density g(y|x, β(·), f) as the true parameters β0(·)
and f0(·), then by applying the Fourier transformation,

    φ(s) ∫_0^1 exp(isxβ(τ)) dτ = φ0(s) ∫_0^1 exp(isxβ0(τ)) dτ.

Denote m(s) = φ(s)/φ0(s) = 1 + Σ_{k=2}^{∞} a_k (is)^k around a neighborhood of 0. Therefore,

    m(s) ∫_0^1 exp(isx_{−1}β_{−1}(τ)) Σ_{k=0}^{∞} (is)^k x_1^k β_1(τ)^k / k! dτ = ∫_0^1 exp(isx_{−1}β_{0,−1}(τ)) Σ_{k=0}^{∞} (is)^k x_1^k β_{0,1}(τ)^k / k! dτ.

Since x1 is continuous, it must be that the corresponding polynomials of x1 are the same
for both sides. Namely,

    m(s) (is)^k / k! ∫_0^1 exp(isx_{−1}β_{−1}(τ)) β_1(τ)^k dτ = (is)^k / k! ∫_0^1 exp(isx_{−1}β_{0,−1}(τ)) β_{0,1}(τ)^k dτ.

Divide both sides of the above equation by (is)^k / k!; letting s approach 0, we get:

    ∫_0^1 β_1(τ)^k dτ = ∫_0^1 β_{0,1}(τ)^k dτ.

By assumption C2, β_1(·) and β_{0,1}(·) are both strictly monotone, differentiable, and greater
than or equal to 0. So β_1(τ) = β_{0,1}(τ) for all τ ∈ [0, 1].
Now consider the same equation as above. Dividing both sides by (is)^k / k!, we get
m(s) ∫_0^1 exp(isx_{−1}β_{−1}(τ)) β_{0,1}(τ)^k dτ = ∫_0^1 exp(isx_{−1}β_{0,−1}(τ)) β_{0,1}(τ)^k dτ, for all k ≥ 0.
Since β_{0,1}(τ)^k, k ≥ 1, form a functional basis of L2[0, 1], m(s) exp(isx_{−1}β_{−1}(τ)) =
exp(isx_{−1}β_{0,−1}(τ)) for all s in a neighborhood of 0 and all τ ∈ [0, 1]. If we differentiate both
sides with respect to s and evaluate at 0 (noting that m′(0) = 0), we get:

    x_{−1}β_{−1}(τ) = x_{−1}β_{0,−1}(τ),  for all τ ∈ [0, 1].

By assumption C2, E[x′x] is non-singular. Therefore E[x_{−1}′x_{−1}] is also non-singular. Hence
the above equation implies β_{−1}(τ) = β_{0,−1}(τ) for all τ ∈ [0, 1]. Therefore β(τ) = β0(τ). This
implies that φ(s) = φ0(s). Hence f(ε) = f0(ε). □
Proof of Lemma 3.
Proof. Proof: (a) If the equation stated in (a) holds a.e., then we can apply Fourier transformation on both sides, i.e., conditional on x,
ˆ
e
isy
ˆ
1
fy (y − xβ(τ ))xt(τ )dτ dy =
0
ˆ
e
isy
ˆ
0
1ˆ 1
0
fσ (y − xβ(τ ))dτ dy.
(B.1)
Let q(τ ) = xβ(τ ), which is a strictly increasing function. We can use change of variables to
change τ to q.
We get:
ˆ 1
ˆ
ˆ 1
ˆ
isxβ(τ )
is(y−xβ(τ )) �
isxβ(τ )
e
xt(τ ) e
f (y − xβ(τ ))dydτ =
e
eis(y−xβ(τ )) fσ (y − xβ(τ ))dydτ.
0
0
(B.2)
or, equivalently,
ˆ
1
e
isxβ(τ )
xt(τ )dτ (−is)φε (s) =
0
´ �
´
given that f (y)eisy dy = −is f (y)eisy dy.
ˆ
0
1
eisxβ(τ ) dτ
∂φε (s)
,
∂σ
(B.3)
26
HAUSMAN, LUO, AND PALMER
Denote
´1
0
eisxβ(τ ) xt(τ )dτ = q(x, s), and
´1
0
eisxβ(τ ) dτ = r(x, s). Therefore,
∂φε (s)
.
∂σ
ε (s)
First we consider normal distribution. φε (s) = exp(−σ 2 s2 /2). So ∂φ∂σ
= −σs2 φε (s).
Therefore, q(s, x) = isσr(s, x), for all s and x with positive probability. Since xβ(τ ) is
strictly increasing, let z = xβ(τ ), and I(y) = 1(xβ(0) ≤ y ≤ xβ(1)).
´∞
´ ∞ isw
xt(z −1 (w))
1
So q(s, x) = −∞ eisw { xz
{ xz � (z −1
I(w)}dw.
� (z −1 (w)) I(w)}dw, and r(s, x) = −∞ e
(w))
q(s, x) = isσr(s, x) means that there is a relationship between the integrands. More specifically, it must be that
−isq(s, x)φ(s) = r(s, x)
xt(z −1 (w))
1
I(w) = d � −1
I(w)/dw.
(B.4)
�
−1
xz (z (w))
xz (z (w))
Obviously they don’t equal to each other for a continuous interval of xi , because the latter
function has a δ function component.
ε (s)
ε (s)
For more general distribution function f , −isq(s, x)φε (s) = r(s, x) ∂φ∂σ
, with ∂φ∂σ
=
�
∞
2
j
φε (s)s m(s), for some m(s) := j=0 aj (is) .
Therefore,
−ism(s)r(s, x) = q(s, x).
Consider local expansion of s in both sides, we get:
−is(
∞
�
j=0
aj (is)j )(
∞ ˆ
�
k=0
1
(xβ(τ ))k dτ
0
∞
�
(is)k
)=
k!
j=0
ˆ
1
(xβ(τ ))k xt(τ )dτ
0
(is)k
.
k!
Compare the k th polynomial of s, we get:
´1
(xβ(τ ))k xt(τ )dτ = 0, if k = 0.
´01
´1
j−1
�
k
j
j−1 dτ (is)
1≤j<k aj (is) ( 0 xβ(τ ))
0 (xβ(τ )) xt(τ )dτ = −
(j−1)! .
Let x1 be the continuous variable, and x−1 be the variables left. Let β1 be the coefficient
corresponds to x1 and let β−1 be the coefficient corresponds to x−1 . Let t1 be the component
in t that corresponds to x1 , and let t−1 be the component in t that corresponds to x−1 .
Then, considering the (k + 1)-th order term in x_1, we get:
∫_0^1 β_1(τ)^k t_1(τ) dτ = 0.
Since {β_1(τ)^k, k ≥ 0} forms a functional basis of L^2[0, 1], t_1(τ) must equal 0.
Considering the k-th order term in x_1, we get:
∫_0^1 β_1(τ)^k (x_{−1} t_{−1}(τ)) dτ = 0.
So x_{−1} t_{−1}(τ) = 0 for all τ ∈ [0, 1]. Since E[x_{−1} x_{−1}'] has full rank, it must be that t_{−1}(τ) = 0. Hence the local identification condition holds.
□
B.2. Lemmas and Theorems in Section 3. In this subsection, we prove the Lemmas and
Theorems in section 3.
Lemma 11 (Donskerness of D). The parameter space D is Donsker.
Proof. By Theorem 2.7.5 of Van der Vaart and Wellner (2000), the space of bounded monotone functions, denoted F, satisfies
log N_{[]}(ε, F, L_r(Q)) ≤ K (1/ε),
for every probability measure Q and every r ≥ 1, with a constant K that depends only on r.
Since D is a product space of bounded monotone functions M and a finite-dimensional bounded compact set Ω, the bracketing number of D under any measure Q is also bounded:
log N_{[]}(ε, D, L_r(Q)) ≤ K d_x (1/ε).
Therefore, ∫_0^δ √(log N_{[]}(ε, D, L_r(Q))) dε < ∞, i.e., D is Donsker.
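To make the finiteness of the entropy integral explicit (a short worked step using the bracketing bound above, not part of the original text):
∫_0^δ √(K d_x / ε) dε = √(K d_x) ∫_0^δ ε^{−1/2} dε = 2√(K d_x δ) < ∞.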
□
Proof of Lemma 5.
Proof. L_n(θ) ≥ L_n(θ_0) by definition, and we know that L(θ_0) ≥ L(θ) by the information inequality.
By Lemma 10, since D is Donsker, the following stochastic equicontinuity condition holds:
√n (L_n(θ) − L_n(θ_1) − (L(θ) − L(θ_1))) →_p r_n,
for some generic term rn = op (1), if |θ − θ1 | ≤ ηn for small enough ηn . Therefore uniform
convergence of Ln (θ) − L(θ) holds in D. In addition, by Lemma 2, the parameter θ is globally
identified. Also, it is obvious that L is continuous in θ.
Therefore, the usual consistency result for the MLE applies here.
□
Proof of Lemma 6.
Proof. By definition,
En [log(g(y|x, β, σ))] ≥ En [log(g(y|x, β0 , σ0 ))].
(B.5)
And we know that, by the information inequality, E[log(g(y|x, β, σ))] ≤ E[log(g(y|x, β_0, σ_0))]. Therefore, E[log(g(y|x, β, σ))] − E[log(g(y|x, β_0, σ_0))] ≥ (1/√n) G_n[log(g(y|x, β, σ)) − log(g(y|x, β_0, σ_0))].
By the Donskerness of the parameter space D, the corresponding stochastic equicontinuity condition holds. Hence,
0 ≤ E[log(g(y|x, β_0, σ_0))] − E[log(g(y|x, β, σ))] ≲_p 1/√n.   (B.6)
Because σ converges to the true parameter at √n speed, we simply use σ_0 in the argument below, since this does not affect the later analysis; we suppress σ_0 in the notation.
Let z(y, x) = g(y|β, x) − g(y|β_0, x). From equation (3.19), we know that the KL discrepancy between g(y|x, β_0) and g(y|x, β), denoted D(g(·|β_0)||g(·|β)), is bounded by C/√n with p → 1. So E_x[||z(y|x)||_1]² ≤ D(g(·|β_0)||g(·|β)) ≤ 2(E[log(g(y|x, β_0, σ_0))] − E[log(g(y|x, β, σ))]) ≲_p 1/√n, where ||z(y|x)||_1 = ∫_{−∞}^{∞} |z(y|x)| dy.
Now consider the characteristic function of y conditional on x:
φ_{xβ}(s) = ∫_{−∞}^{∞} g(y|β, x) e^{isy} dy / φ_ε(s),   and   φ_{xβ_0}(s) = ∫_{−∞}^{∞} g(y|β_0, x) e^{isy} dy / φ_ε(s).
Therefore
φ_{xβ}(s) − φ_{xβ_0}(s) = ∫_{−∞}^{∞} z(y|x) e^{isy} dy / φ_ε(s),   (B.7)
and |φ_{xβ}(s) − φ_{xβ_0}(s)| ≤ ||z(y|x)||_1 / |φ_ε(s)|.
Let F_η(w) = 1/2 − lim_{q→∞} ∫_{−q}^{q} (e^{−iws}/(2πs)) φ_η(s) ds. Then F_η(w) is the CDF of the random variable η.
Let F_η(w, q) = 1/2 − ∫_{−q}^{q} (e^{−iws}/(2πs)) φ_η(s) ds. If the p.d.f. of η is a C¹ function and η has a finite second moment, then F_η(w, q) = F_η(w) + O(1/q²). Since x and β are bounded, let η = xβ(τ); then there exists a common C such that |F_{xβ}(w, q) − F_{xβ}(w)| ≤ C/q² for all x ∈ X and w ∈ R for q large enough.
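As a numerical illustration of the truncated inversion F_η(w, q) (a minimal sketch, not part of the proof; it uses the equivalent one-sided Gil–Pelaez form of the inversion integral and assumes a standard normal η purely as an example):

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def F_trunc(w, q, phi):
    # truncated CDF inversion: 1/2 - (1/pi) * int_0^q Im[exp(-i*s*w) * phi(s)] / s ds
    integrand = lambda s: np.imag(np.exp(-1j * s * w) * phi(s)) / s
    val, _ = quad(integrand, 1e-12, q, limit=400)
    return 0.5 - val / np.pi

phi_eta = lambda s: np.exp(-0.5 * s ** 2)          # characteristic function of eta ~ N(0, 1)
for q in (1.0, 2.0, 4.0):
    approx = F_trunc(0.7, q, phi_eta)
    print(q, approx, abs(approx - norm.cdf(0.7)))  # truncation error shrinks as q grows

In the proof, the same truncation is applied to φ_{xβ}, and the O(1/q²) truncation error is then traded off against the amplification factor χ(q) that arises from dividing by φ_ε(s).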
Therefore |F_{xβ}(w, q) − F_{xβ_0}(w, q)| = |∫_{−q}^{q} (e^{isw}/(2πs)) (φ_{xβ}(s) − φ_{xβ_0}(s)) ds| ≤ |∫_{−q}^{q} ||z(y|x)||_1 / (2πs |φ_ε(s)|) ds| ≲ ||z(y|x)||_1 χ(q).
Consider the following moment condition:
E[x(F_{xβ_0}(xβ_0) − τ)] = 0.   (B.8)
Let F0 be Fxβ0 and F be Fxβ .
We now replace the moment condition by the truncated function F(w, q). By the truncation property, for any fixed δ ∈ R^{d_x}:
δE[x(F_0(xβ) − F_0(xβ_0))] = δE[x(F_0(xβ) − F(xβ))] = δE[x(F_0(xβ, q) − F(xβ, q))] + O(1/q²).   (B.9)
The first term of the right-hand side is bounded by:
E[|δx| χ(q) ||z(y|x)||_1] ≲ E[||z(y|x)||_1] χ(q).
And δE[x(F_0(xβ) − F_0(xβ_0))] = δE[x'x f(xβ_0(τ))](β(τ) − β_0(τ))(1 + o(1)), where f(xβ_0(τ)) is the p.d.f. of the random variable xβ_0(·) evaluated at τ.
Hence, β(τ) − β_0(τ) = O_p(χ(q)/n^{1/4}) + O(1/q²).
Under the OS condition, χ(q) ≲ q^λ, so by choosing q = C n^{1/(4(λ+2))}, ||β(τ) − β_0(τ)||_2 ≲_p n^{−1/(2(λ+2))}. Under the SS condition, χ(q) ≲ C_1(1 + |q|^{C_2}) exp(|q|^λ/C_3); by choosing q = C (log n)^{1/(4λ)}, ||β(τ) − β_0(τ)||_2 ≲_p (log n)^{−1/(2λ)}.
□
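As a quick check of the OS rate in the proof above (a worked balancing step, not in the original text): setting χ(q)/n^{1/4} ≍ 1/q² with χ(q) ≍ q^λ gives q^{λ+2} ≍ n^{1/4}, i.e., q ≍ n^{1/(4(λ+2))}, and the resulting error is of order q^{−2} ≍ n^{−1/(2(λ+2))}, matching the displayed rate.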
B.3. Lemmas and Theorems in Sections 4 and 5. In this subsection we prove the Lemmas
and Theorems in section 4.
Proof of Lemma 8.
Proof. Suppose f satisfies discontinuity condition C8 with degree λ > 0.
Denote lk = (fτ1 , fτ2 , ..., fτk , gσ ).
For any (p_k, δ) ∈ P_k, p_k I_k p_k' = E[∫_R (l_k p_k')²/g dy] ≥ E[(l_k p_k')²], since g is bounded from above.
Assume c := inf_{x∈X, τ∈[0,1]} (xβ'(τ)) > 0. Now consider an interval [y, y + 1/(2λkc)].
Let S(λ) := Σ_{i=0}^{λ} (C_λ^i)², where C_a^b stands for the combinatorial number of ways of choosing b elements from a set of size a.
So,
S(λ) E[∫_R (l_k p_k')² dy] = E[Σ_{j=0}^{λ} (C_λ^j)² ∫_R (l_k(y + j/(2λukc)) p_k'(y + j/(2λukc)))² dy].
And by the Cauchy–Schwarz inequality,
E[Σ_{j=0}^{λ} (C_λ^j)² ∫_R (l_k(y + j/(2λukc)) p_k'(y + j/(2λukc)))² dy] ≥ E[∫_R (Σ_{j=0}^{λ} (−1)^j C_λ^j l_k(y + j/(2λukc)) p_k'(y + j/(2λukc)))² dy].
Let J_k^i := [a + xβ((i+1/2)/k) − 1/(2λuk), a + xβ((i+1/2)/k) + 1/(2λuk)], so
E[Σ_{j=0}^{λ} (C_λ^j)² ∫_R (l_k(y + j/(2λukc)) p_k'(y + j/(2λukc)))² dy] ≥ E[∫_{J_k^i} (Σ_{j=0}^{λ} (−1)^j C_λ^j l_k(y + j/(2λukc)) p_k'(y + j/(2λukc)))² dy].
By the discontinuity assumption,
Σ_{j=0}^{λ} (−1)^j C_λ^j l_k(y + j/(2λukc)) p_k'(y + j/(2λukc)) = (1/(uk)^{λ−1}) ( (c/(λ−1)!) x_i p_i + (c_j/(uk)) Σ_{j=1, j≠i}^{k} x_j p_j ),
with c_j uniformly bounded from above, since f^{(λ)} is L1 Lipschitz except at a.
Notice that the J_k^i do not intersect each other, so
S(λ) E[∫_R (l_k p_k')² dy] ≥ S(λ) Σ_{i=1}^{k} E[∫_{J_k^i} (l_k p_k')² dy] ≥ E[Σ_{i=1}^{k} (1/(2uk)) { (1/(uk)^{λ−1}) ( (c/(λ−1)!) x_i p_i + (c_{jk}/(uk)) Σ_{j=1, j≠i}^{k} x_j p_j ) }²].
For a constant u large enough (depending only on λ, sup c_{jk}, and c),
E[Σ_{i=1}^{k} { (1/(uk)^{λ−1}) ( (c/(λ−1)!) x_i p_i + (c_{jk}/(uk)) Σ_{j=1, j≠i}^{k} x_j p_j ) }²] ≥ c(λ) (1/k^{2λ−1}) Σ_{j=1}^{k} E[x_j² p_j²] ≳ (1/k^{2λ}) ||p||²_2,
with a constant c(λ) that depends only on λ and u.
In addition, from condition C5, we know that θ I_k θ' ≳ ||δ||²_2. Therefore θ I_k θ' ≳ ||δ||²_2 + (1/k^{2λ}) ||p||²_2 ≳ (1/k^{2λ}) ||θ||²_2. Hence the smallest eigenvalue r(I_k) of I_k is bounded below by c/k^λ, with a generic constant c depending on λ, x, and the L1 Lipschitz coefficient of f^{(k)} on the set R − [a − η, a + η].
□
Proof of Lemma 9.
Proof. The proof of this Lemma is based on Chen's (2008) results on the consistency of sieve extremum estimators. Below we recall the five conditions listed in Theorem 3.1 of Chen (2008), which shows that as long as these conditions are satisfied, the sieve estimator is consistent.
Condition C11. [Identification] (1) L(θ0 ) < ∞, and if L(θ0 ) = −∞ then L(θ) > −∞ for all
θ ∈ Θk \ {θ0 } for all k ≥ 1.
(2) There exist a nonincreasing positive function δ(·) and a positive function m(·) such that for all ε > 0 and k ≥ 1,
L(θ_0) − sup_{θ∈Θ_k: d(θ,θ_0)≥ε} L(θ) ≥ δ(k) m(ε) > 0.
Condition C12. [Sieve space] Θ_k ⊂ Θ_{k+1} ⊂ Θ for all k ≥ 1; and there exists a sequence π_k θ_0 ∈ Θ_k such that d(θ_0, π_k θ_0) → 0 as k → ∞.
Condition C13. [Continuity] (1) L(θ) is upper semicontinuous on Θ_k under the metric d(·, ·). (2) |L(θ_0) − L(π_{k(n)} θ_0)| = o(δ(k(n))).
Condition C14. [Compact sieve space] Θ_k is compact under d(·, ·).
Condition C15. [Uniform convergence] (1) For all k ≥ 1, plim sup_{θ∈Θ_k} |L_n(θ) − L(θ)| = 0. (2) ĉ(k(n)) = o_p(δ(k(n))), where ĉ(k(n)) := sup_{θ∈Θ_k} |L_n(θ) − L(θ)|. (3) η_{k(n)} = o(δ(k(n))).
These assumptions are verified below:
For Identification: let d be the metric induced by the L² norm ||·||_2 defined on D. By assumption C4 and the compactness of D, L(θ_0) − sup_{θ∈Θ: d(θ,θ_0)≥ε} L(θ) > 0 for any ε > 0. So, letting δ(k) = 1, the Identification condition holds.
For the Sieve Space condition: our sieve space satisfies Θ_k ⊂ Θ_{2k} ⊂ Θ. In general we can consider the sequence Θ_{2k(1)}, Θ_{2k(2)}, ..., Θ_{2k(n)}, ... instead of Θ_1, Θ_2, Θ_3, ..., with k(n) an increasing function and lim_{n→∞} k(n) = ∞.
For the Continuity condition: since we assume f is continuous and L1 Lipschitz, (1) is satisfied; (2) is satisfied by the construction of our sieve space.
The Compact Sieve Space condition is trivial.
The Uniform Convergence condition is also easy to verify, since the entropy metric of the space Θ_k is finite and uniformly bounded by the entropy metric of Θ. Therefore we have stochastic equicontinuity, i.e., sup_{θ∈Θ_k} |L_n(θ) − L(θ)| = O_p(1/√n).
In (3), η_k is defined as the error in the maximization procedure. Since δ is constant, as soon as our maximizer θ̂_k satisfies L_n(θ̂_k) − sup_{θ∈Θ_k} L_n(θ) → 0, (3) is verified.
Hence, the Sieve MLE estimator is consistent.
□
Proof of Lemma 10.
Proof. This lemma follows directly from Lemma 8. If the neighborhood N_k of θ_0 is chosen as
N_k := {θ ∈ Θ_k : |θ − θ_0|_2 ≤ µ_k/k^λ},
with µ_k a sequence converging to 0, then the non-linear term in E_x[∫_Y (g(y|x, θ_0) − g(y|x, θ))²/g(y|x, θ_0) dy] is dominated by the first-order term θ I_k θ'.
□
Proof of Theorem 1.
Proof. Suppose θ = (β(·), σ) ∈ Θ_k is the sieve estimator. By Lemma 8, the consistency of the sieve estimator shows that ||θ − θ_0||_2 →_p 0. Denote τ_i = (i−1)/k, i = 1, 2, ..., k.
The first-order conditions give:
E_n[−x f'(y − xβ(τ_i)|σ)] = 0   (B.10)
and
E_n[g_σ(y|x, β, σ)] = 0.   (B.11)
Assume that we choose k = k(n) such that:
(1) k^{λ+1} ||θ − θ_0||_2 →_p 0.
(2) k^r ||θ − θ_0||_2 → ∞ with p → 1.
And assume that r > λ + 1. Such a sequence k(n) exists because Lemma 7 provides an upper bound for ||θ − θ_0||, i.e., ||θ_0 − θ||_2 ≲_p n^{−1/(2(2+λ))}.
From the first-order condition, we know that:
E_n[( f_{τ_1}(y|x, θ)/g(y|x, θ), f_{τ_2}(y|x, θ)/g(y|x, θ), ..., f_{τ_k}(y|x, θ)/g(y|x, θ), g_σ(y|x, θ)/g(y|x, θ) )] = 0.
Denote f_{θ_k} = (f_{τ_1}(y|x, θ), f_{τ_2}(y|x, θ), ..., f_{τ_k}(y|x, θ)).
Therefore,
0 = E_n[(f_{θ_k}/g, g_σ/g)] = (E_n[(f_{θ_k}/g, g_σ/g)] − E_n[(f_{θ_k,0}/g_0, g_{σ,0}/g_0)]) + E_n[(f_{θ_k,0}/g_0, g_{σ,0}/g_0)].
The first term can be decomposed as:
E_n[(f_{θ_k}/g, g_σ/g)] − E_n[(f_{θ_k}/g_0, g_σ/g_0)] + E_n[(f_{θ_k}/g_0, g_σ/g_0) − (f_{θ_k,0}/g_0, g_{σ,0}/g_0)].
By C10, we know that the CLT holds pointwise in the space Θ_k for E_n[(f_{θ_k}/g_0, g_σ/g_0)].
Define Σ := U[0, 1] × D. Consider a mapping H : Σ ↦ R, H((u, θ)) = √n (E_n[f_u/g] − E[f_u/g]). By the Donskerness of D and U[0, 1], Σ is Donsker. By C10, we have a pointwise CLT for H((u, θ)).
By condition C2, the Lipschitz condition guarantees that E[|f_u/g(θ) − f_u/g(θ')|²] ≲ ||θ − θ'||²_2.
Therefore, stochastic equicontinuity holds: for γ small enough,
Pr( sup_{|u−u'|≤γ, ||θ−θ'||_2≤γ, θ,θ'∈Θ} |H((u, θ)) − H((u', θ'))| ≥ γ ) → 0.
Similarly, we have stochastic equicontinuity for √n E_n[g_σ/g].
Notice that E[(f_{θ_k,0}/g_0, g_{σ,0}/g_0)] = 0, so
E_n[(f_{θ_k}/g_0, g_σ/g_0) − (f_{θ_k,0}/g_0, g_{σ,0}/g_0)] = o_p(1/√n) + E[(f_{θ_k}/g_0, g_σ/g_0) − (f_{θ_k,0}/g_0, g_{σ,0}/g_0)].
Therefore, the first term = E_n[(f_{θ_k}/g, g_σ/g)] − E_n[(f_{θ_k}/g_0, g_σ/g_0)] + o_p(1).
Similarly, for the second term, using E_n[(f_{θ_k,0}/g_0, g_{σ,0}/g_0)] we can establish a functional CLT:
E_n[(f_{θ_k,0}/g_0, g_{σ,0}/g_0)] = (1/√n) G_n(τ_1, τ_2, ..., τ_k) + o_p(1/√n) ∈ R^{k r d_x + d_p},
where G_n(·) := √n E_n[(f_{τ,0}/g_0 |_{0≤τ≤1}, g_{σ,0}/g_0)] is weakly converging to a tight Gaussian process.
Therefore,
E_n[(f_{θ_k}/g, g_σ/g)] − E_n[(f_{θ_k}/g_0, g_σ/g_0)] = −(1/√n) G_n(τ_1, τ_2, ..., τ_k)(1 + o_p(1)).
Similarly, by stochastic equicontinuity, we can replace f_{θ_k} with f_{θ_0} and g_σ with g_{σ_0} on the left-hand side of the above equation, and we get:
E[(f_{θ_0}, g_{σ_0}) (g − g_0)/g_0²] = (1/√n) G_n(τ_1, τ_2, ..., τ_k)(1 + o_p(1)),   (B.12)
where g − g0 = g(y|x, θ) − g(y|x, θ0 ) := g(y|x, θ) − g(y|x, πk θ0 ) + (g(y|x, πk θ0 ) − g(y|x, θ0 )).
By the construction of the sieve, g(y|x, π_k θ_0) − g(y|x, θ_0) = O(1/k^{r+1}), which depends only on the order of the spline employed in the estimation.
The first term satisfies g(y|x, θ) − g(y|x, π_k θ_0) = (θ − π_k θ_0)(f_{θ̃}, g_{σ̃})', where θ̃ and σ̃ are values between θ and π_k θ_0.
Denote Ĩ_k = E[(f_{θ_0}, g_{σ_0})'(f_{θ̃}, g_{σ̃})]. The eigenvalue norm satisfies ||Ĩ_k − I_k||_2 = ||E[(f_{θ_0}, g_{σ_0})'(f_{θ̃} − f_{θ_0}, g_{σ̃} − g_{σ_0})]||_2.
While ||(f_{θ̃} − f_{θ_0}, g_{σ̃} − g_{σ_0})||_2 ≲ ||θ − π_k θ_0||_2 + ||π_k θ_0 − θ_0||_2 ≲ ||θ − π_k θ_0||_2. (By the choice of k, ||θ − π_k θ_0||_2 ≥ ||θ − θ_0||_2 − ||π_k θ_0 − θ_0||_2; the first term dominates the second because ||π_k θ_0 − θ_0||_2 ≲ O(1/k^r) and k^r ||θ − θ_0||_2 →_p ∞.)
Since sup_{x,y} ||(f_{θ_0}, g_{σ_0})||_2 ≲ k, we have ||Ĩ_k − I_k||_2 ≲ k||θ − π_k θ_0||_2.
Denote H := Ĩ_k − I_k, so
||H||_2 ≲ k||θ − πθ_0||_2 ≲ k||θ − θ_0||_2 ≲_p 1/k^λ.   (B.13)
Ĩ_k^{−1} = (I_k + H)^{−1} = I_k^{−1}(I + I_k^{−1}H)^{−1}. By (B.13), ||I_k^{−1}H||_2 ≤ k^λ · o(1)/k^λ = o_p(1). Therefore, I + I_k^{−1}H is invertible and has eigenvalues bounded from above and below uniformly in k. Hence the largest eigenvalue of Ĩ_k^{−1} is bounded by Ck^λ for some generic constant C, with p → 1. Furthermore, Ĩ_k^{−1} = I_k^{−1}(1 + o_p(1)).
Applying our result in (B.12), we get θ − πθ_0 = I_k^{−1}(1/√n)G_n(1 + o_p(1)). So ||θ − θ_0||_2 = O_p(k^λ/√n).
Moreover, under condition C.11, ||θ − θ_0||_2 ≲_p k^{λ−α}/√n.
Since at the beginning we assumed (1) k^{λ+1}||θ − θ_0||_2 → 0 and (2) k^r||θ − θ_0||_2 → ∞ with p → 1, and ||θ_0 − πθ_0||_2 = O_p(1/k^r), to prove the condition that ||θ − θ_0||_2 = O_p(k^λ/√n) the regularity conditions for k are:
(a) k^{2λ+1}/√n → 0.
(b) k^{λ+r−α}/√n → ∞.
If k satisfies growth conditions (1) and (2), the sieve estimator θ_k satisfies:
θ_k − θ_0 = θ_k − πθ_0 + πθ_0 − θ_0 = O_p(k^λ/√n) + O(1/k^r) = O_p(k^λ/√n).
Under condition C.11,
θ − πθ_0 = I_k^{−1}(1/√n)G_n(1 + o_p(1)) ≲_p k^{λ−α}/√n ≫ 1/k^r.
Therefore,
θ − θ_0 = I_k^{−1}(1/√n)G_n(1 + o_p(1)).
And pointwise asymptotic normality holds.
For the parameter σ, we recall Chen's (2008) result. In fact, we use a result slightly stronger than that in Chen (2008), but the reasoning is similar. For k satisfying growth condition (1), we know that L(θ) − L(θ_0) ≲ ||θ − θ_0||_k, where θ_k ∈ D_k ∩ N_k and ||·||_k is defined by ⟨·, I_k ·⟩.
For a functional of interest h : Θ → R, define the partial derivative:
∂h/∂θ := lim_{t→0} ( h(θ_0 + t[θ − θ_0]) − h(θ_0) ) / t.
Define V̄ to be the space spanned by Θ − θ_0.
By the Riesz representation theorem, there exists v* ∈ V̄ such that (∂h/∂θ)[θ − θ_0] = ⟨θ − θ_0, v*⟩, where ⟨·, ·⟩ is the inner product induced by ||·||.
Below are five conditions needed for asymptotic normality of h(θ).
Condition C16. (i) There is ω > 0 such that |h(θ) − h(θ_0) − (∂h(θ_0)/∂θ)[θ − θ_0]| = O(||θ − θ_0||^ω) uniformly in θ ∈ Θ_n with ||θ − θ_0|| = o(1);
(ii) ||∂h(θ_0)/∂θ|| < ∞;
(iii) there is π_n v* ∈ Θ_k such that ||π_n v* − v*|| × ||θ̂_n − θ_0|| = o_p(n^{−1/2}).
This assumption is easy to verify because our functional h is linear in σ: h(θ) = δ'σ. Therefore the first and third conditions are automatically verified. The second condition holds because of assumption C.6. So v* = (0, E[g_{σ,0} g_{σ,0}']^{−1} δ). Denote I_σ := E[g_{σ,0} g_{σ,0}'].
Let ε_n denote any sequence satisfying ε_n = o(n^{−1/2}). Suppose the sieve estimator converges to θ_0 faster than δ_n.
Define the functional operator µn = En − E.
Condition C17.
sup_{θ∈Θ_{k(n)}; ||θ−θ_0||≤δ_n} µ_n( l(θ, Z) − l(θ ± ε_n π_n v*, Z) − (∂l(θ_0, Z)/∂θ)[±ε_n π_n v*] ) = O_p(ε_n²).
In fact this assumption is the stochastic equicontinuity assumption. It can be verified by the P-Donskerness of Θ and the L1 Lipschitz property that we assume on the function f.
Denote K(θ1 , θ2 ) = L(θ1 ) − L(θ2 ).
Condition C18.
E[ (∂l(θ̂_n, Z)/∂θ)[π_n v*] ] = ⟨θ̂_n − θ_0, π_n v*⟩ + o(n^{−1/2}).
We know that E[(∂l(θ̂_n, Z)/∂θ)[π_n v*]] = E[(g_σ(y|x, θ̂_n)/g) I_σ^{−1} δ] = E[g_σ(y|x, θ̂_n)/g] I_σ^{−1} δ.
We know that |θ̂_n − θ_0|_2 = O_p(k^λ/√n) and k^{2λ+1}/√n → 0, so |θ − θ_0|²_2 = o_p(n^{−(λ+1)/(2λ+1)}) = o(n^{−1/2}).
And we know that E[g_σ(y|x, θ̂_n)/g_0] = 0. So
E[g_σ(y|x, θ̂_n)/g] = E[g_σ(y|x, θ̂_n)(1/g − 1/g_0)].
By the L1 Lipschitz condition, |g − g_0| ≤ C_1(x, y)|θ − θ_0|_2 and |g_σ(·|θ̂_n) − g_σ(·|θ_0)| ≤ C_2(x, y)|θ − θ_0|_2.
And such coefficients have nice tail properties. Therefore,
E[g_σ(y|x, θ̂_n)(1/g − 1/g_0)] = E[g_σ(y|x, θ_0)(∂g/∂θ / g²)(θ̂_n − θ_0)] + O(|θ̂_n − θ_0|²_2).
Then E[(∂l(θ̂_n, Z)/∂θ)[π_n v*]] = E[(θ̂_n − θ_0)'(∂g/∂θ / g²) g_σ(y|x, θ_0)' δ] + o_p(n^{−1/2}).
The first part of the equation above is ⟨θ̂_n − θ_0, π_n v*⟩.
Condition C19. (i) µ_n( (∂l(θ_0, Z)/∂θ)[π_n v* − v*] ) = o_p(n^{−1/2}).
(ii) E[(∂l(θ_0, Z)/∂θ)[π_n v*]] = o(n^{−1/2}).
(i) is obvious since v* = π_n v*; (ii) is also obvious since it equals 0.
Condition C20.
G_n{ (∂l(θ_0, Z)/∂θ)[v*] } →_d N(0, σ_v²),
with σ_v² > 0.
This assumption is also not hard to verify, because (∂l(θ_0, Z)/∂θ)[v*] is a random variable with variance bounded from above. It is not degenerate because E[g_σ g_σ'] is positive definite.
Then by Theorem 4.3 in Chen (2008), for any δ, our sieve estimator satisfies:
√n δ'(σ_n − σ_0) →_d N(0, σ_v²).
Since the above conclusion is true for arbitrary δ, √n(σ_n − σ_0) must be jointly asymptotically normal, i.e., there exists a positive definite matrix V such that √n(σ_n − σ_0) →_d N(0, V).
□
B.4. Lemmas and Theorems in Appendix I. In this subsection we prove the Lemmas and
Theorems stated in section 5.
Proof of Theorem 2.
Proof. For simplicity, we assume that φ_ε(s) is a real function; the proof is similar when it is complex. Then,
F(xβ, y, q) = 1/2 − τ + ∫_0^q ( sin((y − xβ)s) / (πs) ) ( 1 / φ_ε(s) ) ds.
Under condition OS, |F(xβ, y, q)| ≤ C(q^{λ−1} + 1) if λ ≥ 1, for some fixed C and q large enough.
Under condition SS, |F(xβ, y, q)| ≤ C_1 exp(q^λ/C_2) for some fixed C_1 and C_2.
Let h(x, y) := E[∫_0^{q_n} ( cos((y − xβ)s) / π ) ( 1 / φ_ε(s) ) ds].
By a change of the order of integration, E[∫_0^{q_n} ( cos((y − xβ)s) / π ) ( 1 / φ_ε(s) ) ds] ≤ C.
Let σ(x)² = E[F(xβ, y, τ, q)²|x].
Under condition OS, by a similar computation, C_1 q^{2(λ−1)} ≤ σ(x)² ≤ C_2 q^{2(λ−1)}, and under condition SS, C_1 exp(q^λ/C_3) ≤ σ(x)² ≤ C_2 exp(q^λ/C_4), for some fixed C_1, C_2, C_3, C_4.
For simplicity, let m(λ) = q^{λ−1} if the OS condition holds, and let m(λ) = exp(q^λ/C) if SS holds, with constant C.
β solves:
E_n[x ( h(x) / σ(x)² ) F(xβ, y, τ, q)] = 0.   (B.14)
Adopting the usual local expansion:
E_n[ ( h(x, y) / σ(x)² ) (F(xβ, y, τ, q) − F(xβ_0, y, τ, q)) ] = −{ E_n[ ( h(x, y) / σ(x)² ) F(xβ_0, y, τ, q) ] − E[ ( h(x, y) / σ(x)² ) F(xβ_0, y, τ, q) ] } − E[ ( h(x, y) / σ(x)² ) F(xβ_0, y, τ, q) ].   (B.15)
The left-hand side equals:
(β − β_0) E[ x'x h(x, y)² / σ(x)² ](1 + o_p(1)) = (q²/m(λ)²) E[l(x, q) x'x](β − β_0)(1 + o_p(1)).
l(x, q) is a bounded function converging uniformly to l(x, ∞).
The first term of the RHS is:
−(1/√n) G_n[ x h(x, y) F(xβ, y, τ, q) / σ(x)² ] = O_p( 1 / (√n m(λ)) ).
The second term of the RHS is:
E[ ( h(x) / σ(x)² ) F(xβ_0, y, τ, q) ] = O( q / m(λ)² ).
Therefore, β − β_0 = O_p( m(λ)/√n ) + O( 1/q ).
Under SS, the optimal q level is (log n)^{1/λ}, and the optimal convergence speed is therefore (log n)^{−1/λ}. Under the OS condition, the optimal q is n^{1/(2λ)}, and the optimal convergence speed is n^{−1/(2λ)}.
□
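As a short worked check of these choices (not in the original text, but consistent with the rates just stated): under OS, m(λ) = q^{λ−1}, and balancing the two terms m(λ)/√n ≍ 1/q gives q^λ ≍ √n, i.e., q ≍ n^{1/(2λ)}, with resulting speed 1/q ≍ n^{−1/(2λ)}. Under SS, m(λ) = exp(q^λ/C), and the same balancing gives exp(q^λ/C) ≍ √n/q, so q is of order (log n)^{1/λ} and the speed is (log n)^{−1/λ}.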